Statistical Transformations: factor()

00:00 In this lesson, I want to show you another statistical transformation, specifically the factor, and how you can use it.

00:07 Let’s start again with the same dataset. From plotnine.data, import the huron dataset, and from plotnine itself, along with ggplot and aes, you also want the box plot geometry,

00:22 geom_boxplot. Now, if you look at this data again with huron.head(), you’ll remember that it has three columns: the year, the water level, and the decade that the year falls in.

00:36 You didn’t do anything with the decade so far, but in this example, we’re going to look at this decade.

00:42 And if you run a box plot that puts decade and level together, it might be interesting to see how the water levels changed in specific decades.

00:53 So let’s quickly create a plot for this. I’m going to say ggplot(huron). I’m going to add the aesthetics, where I’ll say x="decade" and y="level", which is the water level.

01:12 And then let’s make a box plot.

01:17 Okay, so this is kind of a disappointing result that doesn’t really tell you anything. As you can see, plotnine didn’t really know what to do with the decades that you passed in. Because the decade column holds plain numbers, plotnine treats the x-axis as one continuous range that, if I remember correctly, spans about 100 years.

01:33 It goes from 1870 up to 1970, but the data isn’t grouped by those decade values. So what plotnine does by default here is lump all the data together and give you one big box plot that spans all of the 100 years, which doesn’t give you much information and isn’t really what you’re looking for.

01:54 So what you can do here is apply a statistical transformation right inside here. When you’re assigning the aesthetics to the x-axis, you can say "factor(decade)".

02:04 So this is another way to apply a statistical transformation: factor() tells plotnine to treat the decade values as discrete categories and to group the levels by decade. Just by adding factor(), plotnine can now recognize the decades as specific individual bins, so to say, and display the data for each of them.

02:28 And now you can see an interesting variability, where you can actually see the trend of how the water levels changed over those 100 years. So, this is a useful thing to know, that you can apply statistical transformations like this.

02:42 And you might be wondering, “What’s actually happening underneath this shiny API that plotnine provides?” You might go looking for the factor() function. Even if you figure out that pandas underlies this whole structure, you won’t find a factor() function in there, because plotnine’s API is modeled on ggplot2, which is a package for the R programming language, and factor() comes from R.

03:05 So, you might want to know, “How does this translate into pandas?” You’re going to explore this a little bit in the next lesson.
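As a rough preview, here is a pandas-only sketch. It assumes that factor(decade) behaves roughly like converting the column to a pandas categorical, which is an assumption of this sketch and not something the lesson confirms. The small DataFrame and the `decade_cat` column name are hypothetical; the rows are made up and only shaped like the huron data.

```python
import pandas as pd

# Hypothetical rows shaped like the huron data; values are made up.
df = pd.DataFrame({
    "year": [1875, 1876, 1880, 1881],
    "level": [580.4, 581.9, 580.0, 579.6],
    "decade": [1870, 1870, 1880, 1880],
})

# Turn the numeric decade into discrete categories.
df["decade_cat"] = df["decade"].astype("category")

# Each category now forms its own group, much like one box per decade.
medians = df.groupby("decade_cat", observed=True)["level"].median()
print(medians)
```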
