Statistical Transformations: factor()
00:00 In this lesson, I want to show you another statistical transformation, specifically factor(), and how you can use it.
00:07 Let’s start again with the same dataset. From plotnine… well, from plotnine.data, import the huron dataset, and then you also want to get the box plot,
00:22 geom_boxplot, from plotnine. Now, if you look at this data again with huron.head(), you’ll remember that it has three columns. It has the year, it has the water level, and then it has the decade that this year is in.
00:36 You didn’t do anything with the decade so far, but in this example, we’re going to look at this decade.
00:42 If you make a box plot that compares decade and level, putting those two together, it might be interesting to see how the water levels changed in specific decades.
00:53 So let’s quickly create a plot for this. I’m going to say ggplot(huron). I’m going to add the aesthetics, where I’ll say x="decade" and y="level", the water level.
01:12 And then let’s make a box plot.
01:17 Okay, so this is kind of a disappointing result that doesn’t really tell you anything. As you can see, plotnine didn’t really know what to do with the decades that you passed in, because it treats them as one continuous range of numbers. It spans about 100 years, if I remember correctly.
01:33 The values go from 1870 up to 1970, but the data isn’t grouped by them. So what plotnine does by default here is lump all the data together and give you one big box plot that spans all of the 100 years, which doesn’t give you much information and isn’t really what you’re looking for.
01:54 So what you can do here is apply a statistical transformation right inside the aesthetics. When you’re assigning the x-axis, you can say "factor(decade)".
02:04 This is another way to apply a statistical transformation. It tells plotnine to treat the decades as discrete categories and to group all of the levels together by decade. Just by adding factor(), you can see that plotnine now recognizes these as specific individual bins, so to speak, and displays the data for each of those decades.
02:28 And now you can see some interesting variability, and you can actually see the trend of how the water levels changed over the past 100 years. So, this is a useful thing to know, that you can apply statistical transformations like this.
02:42 And you might be wondering, “What’s actually happening underneath this shiny API that plotnine provides?” You might go looking for the factor() function, and if you figure out that pandas underlies this whole structure that is built up here, you won’t find a factor() function in there, because the whole API of plotnine is based on ggplot2, which is, in turn, a library for the R programming language.
03:05 So, you might want to know, “How does this translate into pandas?” You’re going to explore this a little bit in the next lesson.
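As a rough preview, and only as an assumption about the closest pandas counterpart, converting a numeric column to a categorical one gives you the same kind of discrete grouping. The small DataFrame below uses made-up values, not the real huron data.

```python
import pandas as pd

# Made-up illustrative values, not the real huron measurements.
df = pd.DataFrame(
    {
        "decade": [1870, 1870, 1880, 1880, 1890],
        "level": [580.4, 581.9, 580.1, 581.0, 579.3],
    }
)

# Convert the numeric decade column into a categorical column.
df["decade_cat"] = df["decade"].astype("category")

# Each distinct decade is now its own category.
categories = df["decade_cat"].cat.categories.tolist()
print(categories)

# Grouping by the categorical column summarizes the levels per decade.
medians = df.groupby("decade_cat", observed=True)["level"].median()
print(medians)
```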