In this lesson, you’ll start learning about statistical transformations, another layer that you can add to your plot. You’ll start by looking into data aggregation. For this, you’ll use a new dataset, which you can import again
from plotnine.data import huron, and that’s the dataset that gives information about Lake Huron.
00:29 You can see it holds information about the water levels in specific years and then there’s also a column about the decades that these years are in, but for now you’re not going to work with this.
You’re just going to be interested in the year and the level of the water. Let’s go ahead and make a plot from that. Again, you will need to import
ggplot() for the data,
aes() for the aesthetics, and then let’s make a bar plot for this one.
00:58 And you’ll see that I’m going to add in the statistical transformation in just a second, but let’s look first at what a bar plot looks like without any statistical transformations.
So, we do the dance of adding
ggplot(), the data, plus the aesthetics. And now you actually just need to map one of the aesthetics, so I’m going to map
"level", so this column here. And because I’m using a bar plot here,
geom_bar(), it will automatically count the occurrences of each distinct value. And this is not going to produce a very useful plot, as you can see here, because it looks for individual values, and this water level is very variable, so it’s rare for several years to have exactly the same water level.
And this doesn’t really tell you much about the actual distribution of water levels in the lake, and this is where something like a statistical transformation can come in handy. So, you can add another import here, stat_bin,
and then just add this layer in here. I’m going to put it before the geometric object, but you could theoretically put it anywhere. So I’m going to say
stat_bin() and set the number of bins to
10, which just means that it’s going to group all of the data that’s in the
level column, so that there are only ten bins.
02:51 You can see that most often the level is somewhere just below 580. And then, yeah, you can see this is the second most frequent bin. So this graph gives you a much better overview of the actual distribution of the water levels across the time that this data was recorded.
03:09 This is how aggregation and statistical transformations can come in really handy. And for this specific one, working with bins, which is one of the most common statistical transformations that you’re going to apply, there’s also a specific type of plot that you can use instead, one that abstracts this statistical layer away again.
You can say
geom_histogram() and then pass in the bins right here. So you could say
bins=10 here, and run this, and you get the same plot. So, as you can see, there are a couple of abstractions going on. In this case, you can use the histogram geometric object and pass in the bins, which removes the need for the additional statistics layer.
04:24 But you can also, obviously, use this one and play around with it. So if I change the number of bins that you want to see here, you can see that the output graph also changes accordingly.
This is how you use statistical transformations for aggregation using plotnine or
ggplot. And in the next lesson, you’re going to look at some more statistical transformations that you can apply.