Statistical Transformations

Graph Your Data With Python and ggplot Martin Breuss 04:49

00:00 In this lesson, you’ll start learning about statistical transformations, another layer that you can add to your plot. You’ll start by looking into data aggregation. For this, you’ll use a new dataset, which you can import with from plotnine.data import huron. This dataset holds information about the water levels of Lake Huron.

00:25 Let’s actually look at it, huron.head().

00:29 You can see it holds information about the water levels in specific years, and there’s also a column for the decade that each year falls in, but for now you’re not going to work with that one.

00:40 You’re just going to be interested in the year and the level of the water. Let’s go ahead and make a plot from that. Again, you’ll need to import ggplot() for the data and aes() for the aesthetics from plotnine, plus geom_bar(), because this one is going to be a bar plot.

00:58 And you’ll see that I’m going to add in the statistical transformation in just a second, but let’s look first at what this would be like: a plot that we just make, a bar plot without any statistical transformations.

01:53 You can see that it’s quite precise, with even some decimal places in there. The most often that any one specific water level occurred was 3 times.

02:03 And this doesn’t really tell you much about the actual distribution of water levels in the lake, and this is where something like a statistical transformation can come in handy. So, you can add another one here and import stat_bin,

02:20 and then just add this layer in here. I’m going to put it before the geometric object, but you could theoretically put it anywhere. So I’m going to say stat_bin() and set bins=10, which just means that it’s going to group all of the data that’s in the level column into ten bins.

02:39 And so it’s going to bunch the data together, and you will see how many measurements fall into each of those bins, so this gives you a much better overview of how the water levels are actually distributed.
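To make the idea of binning concrete, here is a simplified pure-Python sketch that counts values in equal-width intervals. The function name bin_counts and the sample values are made up for illustration, and plotnine may place its bin edges differently, so treat this as a picture of the concept rather than of the library’s internals:

```python
def bin_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical, so they all land in the first bin.
        return [len(values)] + [0] * (bins - 1)
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


# Made-up levels for illustration only, not taken from the huron dataset.
sample_levels = [576.0, 577.5, 578.2, 579.1, 579.3, 579.8, 580.4, 581.9]
counts = bin_counts(sample_levels, bins=3)
print(counts)
```

Each count becomes the height of one bar, which is why the binned plot shows the shape of the distribution instead of one thin bar per exact value.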

02:51 You can see that most often the level is somewhere just below 580. And then, yeah, you can see this is the second most common bin. So this graph gives you a much better overview of the actual distribution of the water levels across the time that this data was recorded.

03:09 This is how aggregation and statistical transformations can come in really handy. And for this specific one (working with bins, which is one of the most common statistical transformations that you’re going to apply), there’s actually a specific type of plot that you can use instead that abstracts this statistical layer away again.

03:30 So you can do the same thing by saying from plotnine import geom_histogram,

03:38 and then build the graph like this. I’m going to say ggplot(), pass in the Lake Huron data. Add the aesthetics same as before, x="level".

03:52 And now, instead of passing in the stats layer and then the geom_bar() layer,

04:00 you can say geom_histogram() and then pass in the bins right here. So you could say bins=10 here, and run this, and you get the same plot. So, as you can see, there’s a couple of abstractions going on. In this case, you can use the histogram geometric object and pass the number of bins to it directly, which takes away the need for the additional statistics layer.

04:24 But you can also, obviously, use this one and play around with it. So if I change the number of bins that you want to see here, you can see that the graph in your output changes accordingly.

04:36 This is how you use statistical transformations for aggregation using plotnine or ggplot. And in the next lesson, you’re going to look at some more statistical transformations that you can apply.
