Anscombe's Quartet Revisited
To follow along at this point in the lesson, you can use the following code:
import pandas as pd
# Anscombe's Quartet
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
I = pd.DataFrame([x, y1], index=["x", "y1"]).T
II = pd.DataFrame([x, y2], index=["x", "y2"]).T
III = pd.DataFrame([x, y3], index=["x", "y3"]).T
IV = pd.DataFrame([x4, y4], index=["x4", "y4"]).T
00:00 Before we dig into this layer by layer, I want to give a quick throwback to this Anscombe’s quartet that you learned about in the first lesson of this course.
00:10 Now, here’s some data that makes up these four datasets, which all have nearly the same summary statistics but produce very different plots. I want to show you how quickly you can plot these using plotnine.
00:24 Now you can get this data from the description of this course lesson if you want to run it as well, but you can also just watch. So, I have to import pandas explicitly: it comes as a dependency with plotnine, but it’s not automatically imported, of course.
00:39 And now you can see I have these datasets, and if you call .describe() on them, you would see what we saw before.
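As a minimal sketch of that comparison, the snippet below rebuilds the first two datasets and prints their summary statistics side by side. It builds the DataFrames from dictionaries rather than with the transposed construction in the listing above, which is only a convenience here, and the names corr_1 and corr_2 are illustrative.

```python
import pandas as pd

# First two datasets of Anscombe's quartet (same values as the listing above)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

I = pd.DataFrame({"x": x, "y1": y1})
II = pd.DataFrame({"x": x, "y2": y2})

# Count, mean, standard deviation, and quartiles for each dataset
print(I.describe())
print(II.describe())

# Pearson correlation between x and y in each dataset
corr_1 = I["x"].corr(I["y1"])
corr_2 = II["x"].corr(II["y2"])
print(corr_1, corr_2)
```

The means, standard deviations, and correlations come out almost identical, while the quartiles differ somewhat, which is about as much as the numbers alone give away.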
00:55 You could compare these values and see that they’re very similar—the statistical values—if not the exact same. But now, if you take a different approach and you actually go ahead and visualize these datasets—using plotnine, in this case—you can very quickly see a difference.
01:13 So I need to import from plotnine the ggplot, the aesthetics, and the geometric object. With ggplot, with this first one, I can add the data layer, so to say.
01:26 And this is the syntax that you can use. You can say ggplot() and pass in the data. So here, I’m passing in the pandas DataFrame as the data layer.
01:37 Then, you’re adding the aesthetics layer, where you define the mappings. x is going to map to "x" here, and y is going to map to "y1", in this case.
01:49 So, you want to plot this first dataset.
01:54 And now, if I execute this,
01:57 you can see the plot popping up here. And it looks a certain way, okay. One plot alone doesn’t tell you much yet, but now if you make the second one…
02:07 In the same way, I’m just going to say ggplot(), but pass in the second dataset. I’m going to say + aes() (plus aesthetics), where I’m going to map x to "x" and y to "y2", in this case.
02:24 And finally, you need to define the geometric objects, and this is just going to be a point plot. So if I run this, you right away see that this dataset has a completely different distribution of the values actually.
02:38 So something that was basically impossible to see by just the statistical descriptions, you can very easily distinguish by a quick plot that doesn’t take more than one import line and then three lines of code for each plot.
02:54 So, you can play around with this a bit more. Also, you can plot the other ones. You can plot number III and number IV and compare them, and if you want, research a little how you can change the colors and size of these dots.
03:08 So, see you in the next lesson, where you’re going to start looking at the data layer in a bit more detail.