Advanced Plotting
Plotting arrays with randomly generated numbers is great for learning, but the real fun comes when you can visualize large sets of data. In this video, you’ll be working with a very large file that contains macroeconomic California housing data.
You’re going to use numpy to extract only what you need into one-dimensional arrays. Then, you’ll plot that data with matplotlib and learn more about advanced grid spacing.
00:00 Plotting arrays of randomly generated numbers is great for learning, but the real fun comes when we can visualize large sets of real data. In this video, we’ll be working with a very large file that contains macroeconomic California housing data.
00:19 To give you an idea of just how big this file is, it’s over 2 million bytes of just plain text. We’re going to use NumPy to extract only what we need into one-dimensional arrays, and then we’ll plot that data with Matplotlib and learn more about advanced grid spacing.
00:39 I’m here in a file called plot3.py, and as you can see, I’ve already written some code. Because this is a course on Matplotlib, I’m not going to explain this code line by line.
00:52 Basically, all this code does is download a TAR file, or an archive file, from the internet. It then extracts that archive and fetches the cal_housing.data file, which is that crazy large file that we saw before. Then I use numpy to load the file into a two-dimensional ndarray, where the first dimension indexes the lines of the file and the second dimension indexes the values on each line, split on the comma (',') delimiter.
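As a rough illustration of that loading step, here is a minimal sketch that uses a tiny in-memory stand-in instead of the real download. The `sample` text and its values are made up for this example; only `np.loadtxt()` with a `delimiter` argument comes from the lesson.

```python
from io import StringIO

import numpy as np

# Hypothetical three-line stand-in for a comma-delimited data file.
sample = StringIO(
    "1.0,2.0,3.0\n"
    "4.0,5.0,6.0\n"
    "7.0,8.0,9.0\n"
)

# loadtxt splits every line on the delimiter and stacks the lines as rows.
data = np.loadtxt(sample, delimiter=",")

print(data.ndim)   # number of dimensions
print(data.shape)  # (number of lines, values per line)
```

The first axis of the result corresponds to lines and the second to the values within a line, which is the layout the later slicing relies on.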
01:26 But now I need to extract only certain data from each line, and so I take the last element from each line (that is, the last position along the inner dimension of the housing array) and I store it in a new ndarray called value.
01:42 I do something similar with pop, short for population, and age, which are index 4 and 7, respectively.
01:51 I transpose this data so that I end up with two separate ndarrays, one for population and one for age. At this point, I would highly, highly recommend writing this code in your editor and inspecting the elements of housing, value, pop, and age for yourself.
02:13 It will help you to understand how numpy is used to manipulate the data file. This is one of those things that you can really only understand by doing it yourself, especially if you’re not experienced with numpy or file archives.
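For readers who want to inspect the slicing without downloading anything, the following sketch applies the same indexing to a small made-up array. The stand-in `housing` array here is an assumption for illustration: each entry is simply ten times its row index plus its column index, so the column each value came from is easy to read off.

```python
import numpy as np

# Hypothetical 3x9 stand-in for the housing array:
# the entry in row r, column c is r * 10 + c.
housing = np.arange(3)[:, None] * 10 + np.arange(9)

# The last column of every row.
value = housing[:, -1]

# Columns 4 and 7 of every row give a (3, 2) array; transposing it yields
# two rows that unpack into two one-dimensional arrays.
pop, age = housing[:, [4, 7]].T

print(value)  # entries taken from column 8
print(pop)    # entries taken from column 4
print(age)    # entries taken from column 7
```

Without the transpose, the unpacking would try to split the selection by row rather than by column, which is why `.T` is needed.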
02:30 At this point, we have the data we need in three separate one-dimensional ndarrays: value, pop, and age. Before we start plotting, I’m going to define a new function called add_innerbox() that we can use to add an inner box to any of the Axes we’ll create later. Rather than being a traditional title that lives above the Axes, this title box will literally sit inside of the box representing the Axes.
03:02 This function will take the Axes to manipulate as well as the text to put in the inner box.
03:11 I’ll call the .text() method on the Axes object and I’ll give some x and y coordinates for the positioning of the box.
03:20 I’ll give it the text to apply, and now we just have to set a few more properties so our text is easy to read. First is horizontalalignment, which I will set to 'center'.
03:34 Next is the transform mode, which I will set to ax.transAxes. This will make the coordinates we entered earlier relative to the bounds of the Axes.
03:47 Next is the bbox, which is a dictionary specifying the face color and the opacity, or the value of the alpha channel. Finally, I’ll set the fontsize to 12.5. And that’s it for the function.
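The function described above might look roughly like the sketch below. The coordinates (.55, .8) and the white face color with an alpha of 0.6 are taken from the code a commenter posted further down this page, not from the narration. Returning the text object and selecting the non-interactive "Agg" backend are additions made here so the example can run and be inspected without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend, drop this line for a window
import matplotlib.pyplot as plt


def add_innerbox(ax, text):
    """Draw `text` in a semi-transparent box inside the bounds of `ax`."""
    # Returning the Text object is an addition for inspection purposes.
    return ax.text(
        0.55, 0.8, text,
        horizontalalignment="center",
        transform=ax.transAxes,  # coordinates are fractions of the Axes
        bbox=dict(facecolor="white", alpha=0.6),
        fontsize=12.5,
    )


fig, ax = plt.subplots()
label = add_innerbox(ax, "Example inner box")
```

Because of `transform=ax.transAxes`, (0, 0) is the lower-left corner of the Axes and (1, 1) is the upper-right, so the box stays in the same relative spot whatever the data limits are.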
04:04 We want to create a layout with one big Axes at the top and two smaller ones at the bottom. That will look something like this. What we actually have here is a 3 by 2 grid where ax1 is twice the height and width of ax2 and ax3, meaning that it takes up two columns and two rows. To create this, I’ll start by defining a new tuple called gridsize, which I will set to (3, 2). In the past, we’ve used the subplots() function to get both the Figure and the Axes. For this plot, I’m going to use the figure() function to get just the Figure object. That looks like this: fig = plt.figure(), and I’ll give it a figsize of 12 by 8, four times the gridsize.
05:05 Now we can grab each Axes individually with the subplot2grid() function. I’ll write ax1 = plt.subplot2grid(), passing in the gridsize, the coordinates for this Axes, the colspan, and the rowspan. ax1 is the big Axes at the top, so it’s occupying a total of four grid spaces.
05:35 We can create ax2 and ax3 in a similar way, except we don’t need to manually specify colspan and rowspan.
05:45 That will default to 1 by 1.
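Put together, the layout steps above can be sketched as follows. The "Agg" backend line is an assumption added so the sketch runs without a display. The calls themselves follow the narration and the corrected line from the comments, including the comma after the (0, 0) coordinates.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend, drop this line for a window
import matplotlib.pyplot as plt

gridsize = (3, 2)  # 3 rows, 2 columns
fig = plt.figure(figsize=(12, 8))

# The big Axes starts at row 0, column 0 and spans two rows and two columns.
ax1 = plt.subplot2grid(gridsize, (0, 0), colspan=2, rowspan=2)

# The two small Axes sit in the last row; colspan and rowspan default to 1.
ax2 = plt.subplot2grid(gridsize, (2, 0))
ax3 = plt.subplot2grid(gridsize, (2, 1))
```

The second argument is a (row, column) pair counted from the top-left cell, so (2, 1) is the bottom-right cell of the grid.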
05:49 Now that we’ve got our Axes created and positioned, we can modify them as usual. I’ll start by setting the title of the big Axes and I’ll make it 'Home value as a function of home age (x) & area population (y)', with a fontsize of 14.
06:11 I’ll create a new variable called sctr (scatter) and set that equal to ax1.scatter(), passing in age for the x data, pop for the y data, value for the color, and 'RdYlGn' (Red-Yellow-Green) for the cmap (color map).
06:32 This returns a collection that we can use to set the color bar. The color bar is called directly on the Figure, so I’ll write plt.colorbar(), passing in the sctr variable, ax1 for the Axes, and a format of '$%d', which means '$<integer>'.
06:58 I’ll also set the y scale for the Axes to logarithmic, which will help to make the data a little bit easier to interpret. All that’s left to do is configure ax2 and ax3.
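The scatter plot, color bar, and log scale can be tried out on their own with synthetic data, as in this sketch. The random `x`, `y`, and `c` arrays and the `cbar` name are made up for illustration and stand in for age, pop, and value. The "Agg" backend is again an assumption for running without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend, drop this line for a window
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the three one-dimensional arrays.
rng = np.random.default_rng(0)
x = rng.uniform(1, 50, 200)
y = rng.uniform(10, 10_000, 200)
c = rng.uniform(50_000, 500_000, 200)

fig, ax = plt.subplots()

# scatter() returns a collection that carries the color mapping.
sctr = ax.scatter(x, y, c=c, cmap="RdYlGn")

# The color bar is built from that collection; '$%d' formats each tick
# as a dollar sign followed by an integer.
cbar = plt.colorbar(sctr, ax=ax, format="$%d")

ax.set_yscale("log")
```

Passing `ax=ax` tells Matplotlib which Axes to take space from for the color bar, which matters once a figure holds more than one Axes.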
07:12 I’ll make both of these histograms, using the age and pop ndarrays respectively. I’ll set the bins to autoconfigure, and I’ll make ax3 use logarithmic scale. Finally, let’s add inner boxes to both ax2 and ax3 using our add_innerbox() function we defined earlier. The first one will say 'Histogram: home age' and the second will say 'Histogram: area population (log scl.)'.
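Here is a small sketch of the two histogram calls using synthetic data. The `ages` and `skewed` arrays and the Axes names are made up for this example, and the "Agg" backend is an assumption for running without a display. Note that `log=True` puts the count axis (y) on a log scale rather than the data axis.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend, drop this line for a window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
ages = rng.normal(30, 10, 500)     # hypothetical, roughly symmetric data
skewed = rng.lognormal(7, 1, 500)  # hypothetical, heavily right-skewed data

fig, (ax_left, ax_right) = plt.subplots(1, 2)

# bins='auto' lets NumPy choose the bin edges from the data.
counts, edges, _ = ax_left.hist(ages, bins="auto")

# log=True switches the count axis to a logarithmic scale.
ax_right.hist(skewed, bins="auto", log=True)
```

A log-scaled count axis keeps the sparse bins in a long tail visible next to the tall bins near the bulk of the data.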
07:49 And as always, we will use plt.show() to show the figure onscreen. And look at that! Exactly what we were expecting: expensive California housing.
ab on Feb. 12, 2021
Hey, I get the following error. Does someone know how to fix it? Thanks!
url = 'https://ndownloader.figshare.com/files/5976036'
b = BytesIO(urlopen(url).read())
fpath = 'CaliforniaHousing/cal_housing.data'
with tarfile.open(mode='r', fileobj=b) as archive:
    housing = np.loadtxt(archive.extractfile(fpath), delimiter=',')
value = housing[:, -1]
pop, age = housing[:, [4, 7]].T
def add_innerbox(ax, text):
    ax.text(.55, .8, text,
            horizontalalignment='center',
            transform=ax.transAxes,
            bbox=dict(facecolor='white', alpha=0.6),
            fontsize=12.5)
gridsize = (3,2)
fig = plt.figure(figsize=(12,8))
ax1 = plt.subplot2grid(gridsize, (0,0) colspan=2, rowspan=2)
ax2 = plt.subplot2grid(gridsize, (2,0))
ax3 = plt.subplot2grid(gridsize, (2,1))
ax1.set_title('Home value as a function of home age(x) & area population (y)',
              fontsize=14)
sctr = ax1.scatter(x=age, y=pop, c=value, cmap ='RdYlGn')
plt.colorbar(sctr, ax=ax1, format='$%d')
ax1.set_yscale('log')
ax2.hist(age, bins='auto')
ax3.hist(pop, bins='auto', log=True)
add_innerbox(ax2, 'Histogram: home age')
add_innerbox(ax3, 'Histogram: area population(log scl.)')
plt.show()
---------------------------------------
File "<ipython-input-75-5d4925586988>", line 24
ax1 = plt.subplot2grid(gridsize, (0,0) colspan=2, rowspan=2)
^
SyntaxError: invalid syntax
Bartosz Zaczyński RP Team on Feb. 12, 2021
@ab It looks like you’ve got a missing comma.
Expected:
ax1 = plt.subplot2grid(gridsize, (0,0), colspan=2, rowspan=2)
Actual:
ax1 = plt.subplot2grid(gridsize, (0,0) colspan=2, rowspan=2)
alberto10024 on Feb. 18, 2021
Hi..how do I modify the code to run it in a notebook? When I run the cell, with the last line:
add_titlebox(ax3, 'Histogram: area population (log scl.)')
I get a single empty object <AxesSubplot:>
Thanks!
alberto10024 on Feb. 20, 2021
Never mind about my earlier question. I sorted it out (there seems to have been a conflict with the code I entered earlier). Thanks!
yennjang on April 30, 2021
Hi, just want to highlight that there is a missing comma in the video on the following line:
Actual:
ax1 = plt.subplot2grid(gridsize, (0,0) colspan=2, rowspan=2)
Expected:
ax1 = plt.subplot2grid(gridsize, (0,0), colspan=2, rowspan=2)
I figured it out because the line with the missing comma just won’t run in my Jupyter notebook. So I looked through the Matplotlib documentation and found out that there should be a comma. Good learning experience though, realizing that we should be referring to the Matplotlib documentation while attempting this course, and not just relying on the video alone.
Bartosz Zaczyński RP Team on April 30, 2021
@yennjang Thanks for catching this 😊
Dawn0fTime on July 29, 2021
FYI this may fail initially on the Mac due to an SSL error. Open the Python folder for whichever version you’re using under Applications. Double-click ‘Install Certificates.command’.
mindconnect dot cc on March 31, 2023
I got SSLCertVerificationError on my Mac for the line
b = BytesIO(urlopen(url).read())
I fixed the problem, thanks to Dawn0fTime, as follows:
Open Finder and head over to /Applications/Python 3.*, and double-click on ‘Install Certificates.command’.
rinafleisch on Jan. 5, 2020
Thank you for explaining everything so well in regard to making more complex plots. It was extremely useful. I wanted to mention that I was curious as to why the newer homes had less value than the older homes. I had a look at the cal.housing.domain for a key to the entries, and it looks like what is actually being plotted is home value as a function of area median income (x, thousands?) and area total bedrooms (y). Nevertheless, it is a beautiful figure.