Advanced Plotting

Python Plotting With Matplotlib Austin Cepalia 08:03

Plotting arrays with randomly generated numbers is great for learning, but the real fun comes when you can visualize large sets of data. In this video, you’ll be working with a very large file that contains macroeconomic California housing data.

You’re going to use numpy to extract only what you need into one-dimensional arrays. Then, you’ll plot that data with matplotlib and learn more about advanced grid spacing.

00:00 Plotting arrays of randomly generated numbers is great for learning, but the real fun comes when we can visualize large sets of real data. In this video, we’ll be working with a very large file that contains macroeconomic California housing data.

00:19 To give you an idea of just how big this file is, it’s over 2 million bytes of just plain text. We’re going to use NumPy to extract only what we need into one-dimensional arrays, and then we’ll plot that data with Matplotlib and learn more about advanced grid spacing.

00:39 I’m here in a file called plot3.py, and as you can see, I’ve already written some code. Because this is a course on Matplotlib, I’m not going to explain this code line by line.

00:52 Basically, all this code does is download a TAR file, or an archive file, from the internet. It then extracts that file and fetches the cal_housing.data file, which is that crazy large file that we saw before. Then I use numpy to load the file into a two-dimensional ndarray, where the first dimension indexes the lines of the file and the second dimension indexes the values on each line, split on the comma (',') delimiter.
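As a rough illustration of the loading step just described, here is a minimal sketch that reads a comma-delimited file out of a TAR archive with np.loadtxt(). To stay runnable offline, it builds a tiny archive in memory instead of downloading one. The member path demo/sample.data and the numbers in it are made up for this example and are not part of the course data.

```python
import io
import tarfile

import numpy as np

# Hypothetical two-line stand-in for cal_housing.data (made-up numbers).
csv_bytes = b"1.0,2.0,3.0\n4.0,5.0,6.0\n"

# Build a tiny tar archive in memory so the sketch needs no network access.
buffer = io.BytesIO()
with tarfile.open(mode="w", fileobj=buffer) as archive:
    info = tarfile.TarInfo(name="demo/sample.data")  # hypothetical member path
    info.size = len(csv_bytes)
    archive.addfile(info, io.BytesIO(csv_bytes))
buffer.seek(0)

# Read the member straight out of the archive into a 2-D ndarray.
with tarfile.open(mode="r", fileobj=buffer) as archive:
    sample = np.loadtxt(archive.extractfile("demo/sample.data"), delimiter=",")

# First dimension: lines of the file. Second dimension: values on each line.
print(sample.shape)
```

With the real download, the in-memory buffer would presumably be replaced by the bytes fetched from the URL, and the member path by the one inside the real archive.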

01:26 But now I need to extract only certain data from each line, and so I take the last element from each line—or the inner dimension of the housing array—and I store it in a new ndarray called value.

01:42 I do something similar with pop, short for population, and age, which are at indices 4 and 7, respectively.

01:51 I transpose this data so that I end up with two separate ndarrays—one for population and one for age. At this point, I would highly, highly recommend writing this code in your editor and inspecting the elements of housing, value, pop, and age for yourself.

02:13 It will help you to understand how numpy is used to manipulate the data file. This is one of those things that you can really only understand by doing it yourself, especially if you’re not experienced with numpy or file archives.
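If you want to inspect the slicing without the real file, the following sketch applies the same indexing to a small made-up array. The 3-by-9 array of consecutive numbers is only an assumption chosen so that each value reveals which column it came from.

```python
import numpy as np

# Hypothetical stand-in for the housing array: 3 rows, 9 columns,
# filled with 0..26 so every value identifies its own position.
housing = np.arange(27, dtype=float).reshape(3, 9)

# Last element of every row.
value = housing[:, -1]

# Columns 4 and 7 of every row give a (3, 2) array. Transposing it to
# (2, 3) lets it unpack into two separate one-dimensional arrays.
pop, age = housing[:, [4, 7]].T

print(housing[:, [4, 7]].shape)  # (3, 2)
print(value)  # [ 8. 17. 26.]
print(pop)    # [ 4. 13. 22.]
print(age)    # [ 7. 16. 25.]
```

Without the transpose, the unpacking would try to split the three rows into two names and raise a ValueError.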

02:30 At this point, we have the data we need in three separate one-dimensional ndarrays, value, pop, and age. Before we start plotting, I’m going to define a new function called add_innerbox() that we can use to add an inner box to any of our Axes we’ll create later. Rather than being a traditional title that lives above the axes, this title box will literally sit inside of the box representing the Axes.

03:02 This function will take the Axes to manipulate as well as the text to put in the inner box.

03:11 I’ll call the .text() method on the Axes object and I’ll give some x and y coordinates for the positioning of the box.

03:20 I’ll give it the text to apply, and now we just have to set a few more properties so our text is easy to read. First is horizontalalignment, which I will set to 'center'.

03:34 Next is the transform mode, which I will set to ax.transAxes. This will make the coordinates we entered earlier relative to the bounds of the Axes.

03:47 Next is the bbox, which is a dictionary specifying the face color and the opacity, or the value of the alpha channel. Finally, I’ll set the fontsize to 12.5. And that’s it for the function.
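Put together, the function described above might look like the sketch below. The coordinates, face color, and alpha value are taken from the code a viewer posted in the comments. Returning the Text object and selecting the Agg backend are additions made here so the result can be inspected without opening a window.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend, not part of the lesson
import matplotlib.pyplot as plt


def add_innerbox(ax, text):
    """Place a semi-transparent text box inside the given Axes."""
    # Returning the Text object is an addition for easier inspection.
    return ax.text(
        0.55, 0.8, text,               # position in Axes coordinates (0 to 1)
        horizontalalignment="center",
        transform=ax.transAxes,        # makes x and y relative to the Axes bounds
        bbox=dict(facecolor="white", alpha=0.6),
        fontsize=12.5,
    )


fig, ax = plt.subplots()
label = add_innerbox(ax, "Example inner box")
print(label.get_position())  # (0.55, 0.8)
```

Because of transform=ax.transAxes, the box stays at the same relative spot no matter what data limits the Axes ends up with.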

04:04 We want to create a layout with one big Axes at the top, and two smaller ones at the bottom. That will look something like this. What we actually have here is a 3 by 2 grid where ax1 is twice the height and width of ax2 and ax3, meaning that it takes up two columns and two rows. To create this, I’ll start by defining a new tuple called gridsize, which I will set to (3, 2). In the past, we’ve used the subplots() function to get both the Figure and the Axes. For this plot, I’m going to use the figure() function to get just the Figure object. That looks like this: fig = plt.figure(), and I’ll give it a figsize of 12 by 8, four times the gridsize.

05:05 Now we can grab each Axes individually with the subplot2grid() function. I’ll write ax1 = plt.subplot2grid(), passing in the gridsize, the coordinates for this Axes, the colspan, and the rowspan. ax1 is the big Axes at the top, so it’s occupying a total of four grid spaces.

05:35 We can create ax2 and ax3 in a similar way, except we don’t need to manually specify colspan and rowspan.

05:45 That will default to 1 by 1.
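The layout described above can be sketched as follows, including the comma after the (0, 0) location that the comments point out is needed. The loop that prints each Axes' row and column span is an addition for checking the result, and the Agg backend is an assumption so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend, not part of the lesson
import matplotlib.pyplot as plt

gridsize = (3, 2)  # 3 rows, 2 columns
fig = plt.figure(figsize=(12, 8))

# The big Axes starts at row 0, column 0 and spans 2 rows and 2 columns.
ax1 = plt.subplot2grid(gridsize, (0, 0), colspan=2, rowspan=2)

# The small Axes sit in the last row; colspan and rowspan default to 1.
ax2 = plt.subplot2grid(gridsize, (2, 0))
ax3 = plt.subplot2grid(gridsize, (2, 1))

# Report which grid rows and columns each Axes occupies.
for name, ax in [("ax1", ax1), ("ax2", ax2), ("ax3", ax3)]:
    spec = ax.get_subplotspec()
    print(name, "rows:", list(spec.rowspan), "columns:", list(spec.colspan))
```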

05:49 Now that we’ve got our Axes created and positioned, we can modify them as usual. I’ll start by setting the title of the big Axes and I’ll make it 'Home value as a function of home age (x) & area population (y)', with a fontsize of 14.

06:11 I’ll create a new variable called sctr (scatter) and set that equal to ax1.scatter(), passing in age for the x data, pop for the y data, value for the color, and 'RdYlGn' (Red-Yellow-Green) for the cmap (color map).

06:32 This returns a collection that we can use to set the color bar. The color bar is called directly on the Figure, so I’ll write plt.colorbar() passing in the sctr variable, ax1 for the Axes, and a format of '$%d', which means '$<integer>'.

06:58 I’ll also set the y scale for the Axes to logarithmic, which will help to make the data a little bit easier to interpret. All that’s left to do is configure ax2 and ax3.
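Here is a minimal sketch of the scatter plot, color bar, and log scale from the last few steps. It uses randomly generated stand-in arrays rather than the housing data, so the ranges chosen for age, pop, and value are assumptions for illustration only. Note that plt.colorbar() is the pyplot wrapper that adds the color bar to the current Figure.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend, not part of the lesson
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-ins for the three one-dimensional arrays.
rng = np.random.default_rng(0)
age = rng.uniform(1, 52, size=200)
pop = rng.lognormal(mean=7, sigma=1, size=200)
value = rng.uniform(15_000, 500_000, size=200)

fig, ax1 = plt.subplots(figsize=(8, 5))

# c=value colors each point by its value, mapped through the colormap.
sctr = ax1.scatter(x=age, y=pop, c=value, cmap="RdYlGn")

# The returned collection tells the color bar which colors map to which values.
cbar = plt.colorbar(sctr, ax=ax1, format="$%d")

# A log scale on y spreads out data that spans several orders of magnitude.
ax1.set_yscale("log")
```

The color bar gets its own Axes inside the Figure, so fig.axes holds two entries after this runs.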

07:12 I’ll make both of these histograms, using the age and pop ndarrays respectively. I’ll set the bins to autoconfigure, and I’ll make ax3 use logarithmic scale. Finally, let’s add inner boxes to both ax2 and ax3 using our add_innerbox() function we defined earlier. The first one will say 'Histogram: home age' and the second will say 'Histogram: area population (log scl.)'.
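The two histogram calls can be sketched like this, again with made-up stand-in data instead of the housing arrays. In hist(), log=True puts the count axis on a logarithmic scale, while bins='auto' leaves the choice of bin edges to NumPy.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend, not part of the lesson
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-ins for the age and pop arrays.
rng = np.random.default_rng(1)
age = rng.uniform(1, 52, size=200)
pop = rng.lognormal(mean=7, sigma=1, size=200)

fig, (ax2, ax3) = plt.subplots(1, 2, figsize=(10, 4))

# hist() returns the bin counts, the bin edges, and the drawn patches.
age_counts, age_edges, _ = ax2.hist(age, bins="auto")

# log=True switches the y axis (the counts) of this Axes to a log scale.
pop_counts, pop_edges, _ = ax3.hist(pop, bins="auto", log=True)

print(len(age_edges) - 1, "bins for age;", len(pop_edges) - 1, "bins for pop")
```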

07:49 And as always, we will use plt.show() to show the figure onscreen. And look at that! Exactly what we were expecting: expensive California housing.

rinafleisch on Jan. 5, 2020

Thank you for explaining everything so well in regard to making more complex plots. It was extremely useful. I wanted to mention that I was curious as to why the newer homes had less value than the older homes. I had a look at the cal.housing.domain for a key to the entries, and it looks like what is actually being plotted is home value as a function of area median income (x, thousands?) and area total bedrooms (y). Nevertheless, it is a beautiful figure.

ab on Feb. 12, 2021

Hey, I get the following error. Does someone know how to fix it? Thanks!

url = 'https://ndownloader.figshare.com/files/5976036'
b = BytesIO(urlopen(url).read())
fpath = 'CaliforniaHousing/cal_housing.data'

with tarfile.open(mode='r', fileobj=b) as archive:
    housing = np.loadtxt(archive.extractfile(fpath), delimiter=',')

    value = housing[:, -1]
    pop, age = housing[:, [4, 7]].T

def add_innerbox(ax, text):
    ax.text(.55, .8, text,
           horizontalalignment='center',
           transform=ax.transAxes,
           bbox=dict(facecolor='white', alpha=0.6),
           fontsize=12.5)

gridsize = (3,2)
fig = plt.figure(figsize=(12,8))
ax1 = plt.subplot2grid(gridsize, (0,0) colspan=2, rowspan=2)
ax2 = plt.subplot2grid(gridsize, (2,0))
ax3 = plt.subplot2grid(gridsize, (2,1))

ax1.set_title('Home value as a function of home age(x) & area population (y)',
             fontsize=14)
sctr = ax1.scatter(x=age, y=pop, c=value, cmap ='RdYlGn')
plt.colorbar(sctr, ax=ax1, format='$%d')
ax1.set_yscale('log')
ax2.hist(age, bins='auto')
ax3.hist(pop, bins='auto', log=True)

add_innerbox(ax2, 'Histogram: home age')
add_innerbox(ax3, 'Histogram: area population(log scl.)')
plt.show()
---------------------------------------

File "<ipython-input-75-5d4925586988>", line 24
    ax1 = plt.subplot2grid(gridsize, (0,0) colspan=2, rowspan=2)
                                                 ^
SyntaxError: invalid syntax

Bartosz Zaczyński RP Team on Feb. 12, 2021

@ab It looks like you’ve got a missing comma.

Expected:

ax1 = plt.subplot2grid(gridsize, (0,0), colspan=2, rowspan=2)

Actual:

ax1 = plt.subplot2grid(gridsize, (0,0) colspan=2, rowspan=2)

alberto10024 on Feb. 18, 2021

Hi, how do I modify the code to run it in a notebook? When I run the cell with the last line:

add_titlebox(ax3, 'Histogram: area population (log scl.)')

I get a single empty object <AxesSubplot:>

Thanks!

alberto10024 on Feb. 20, 2021

Never mind about my earlier question. I sorted it out (there seems to have been a conflict with the code I entered earlier). Thanks!

yennjang on April 30, 2021

Hi, just want to highlight that there is a missing comma in the video on the following line:

Actual:

ax1 = plt.subplot2grid(gridsize, (0,0) colspan=2, rowspan=2)

Expected:

ax1 = plt.subplot2grid(gridsize, (0,0), colspan=2, rowspan=2)

I just figured it out because the line with the missing comma just won’t run in my Jupyter notebook. So I looked through the Matplotlib documentation and found out that there should be a comma. Good learning experience though, realizing that we should be referring to the Matplotlib documentation while attempting this course, and not just relying on the video alone.

Bartosz Zaczyński RP Team on April 30, 2021

@yennjang Thanks for catching this 😊

Dawn0fTime on July 29, 2021

FYI this may fail initially on the Mac due to an SSL error. Open the Python folder for whichever version you’re using under Applications. Double-click ‘Install Certificates.command’.

mindconnect dot cc on March 31, 2023

I got SSLCertVerificationError on my mac for the line

b = BytesIO(urlopen(url).read())

I fixed the problem, thanks to Dawn0fTime, as follows:

Open Finder and head over to /Applications/Python 3.*, and double click on ‘Install Certificates.command’.
