Pandas Tools

Joe Tatusko

Histogram Plotting in Python: NumPy, Matplotlib, Pandas & Seaborn Joe Tatusko 05:59

In addition to its plotting tools, Pandas also offers a convenient .value_counts() method that counts how often each non-null value occurs and returns the result as a Pandas Series, sorted from most to least frequent:

>>> import numpy as np
>>> import pandas as pd

>>> data = np.random.choice(np.arange(10), size=10000,
...                         p=np.linspace(1, 11, 10) / 60)
>>> s = pd.Series(data)

>>> s.value_counts()
9    1831
8    1624
7    1423
6    1323
5    1089
4     888
3     770
2     535
1     347
0     170
dtype: int64

>>> s.value_counts(normalize=True).head()
9    0.1831
8    0.1624
7    0.1423
6    0.1323
5    0.1089
dtype: float64
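Because .value_counts() sorts by frequency, the result is not in bin order. The sketch below shows one way to reorder the counts by value and to check that the normalized counts sum to 1. The seeded generator `rng` and the names `counts`, `by_value`, and `freqs` are assumptions added for reproducibility and illustration; they are not part of the original example.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator so the sketch is reproducible.
rng = np.random.default_rng(0)
data = rng.choice(np.arange(10), size=10000,
                  p=np.linspace(1, 11, 10) / 60)
s = pd.Series(data)

# Default order: most frequent value first.
counts = s.value_counts()

# Reorder by the values themselves (0, 1, ..., 9), like histogram bins.
by_value = counts.sort_index()

# normalize=True returns relative frequencies instead of raw counts.
freqs = s.value_counts(normalize=True)

print(by_value)
print(freqs.sum())
```

Sorting by index is usually what you want before plotting, since it puts the bars in the same order as the underlying values.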

Elsewhere, pandas.cut() is a convenient way to bin values into arbitrary intervals. Let’s say you have some data on ages of individuals and want to bucket them sensibly:

>>> ages = pd.Series(
...     [1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
>>> bins = (0, 10, 13, 18, 21, np.inf)  # The edges
>>> labels = ('child', 'preteen', 'teen', 'military_age', 'adult')
>>> groups = pd.cut(ages, bins=bins, labels=labels)

>>> groups.value_counts()
child           6
adult           5
teen            3
military_age    2
preteen         1
dtype: int64
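It can help to see which interval each edge value lands in. By default pandas.cut() treats each bin as closed on the right, which is why an age of 10 counts as 'child' above. The following sketch, using the same ages and bin edges, leaves out the labels to expose the intervals and then passes right=False to close the bins on the left instead. The names `intervals` and `left_labeled` are illustrative only.

```python
import numpy as np
import pandas as pd

ages = pd.Series(
    [1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
bins = (0, 10, 13, 18, 21, np.inf)  # The edges
labels = ('child', 'preteen', 'teen', 'military_age', 'adult')

# Without labels, each value is tagged with its interval.
# The default right=True makes the bins (0, 10], (10, 13], and so on.
intervals = pd.cut(ages, bins=bins)
print(intervals.cat.categories)

# right=False makes the bins [0, 10), [10, 13), and so on,
# so the ages 10 and 18 each move up one group.
left_labeled = pd.cut(ages, bins=bins, labels=labels, right=False)
print(left_labeled.value_counts())
```

Which side to close is a judgment call about the data; the point is only that the edge values move between groups when you change it.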

>>> pd.concat((ages, groups), axis=1).rename(columns={0: 'age', 1: 'group'})
    age         group
0     1         child
1     1         child
2     3         child
3     5         child
4     8         child
5    10         child
6    12       preteen
7    15          teen
8    18          teen
9    18          teen
10   19  military_age
11   20  military_age
12   25         adult
13   30         adult
14   40         adult
15   51         adult
16   52         adult
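For contrast, here is a sketch of what happens when you let pandas.cut() choose the edges. Passing an integer as bins asks for that many equal-width bins spanning the data, which groups these ages quite differently from the hand-picked edges above. The choice of 5 bins and the names `equal_width` and `equal_counts` are assumptions made for the comparison.

```python
import pandas as pd

ages = pd.Series(
    [1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])

# An integer asks pd.cut for that many equal-width bins
# covering the range from the smallest to the largest age.
equal_width = pd.cut(ages, bins=5)

# Count the members of each bin, listed in bin order.
equal_counts = equal_width.value_counts().sort_index()
print(equal_counts)
```

Equal-width bins are convenient for a quick look at a distribution, while explicit edges are the better fit when the categories have meaning of their own, as with age groups.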

What’s nice is that both of these operations ultimately utilize Cython code that makes them competitive on speed while maintaining their flexibility.

00:00 Now that you can use a variety of graphing libraries for your histograms, we’ll cover some tools available in Pandas to give you some more control over them.

00:10 The first one is the .value_counts() method, which computes a histogram from your data and turns it into a Pandas Series. So from a little dataset, which you can just say np.random.choice(),

00:27 and then do a NumPy arange(),

00:35 10000 values,

00:43 and set p=np.linspace() from 1 to 11, and 10 vals, then divide that by 60. And then say s is just going to equal a Pandas Series() from data.

01:12 You can go ahead and print s and then just call .value_counts(). And when you take a look at this, and if I open this up, you can see the frequency of each value and how often it appears in that dataset. This is similar to before, when we turned these into dictionaries.

01:37 But by turning them into Pandas Series, you get a couple more options. Because they’re a Series, you’re free to use any method that you would use on a regular Pandas Series.

01:50 So if I just called .head() on that, you can see that only the first five results show up now. Another nice built-in thing to .value_counts() is the ability to normalize the data, which, if you set that equal to True,

02:17 just goes ahead and converts the counts into relative frequencies, so each one falls between 0 and 1 and together they sum to 1. The big thing about .value_counts() is that it returns that Pandas Series, which gives you a lot of flexibility for any further processing or graphing that you need to do with that data. Another tool in Pandas is pandas.cut().

02:34 I’m going to go ahead and make a Series called ages. It’s going to have quite a bit going on in here. We’ll say [1, 1]…

03:15 And actually, bring this over. And then assign bins to a list.

03:40 Define some labels.

04:01 And then put these ages into the groups. So you can just say groups = pd.cut(), pass in ages, set the bins=bins, and the labels=labels.

04:19 Then I’m just going to go ahead and take groups and print the .value_counts() from that. But before I run that, let’s take a look at what’s happening here.

04:28 You can see that you have five different labels and six different bin edges. These bins are actually the bin edges, so there’s always one more edge than there are labels. So a 'child' would be from 0 to 10, 'preteen' would be from 10 to 13, and so on.

04:43 Calling cut() here will then assign each of the ages in this Series to the bin that they belong to. And then using .value_counts() will print it out.

04:54 So let’s take a look and see. Bring this up. And yeah, everything’s been categorized correctly. You’ve got 6 children, 5 adults, 3 teens, and so on.

05:12 Everything we’ve looked at up until this point has set the bins automatically based on the dataset, which makes all the bins the same width, using either default values or the number of bins that you specify.

05:26 Using pandas.cut() allows you to set your own bin edges, which is very useful for things like this, where ages don’t necessarily fall into even ranges. All right!

05:37 Now you know a couple different ways to make histograms, how to plot those histograms, and some different tools in Pandas to change up how you actually produce those histograms.

05:46 This is quite a lot of info to take in, so in the next video, we’re going to summarize everything we talked about and talk about which applications are best for each method. Thanks for watching.
