Pandas Tools
In addition to its plotting tools, Pandas also offers a convenient .value_counts() method that computes a histogram of non-null values and returns it as a Pandas Series:
>>> import numpy as np
>>> import pandas as pd
>>> data = np.random.choice(np.arange(10), size=10000,
... p=np.linspace(1, 11, 10) / 60)
>>> s = pd.Series(data)
>>> s.value_counts()
9 1831
8 1624
7 1423
6 1323
5 1089
4 888
3 770
2 535
1 347
0 170
dtype: int64
>>> s.value_counts(normalize=True).head()
9 0.1831
8 0.1624
7 0.1423
6 0.1323
5 0.1089
dtype: float64
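As a minimal sketch of what those two calls give you: .value_counts() orders its result by frequency, so a histogram in value order needs .sort_index(), and normalize=True turns the counts into fractions of the total. The seeded generator and the names by_value and shares are assumptions made for illustration, so the counts will differ from the output above.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is repeatable (the seed is arbitrary).
rng = np.random.default_rng(0)
data = rng.choice(np.arange(10), size=10_000, p=np.linspace(1, 11, 10) / 60)
s = pd.Series(data)

counts = s.value_counts()                # ordered by frequency, highest first
by_value = counts.sort_index()           # reordered by the value itself
shares = s.value_counts(normalize=True)  # each count divided by the total

print(by_value)
print(shares.sum())
```

Because by_value is still a Series, it can be passed straight to further processing or plotting.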
Elsewhere, pandas.cut() is a convenient way to bin values into arbitrary intervals. Let’s say you have some data on ages of individuals and want to bucket them sensibly:
>>> ages = pd.Series(
... [1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
>>> bins = (0, 10, 13, 18, 21, np.inf) # The edges
>>> labels = ('child', 'preteen', 'teen', 'military_age', 'adult')
>>> groups = pd.cut(ages, bins=bins, labels=labels)
>>> groups.value_counts()
child 6
adult 5
teen 3
military_age 2
preteen 1
dtype: int64
>>> pd.concat((ages, groups), axis=1).rename(columns={0: 'age', 1: 'group'})
age group
0 1 child
1 1 child
2 3 child
3 5 child
4 8 child
5 10 child
6 12 preteen
7 15 teen
8 18 teen
9 18 teen
10 19 military_age
11 20 military_age
12 25 adult
13 30 adult
14 40 adult
15 51 adult
16 52 adult
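One detail worth checking in a grouping like this is which side of each bin is closed. The sketch below reuses the ages, bins, and labels from above. The names intervals, right_closed, and left_closed are introduced here only for illustration. By default pandas.cut() uses right=True, so an age of 10 lands in (0, 10] and counts as a 'child'. Passing right=False closes the left edge instead.

```python
import numpy as np
import pandas as pd

ages = pd.Series(
    [1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
bins = (0, 10, 13, 18, 21, np.inf)  # The edges
labels = ('child', 'preteen', 'teen', 'military_age', 'adult')

# Without labels, each age is tagged with the interval it falls into.
intervals = pd.cut(ages, bins=bins)
print(intervals.cat.categories)

# Default right=True: intervals are (0, 10], (10, 13], and so on.
right_closed = pd.cut(ages, bins=bins, labels=labels)
# right=False: intervals are [0, 10), [10, 13), and so on.
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# Index 5 holds the age 10, index 8 holds the first age of 18.
print(right_closed[5], left_closed[5])
print(right_closed[8], left_closed[8])
```

Which convention is appropriate depends on how the age groups are defined. The original example relies on the default.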
What’s nice is that both of these operations ultimately utilize Cython code that makes them competitive on speed while maintaining their flexibility.
00:00 Now that you can use a variety of graphing libraries for your histograms, we’ll cover some tools available in Pandas to give you some more control over them.
00:10 The first one is the .value_counts() method, which computes a histogram from your data and turns it into a Pandas Series. So start from a little dataset, which you can build by calling np.random.choice(),
00:27 and then pass in a NumPy arange(),
00:43 and set p=np.linspace() from 1 to 11, with 10 values, then divide that by 60. And then say s is just going to equal a Pandas Series() from data.
01:12 You can go ahead and print s and then just call .value_counts(). And when you take a look at this, and if I open this up, you can see each value and how often it appears in that dataset. This is similar to before, when we turned these into dictionaries.
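To make that comparison concrete, here is a rough sketch that counts the same values both ways. The dictionary-based code from earlier is not shown here, so collections.Counter stands in for it as an assumption. The tiny data list is made up for illustration.

```python
from collections import Counter

import pandas as pd

data = [3, 1, 3, 2, 3, 1]  # made-up values, small enough to check by eye

as_dict = Counter(data)                     # mapping of value -> count
as_series = pd.Series(data).value_counts()  # same counts, as a Series

print(as_dict)
print(as_series)

# Both hold the same value -> count pairs.
print(as_series.to_dict() == dict(as_dict))
```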
01:37 But by turning them into a Pandas Series, you get a couple more options. Because the result is a Series, you’re free to use any method that you would use on a regular Pandas Series.
01:50 So if I just called .head() on that, you can see that only the first five results show up now. Another nice thing built into .value_counts() is the ability to normalize the data with the normalize parameter, which, if you set that equal to True,
02:17 just goes ahead and converts each count into a fraction of the total, so the values sum to 1. The big thing about .value_counts() is that it returns that Pandas Series, which gives you a lot of flexibility for any further processing or graphing that you need to do with that data. Another tool in Pandas is pandas.cut().
02:34 I’m going to go ahead and make a Series called ages. It’s going to have quite a bit going on in here. We’ll say [1, 1]…
03:15 And actually, bring this over. And then assign bins to a list.
04:01 And then put these ages into the groups. So you can just say groups = pd.cut(), pass in ages, set the bins=bins, and the labels=labels.
04:19 Then I’m just going to go ahead and take groups and print the .value_counts() from that. But before I run that, let’s take a look at what’s happening here.
04:28 You can see that you have five different labels and six different bin values. These values are actually the bin edges. So a 'child' would be from 0 to 10, 'preteen' would be from 10 to 13, and so on.
04:43 Calling cut() here will then assign each of the ages in this Series to the bin that it belongs to. And then using .value_counts() will count how many ages landed in each bin.
04:54 So let’s take a look and see. Bring this up. And yeah, everything’s been categorized correctly. You’ve got 6 children, 5 adults, 3 teens, and so on.
05:12 Everything we’ve looked at up until this point has set the bins automatically based on the dataset, which makes all the bins the same size, using either default values or the number of bins that you specify.
05:26 Using pandas.cut() allows you to set your own bin sizes, which is very useful for things like this, where ages don’t necessarily fall into even ranges. All right!
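The contrast between the two approaches can be sketched on the same ages. Here np.histogram() with bins=5 is an assumed stand-in for the equal-width binning described above, and the names counts, edges, and custom are introduced only for this example.

```python
import numpy as np
import pandas as pd

ages = pd.Series(
    [1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])

# An integer bin count splits the data range into equal-width bins.
counts, edges = np.histogram(ages.to_numpy(), bins=5)
print(np.diff(edges))  # every bin has the same width
print(counts)

# Explicit edges let the bins follow the age groups instead.
custom = pd.cut(ages, bins=(0, 10, 13, 18, 21, np.inf))
print(custom.value_counts().sort_index())
```

With equal widths, the first two bins absorb most of the ages. The hand-picked edges keep the narrow preteen and teen ranges separate.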
05:37 Now you know a couple different ways to make histograms, how to plot those histograms, and some different tools in Pandas to change up how you actually produce those histograms.
05:46 This is quite a lot of info to take in, so in the next video, we’re going to summarize everything we talked about and talk about which applications are best for each method. Thanks for watching.