NumPy Histograms
So far, you’ve been working with what could best be called frequency tables. But mathematically, a histogram is a mapping of bins (intervals) to frequencies. More technically, it can be used to approximate the probability density function (PDF) of the underlying variable.
A true histogram first bins the range of values and then counts the number of values that fall into each bin. This is what NumPy’s histogram() does, and it’s the basis for other functions you’ll see here later in Python libraries such as Matplotlib and Pandas.
Consider a sample of floats drawn from the Laplace distribution. This distribution has fatter tails than a normal distribution and has two descriptive parameters (location and scale):
>>> import numpy as np
>>> # `numpy.random` uses its own PRNG.
>>> np.random.seed(444)
>>> np.set_printoptions(precision=3)
>>> d = np.random.laplace(loc=15, scale=3, size=500)
>>> d[:5]
array([18.406, 18.087, 16.004, 16.221, 7.358])
In this case, you’re working with a continuous distribution, and it wouldn’t be very helpful to tally each float independently, down to the umpteenth decimal place. Instead, you can bin or bucket the data and count the observations that fall into each bin. The histogram is the resulting count of values within each bin.
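As a minimal sketch of that binning step, the snippet below recreates the same sample and passes it to np.histogram() with its default settings. It is written as a standalone script rather than a REPL session, and it regenerates d itself so that it runs on its own.

```python
import numpy as np

# Recreate the sample from above so this snippet is self-contained.
np.random.seed(444)
d = np.random.laplace(loc=15, scale=3, size=500)

# With no `bins` argument, np.histogram() uses 10 equal-width bins.
hist, bin_edges = np.histogram(d)

print(hist)        # number of observations in each bin
print(bin_edges)   # boundaries that separate the bins

# 10 counts, 11 edges, and every one of the 500 observations is counted.
print(hist.size, bin_edges.size, hist.sum())  # 10 11 500
```

The two returned arrays always differ in length by one, because each bin needs both a left and a right boundary.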
00:00 To make actual histograms instead of frequency tables, you’ll need to define bins to collect your data points. Bins are simply intervals with lower and upper limits, and you count how many points fall into each one. To make this a bit clearer, we’re going to generate a set of random data using NumPy.
00:17 Go ahead and import numpy as np. And to make sure you get the same values as I do, because NumPy has its own random number generator, you can set a seed here, which this time will be 444.
00:35 And to make sure things are easily viewable, you can also set a print option to set the precision to something like 3 decimal places. Okay.
00:48 Now go ahead and make an array using NumPy’s random module, and you’re going to draw from a Laplace distribution. Set loc (location) equal to 15, the scale to 3, and generate 500 data points. And to see what this looks like, you can go ahead and print.
01:13 Let’s just print out the first five values from this distribution. Save this, open up a terminal,
01:29 and as you can see, you can probably imagine there are going to be quite a few individual values here—or unique values—so it wouldn’t make sense to count each occurrence of each value.
01:42 This is where bins come into play. I’m going to delete this print statement here. I’m going to make a hist and bin_edges
01:55 using NumPy’s histogram(). And pass in that dataset.
02:05 So now you can print out hist, and I’ll just print a blank line in between, and then also print out the bin_edges.
02:21 And let me open this up a little bit.
02:25 Calling this is actually doing a couple things here. This hist piece, which shows up first, is the frequency counts that would appear for each bin.
02:35 So in this case, the first bin had 13 occurrences, then there were 23 occurrences, 91, and so on. And if you count this, you’ll see that there are 10 values here.
02:46 The bin_edges is a little bit different, and this actually shows you where the bin edges are. So you can think, if you had your chart out here, this is where the cuts would be to separate each group of data points.
02:59 And if you count these out, you’ll notice that there are 11 values. And this makes sense because you would need 11 borders to define 10 groups of values. So the first bin would go from 2.11 to 5.874.
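One way to see why 11 edges describe 10 bins is to pair each edge with the one that follows it. The sketch below assumes the same seeded sample as in the lesson; the name intervals is introduced here only for illustration.

```python
import numpy as np

# Same seeded sample as in the lesson.
np.random.seed(444)
d = np.random.laplace(loc=15, scale=3, size=500)
hist, bin_edges = np.histogram(d)

# Pair every edge with its successor: 11 edges give 10 (left, right) intervals.
intervals = list(zip(bin_edges[:-1], bin_edges[1:]))

for (left, right), count in zip(intervals, hist):
    print(f"{left:7.3f} to {right:7.3f}: {count}")
```

Each interval's right edge is the next interval's left edge, so neighboring bins share exactly one boundary.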
03:14 And then the second bin would be 5.874 to 9.638. Now you may be wondering, “If I have a number that’s something like 5.874, where will it fall? Inside the first bin or in the second bin?” And that’s a good question. With NumPy, each bin is inclusive on its left edge and exclusive on its right edge.
03:33 So if you had a value that was 5.874, it would be excluded from the first bin and it would be included in the second bin. So each bin contains any values that equal its left edge, and that continues on throughout, except for the last bin, which also includes its right edge.
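The edge rule is easier to check on a tiny hand-made example than on the random sample. The values and edges below are made up for illustration. They show that a value sitting exactly on an interior edge is counted in the bin to its right, while the last bin also keeps values equal to its right edge.

```python
import numpy as np

# Hypothetical toy data: two values sit exactly on the interior edge at 2.
values = [1, 2, 2, 3]
edges = [1, 2, 3]  # two bins: [1, 2) and [2, 3]

hist, bin_edges = np.histogram(values, bins=edges)

# The two 2s land in the second bin, and so does the 3,
# because the last bin is closed on the right.
print(hist)  # [1 3]
```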
03:47 Now, you may be wondering how NumPy determined these bin edges, and the answer is pretty straightforward. NumPy is just looking for the smallest value in the dataset and the largest value in the dataset.
04:00 It then divides that range by 10, which is the default number of bins, to generate 10 equal-width bins. All right! Now you know how to use NumPy to generate bins for your dataset and you’re almost ready to start plotting these out.
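The default edges can be rebuilt by hand as a rough check of that description. This sketch assumes the same seeded sample and uses np.linspace() to space 11 points evenly between the minimum and the maximum; manual_edges and width are names introduced here for illustration.

```python
import numpy as np

# Same seeded sample as in the lesson.
np.random.seed(444)
d = np.random.laplace(loc=15, scale=3, size=500)
hist, bin_edges = np.histogram(d)

# 10 equal-width bins need 11 evenly spaced edges from min to max.
manual_edges = np.linspace(d.min(), d.max(), num=11)
width = (d.max() - d.min()) / 10

print(np.allclose(manual_edges, bin_edges))       # True
print(np.allclose(np.diff(bin_edges), width))     # True
```

Passing a different bins or range argument to np.histogram() changes these edges; the check above only covers the default call.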
04:13 The next video will cover using Matplotlib and Pandas to start making some great looking histograms. Thanks for watching.