So far, you’ve been working with what could best be called frequency tables. But mathematically, a histogram is a mapping of bins (intervals) to frequencies. More technically, once its bar heights are normalized so that the total area is 1, it can be used to approximate the probability density function (PDF) of the underlying variable.
A true histogram first bins the range of values and then counts the number of values that fall into each bin. This is what NumPy’s histogram() does, and it’s the basis for other functions you’ll see here later in Python libraries such as Matplotlib and Pandas.
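To make the bin-then-count idea concrete, here is a minimal sketch using a tiny, made-up array (the values are hypothetical and chosen only so the result is easy to check by hand). It assumes equal-width bins, which is what np.histogram() produces when bins is an integer:

```python
import numpy as np

# Hypothetical toy data, chosen so the bins are easy to verify by hand.
values = np.array([1, 2, 2, 3, 5])

# Step 1: split the data's range (1 to 5) into 4 equal-width bins.
# Step 2: count how many values fall into each bin.
counts, bin_edges = np.histogram(values, bins=4)

print(bin_edges)  # [1. 2. 3. 4. 5.]
print(counts)     # [1 2 1 1]
```

Note that 4 bins are described by 5 edges. Each bin includes its left edge and excludes its right edge, except the last bin, which also includes its right edge. That is why the value 5 is counted in the final bin rather than dropped.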
Consider a sample of floats drawn from the Laplace distribution. This distribution has fatter tails than a normal distribution and has two descriptive parameters (location and scale):
>>> import numpy as np
>>> # `numpy.random` uses its own PRNG.
>>> np.random.seed(444)
>>> np.set_printoptions(precision=3)
>>> d = np.random.laplace(loc=15, scale=3, size=500)
>>> d[:5]
array([18.406, 18.087, 16.004, 16.221,  7.358])
In this case, you’re working with a continuous distribution, and it wouldn’t be very helpful to tally each float independently, down to the umpteenth decimal place. Instead, you can bin or bucket the data and count the observations that fall into each bin. The histogram is the resulting count of values within each bin.
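As a rough sketch of that binning step applied to the Laplace sample, the snippet below regenerates d with the same seed and passes it to np.histogram(). It relies on NumPy's default of 10 equal-width bins, which is an arbitrary choice made here for illustration, not a recommendation. It also shows the density=True option, which rescales the bar heights so that the total area is 1:

```python
import numpy as np

# Recreate the sample from above so this snippet runs on its own.
np.random.seed(444)
d = np.random.laplace(loc=15, scale=3, size=500)

# Raw counts over NumPy's default of 10 equal-width bins.
counts, bin_edges = np.histogram(d)

# Every observation lands in exactly one bin.
print(counts.sum())  # 500

# With density=True, heights are rescaled so the bars' total area is 1,
# which is the form that approximates a PDF.
density, _ = np.histogram(d, density=True)
area = np.sum(density * np.diff(bin_edges))
print(np.isclose(area, 1.0))  # True
```

The density values are not probabilities themselves. Multiplying each one by its bin width gives the fraction of observations in that bin.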