Kernel Density Estimates
In this course, you’ve been working with samples, statistically speaking. Whether the data is discrete or continuous, it’s assumed to be derived from a population that has a true, exact distribution described by just a few parameters.
Kernel density estimation (KDE) is a way to estimate the probability density function (PDF) of the random variable that underlies our sample. KDE is a means of data smoothing.
Sticking with the Pandas library, you can create and overlay density plots using plot.kde(), which is available for both Series and DataFrame objects.
00:00 So far, you’ve been looking at sample data, in the sense that it’s not truly representative of the population. In other words, by forcing the data to fit into certain bins, you lose some of the continuity of your data, which might not occur in the full population.
00:15 This is where kernel density estimates come into play. Kernel density estimates, or KDEs, attempt to solve this problem by estimating the underlying data and basically smoothing out the histogram. To see this in action, you’ll want to create two sets of distinct data.
00:33 Create a set of means equal to 10 and 20, and then some stdevs (standard deviations) equal to 4 and 2. Now create a DataFrame called dist, which will be a pd.DataFrame(), and in here, we’re going to pass in np.random and normal distributions,
01:02 with loc (location) set on the means, scale set by the stdevs,
01:14 a size equal to 1000 by 2, and name these columns 'a' and 'b'.
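The steps described so far can be sketched roughly as follows. This is a reconstruction from the narration rather than a copy of the on-screen code, and the exact layout of the call is an assumption. Because no random seed is set, your numbers will differ from the ones in the video.

```python
import numpy as np
import pandas as pd

# Two distinct distributions: 'a' centered at 10, 'b' centered at 20
means = 10, 20
stdevs = 4, 2

# loc and scale are broadcast across the two columns,
# so each column gets its own mean and standard deviation
dist = pd.DataFrame(
    np.random.normal(loc=means, scale=stdevs, size=(1000, 2)),
    columns=["a", "b"],
)

print(dist.head())
```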
01:30 And just to see what these look like, you can print out dist.agg() (aggregate)
01:42 and get the 'min', 'max', 'mean' and 'std' (standard deviation).
01:52 And go ahead and round these to 2 decimal points. Alrighty! Let’s take a look.
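A minimal sketch of that summary step, assuming dist was built as described earlier in the lesson (it is rebuilt here so the snippet runs on its own). The variable name summary is only for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data so this snippet is self-contained
dist = pd.DataFrame(
    np.random.normal(loc=(10, 20), scale=(4, 2), size=(1000, 2)),
    columns=["a", "b"],
)

# One row per statistic, one column per distribution, rounded to 2 places
summary = dist.agg(["min", "max", "mean", "std"]).round(2)
print(summary)
```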
02:06 Cool. So now you have a set of data called a that ranges from about -2 to 24 and a set called b from about 14 to 26. And going by the standard deviations, you can tell that a should be spread out quite a bit more, and going by the means, its values should also be smaller on average than b's.
02:28 But since we’re talking about histograms, let’s just plot these out so we can take a look. Go ahead and delete this. Then you’re going to make a figure and an axes with plt.subplots(), and you’re going to plot dist two different ways.
02:50 So you can plot that KDE, place it on the axes you just made, and no legends necessary, and set the title to 'Histogram: A vs. B'.
03:13 And then also overlay a histogram on that, where you can set density=True and place it on that axes. And I forgot to add .plot in here, so we’re just going to add that real quick, spelled correctly.
03:34 All right. On that axes, set the y label equal to 'Probability'. Get your grid set up, axis='y'. And just to make things easier to see, set the face color to this hex value here, which will just be '#d8dcd6', which will be slightly grayer than white.
04:13 And, like before, plt.show().
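Put together, the plotting steps might look like the sketch below. It is assembled from the narration, so the argument order and layout are assumptions, and dist is rebuilt so the snippet stands alone. Note that plot.kde() relies on SciPy, so it needs to be installed alongside Pandas and Matplotlib.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Rebuild the sample data so this snippet is self-contained
dist = pd.DataFrame(
    np.random.normal(loc=(10, 20), scale=(4, 2), size=(1000, 2)),
    columns=["a", "b"],
)

fig, ax = plt.subplots()

# Smooth density curves for both columns (requires SciPy)
dist.plot.kde(ax=ax, legend=False, title="Histogram: A vs. B")

# Histograms on the same axes; density=True rescales the bars
# so they share the y-axis scale of the KDE curves
dist.plot.hist(density=True, ax=ax)

ax.set_ylabel("Probability")
ax.grid(axis="y")
ax.set_facecolor("#d8dcd6")

plt.show()
```

The y-axis is labeled 'Probability' to match the video. Strictly speaking, with density=True the bar heights are probability densities, so it is the total area of each histogram that equals 1 rather than the sum of the bar heights.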
04:27 And look at that! You can see you’ve got your histograms here listed out. They’ve got their bins and their frequency counts. And then you also have these new lines here, and these are those kernel density estimates.
04:41 What these are attempting to do is smooth out the non-continuous nature of each of these bins and try to estimate where all the other values would fall into place. A good way to look at this would be, like, this bin here, which looking at this is somewhere from, I don’t know, 21 to 23 and a half?
05:05 And based on the bin, you would expect there to be equal probability—based on over here, your y-axis—of a value being somewhere on this end of the bin or over here.
05:17 But based on the entire histogram, you probably have a good idea that that’s not the case, and you’d have a higher possibility of having a value closer to this end of the bin, than over here.
05:27 And that’s what this line is trying to show for you. And that’s it! Kernel density estimates can add a lot to your histograms and there’s a lot of science behind them.
05:38 The nice thing about Pandas is it makes it very easy to add one of these to a histogram, so if you feel the need to do so, it’s literally only one more line of code. All right! In the next video, you’re going to learn how to use Seaborn, which is a plotting library that makes some very nice looking graphs.
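For a rough idea of what a KDE computes, here is a hand-rolled sketch using only NumPy. It is not how Pandas implements plot.kde(); the helper gaussian_kde_1d, the seed, and the grid are all hypothetical choices for illustration. The bandwidth follows Scott's rule of thumb, which is assumed here to be close to the default that Pandas picks up from SciPy.

```python
import numpy as np

# Hypothetical sample resembling column 'a' from the lesson
rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=4, size=1000)


def gaussian_kde_1d(sample, grid, bandwidth):
    """Average one Gaussian bump per data point, evaluated on grid."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth


# Scott's rule of thumb for one-dimensional data (an assumption, see above)
bandwidth = sample.std(ddof=1) * len(sample) ** (-1 / 5)

# Evaluate a little beyond the data so the tails are included
grid = np.linspace(
    sample.min() - 4 * bandwidth, sample.max() + 4 * bandwidth, 500
)
density = gaussian_kde_1d(sample, grid, bandwidth)

# Riemann-sum approximation of the area under the curve
area = density.sum() * (grid[1] - grid[0])
print(f"bandwidth: {bandwidth:.3f}, area under curve: {area:.3f}")
```

The area should come out very close to 1, which is what makes the curve a density estimate: the height at a point is not itself a probability, but the area over an interval approximates the probability of landing in that interval.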
alazejha on May 16, 2021
I went till the very end and the last command produced an error: No module named ‘scipy’, How should I proceed?
alazejha on May 16, 2021
found solution!
Martin Breuss RP Team on May 17, 2021
Hi @alazejha, great that you found a solution! It’s helpful for future learners if you post your solution here as well.
My assumption is that you probably needed to install scipy:
$ python -m pip install scipy
piyushathawale on June 1, 2021
Is the y axis in kde probability or probability density? I am confused between how to interpret both.
Dawn0fTime on Aug. 11, 2021
@Martin Breuss thank you! Yes, installing the scipy module did the trick for me.
Pygator on Sept. 16, 2019
what does density=True do? and rwidth kwarg from before, don’t know what these do.