# Kernel Density Estimates

In this course, you’ve been working with samples, statistically speaking. Whether the data is discrete or continuous, it’s assumed to be derived from a population that has a true, exact distribution described by just a few parameters.

A **kernel density estimation** (KDE) is a way to estimate the probability density function (PDF) of the random variable that underlies our sample. KDE is a means of data smoothing.

Sticking with the Pandas library, you can create and overlay density plots using `plot.kde()`, which is available for both `Series` and `DataFrame` objects.
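As a minimal sketch of what `plot.kde()` looks like on a `Series` (it relies on SciPy's `gaussian_kde` under the hood, so `scipy` must be installed; the Agg backend here is just so the snippet runs headlessly):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# A sample of 500 draws from a standard normal distribution
s = pd.Series(np.random.default_rng(0).normal(size=500))

# plot.kde() returns the Axes with a single smooth density curve
ax = s.plot.kde()
plt.show()
```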

**00:00**
So far, you’ve been looking at sample data, in the sense that it’s not truly representative of the population. In other words, by forcing the data to fit into certain bins, you lose some of the continuity of your data, which might not occur in the full population.

**00:15**
This is where kernel density estimates come into play. Kernel density estimates, or *KDEs*, attempt to solve this problem by estimating the underlying data and basically smoothing out the histogram. To see this in action, you’ll want to create two sets of distinct data.

**00:33**
Create a set of `means` equal to `10` and `20`, and then some `stdevs` (standard deviations) equal to `4` and `2`. Now create a `DataFrame` called `dist`, which will be a `pd.DataFrame()`, and in here, we're going to pass in `np.random.normal()`,

**01:02**
with `loc` (location) set to the `means`, `scale` set by the `stdevs`,

**01:14**
a `size` equal to `1000` by `2`, and name these columns `'a'` and `'b'`.
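The setup described above can be sketched like this (variable names `means`, `stdevs`, and `dist` follow the transcript):

```python
import numpy as np
import pandas as pd

means = 10, 20
stdevs = 4, 2

# 1000 rows, 2 columns: column 'a' ~ N(10, 4), column 'b' ~ N(20, 2)
dist = pd.DataFrame(
    np.random.normal(loc=means, scale=stdevs, size=(1000, 2)),
    columns=["a", "b"],
)
```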

**01:30**
And just to see what these look like, you can print out `dist.agg()` (aggregate)

**01:42**
and get the `'min'`, `'max'`, `'mean'`, and `'std'` (standard deviation).

**01:52**
And go ahead and round these to `2` decimal places. Alrighty! Let's take a look.
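Put together, the summary step might look like this sketch (the `DataFrame` is rebuilt here so the snippet stands alone):

```python
import numpy as np
import pandas as pd

# Same data as before: 'a' ~ N(10, 4), 'b' ~ N(20, 2)
dist = pd.DataFrame(
    np.random.normal(loc=[10, 20], scale=[4, 2], size=(1000, 2)),
    columns=["a", "b"],
)

# Aggregate a few summary statistics per column, rounded to 2 decimal places
summary = dist.agg(["min", "max", "mean", "std"]).round(2)
print(summary)
```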

**02:06**
Cool. So now you have a set of data called `a` with a minimum around -2 and a maximum around 24, and a set called `b` that runs from about 14 to 26. And going by the standard deviations, you can tell that `a` should be spread out quite a bit more, and its peak should also be smaller than `b`'s over here.

**02:28**
But since we're talking about histograms, let's just plot these out so we can take a look. Go ahead and delete this. Then you're going to make a figure and an axes with `plt.subplots()`, and you're going to plot `dist` two different ways.

**02:50**
So you can plot the KDE, place it on the axes you just made, no legend necessary, and set the `title` to `'Histogram: A vs. B'`.

**03:13**
And then also overlay a histogram on that, where you can set `density=True` and place it on that same axes. And I forgot to add `.plot` in here, so we're just going to add that real quick, spelled correctly.

**03:34**
All right. On that axes, set the *y* label equal to `'Probability'`. Get your grid set up with `axis='y'`. And just to make things easier to see, set the face color to this hex value here, `'#d8dcd6'`, which is slightly grayer than white.

**04:13**
And, like before, `plt.show()`.
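The plotting steps above can be sketched as follows (again rebuilding `dist` so the snippet is self-contained; the Agg backend just lets it run without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dist = pd.DataFrame(
    np.random.normal(loc=[10, 20], scale=[4, 2], size=(1000, 2)),
    columns=["a", "b"],
)

fig, ax = plt.subplots()

# Smooth density curves, one per column
dist.plot.kde(ax=ax, legend=False, title="Histogram: A vs. B")

# Overlay normalized histograms on the same axes so the scales match
dist.plot.hist(density=True, ax=ax)

ax.set_ylabel("Probability")
ax.grid(axis="y")
ax.set_facecolor("#d8dcd6")  # a light gray, slightly grayer than white

plt.show()
```

With `density=True`, the histogram bar heights are scaled so the total area sums to 1, which puts them on the same *y* scale as the KDE curves.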

**04:27**
And look at that! You can see you’ve got your histograms here listed out. They’ve got their bins and their frequency counts. And then you also have these new lines here, and these are those kernel density estimates.

**04:41**
What these are attempting to do is smooth out the non-continuous nature of each of these bins and try to estimate where all the other values would fall into place. A good way to look at this would be, say, this bin here, which looks like it spans somewhere from 21 to 23 and a half.

**05:05**
And based on the bin, you would expect there to be equal probability—based on over here, your *y*-axis—of a value being somewhere on this end of the bin or over here.

**05:17**
But based on the entire histogram, you probably have a good idea that that's not the case, and you'd have a higher probability of a value falling closer to this end of the bin than over there.

**05:27**
And that’s what this line is trying to show for you. And that’s it! Kernel density estimates can add a lot to your histograms and there’s a lot of science behind them.

**05:38**
The nice thing about Pandas is it makes it very easy to add one of these to a histogram, so if you feel the need to do so, it’s literally only one more line of code. All right! In the next video, you’re going to learn how to use Seaborn, which is a plotting library that makes some very nice looking graphs.

**alazejha** on May 16, 2021

I went till the very end and the last command produced an error: No module named 'scipy'. How should I proceed?

**alazejha** on May 16, 2021

found solution!

**Martin Breuss** RP Team on May 17, 2021

Hi @alazejha, great that you found a solution! It’s helpful for future learners if you post your solution here as well.

My assumption is that you probably needed to install `scipy`:

```
$ python -m pip install scipy
```

**piyushathawale** on June 1, 2021

Is the y axis in KDE probability or probability density? I am confused about how to interpret the two.

**Dawn0fTime** on Aug. 11, 2021

@Martin Breuss thank you! Yes, installing the scipy module did the trick for me.


**Pygatoron** on Sept. 16, 2019

What does `density=True` do? And the `rwidth` kwarg from before, I don't know what these do.