# Plotting and Analyzing the Data

**00:00**
One of the first things we may want to do is check the letter grade distribution. This is fairly straightforward. In the `'Final Grade'` column containing all the letter grades, we may want to do a simple count of how many A’s, how many B’s, how many C’s, and so on. We can use the `.value_counts()` method

**00:22**
on a Series, and this will just give us a count. By default, the counts are sorted from highest to lowest. Now, this is where it will pay off that we set the final grade Series as categorical data.

**00:41**
It’ll know that `'A'` is the highest grade, `'B'` is the second highest, and so on. And so now, things will be sorted from the lowest grade to the highest.
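As a rough sketch of the counting step, here is one way it could look. The letter grades below are made up, the category order `F` through `A` is an assumption about how the categorical was set up earlier, and the `.sort_index()` call is an assumption about how the lowest-to-highest ordering is obtained.

```python
import pandas as pd

# Hypothetical letter grades; the real ones live in the 'Final Grade' column.
letters = ["C", "B", "C", "D", "C", "B"]

# Assumed category order, lowest to highest, so that 'A' ranks as the top grade.
final_grade = pd.Series(
    pd.Categorical(letters, categories=["F", "D", "C", "B", "A"], ordered=True),
    name="Final Grade",
)

# Default ordering: the most frequent grade comes first.
by_count = final_grade.value_counts()

# Sorting by the categorical index instead: lowest grade to highest grade.
grade_counts = by_count.sort_index()
print(grade_counts)
```

Because the Series is categorical, grades that nobody received still show up with a count of zero.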

**00:52**
Then we can, say, plot this using a basic bar chart. Why don’t we save this as, say, `grade_counts` and then just call the `.plot()` method on `grade_counts`.

**01:07**
The type of plot that we want is a bar graph. This gives us just a basic letter grade distribution, and we see that nobody got an `'A'`, nobody got an `'F'`, and the majority of the grades were a `'C'`.
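A minimal sketch of the bar graph, assuming a `grade_counts` Series like the one described above. The counts here are invented for illustration, and the explicit `fig, ax` pair is just one convenient way to hold on to the axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts, ordered from lowest grade to highest.
grade_counts = pd.Series(
    [0, 1, 3, 2, 0], index=["F", "D", "C", "B", "A"], name="Final Grade"
)

fig, ax = plt.subplots()

# kind="bar" draws one bar per letter grade.
grade_counts.plot(kind="bar", ax=ax)
ax.set_xlabel("Final Grade")
ax.set_ylabel("Number of students")
```

In a script with an interactive backend, a final `plt.show()` would display the figure.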

**01:23**
Now, to get a better picture of the grade distribution, let’s use the `'Final Score'` column instead. And in this case, we’ll use a histogram to plot the grade distribution.

**01:34**
So, from the `final_df` (final DataFrame), we’ve got the `'Final Score'`. This was the column containing the final scores as percentages written as decimals, so less than 1.

**01:46**
We want to plot the data and we want to use a histogram with, let’s say, `20` bins. We’re also going to want to compare the grade distribution with, say, the normal distribution, and we may also want an estimate of the distribution, so we’re going to superimpose a couple of figures.

**02:06**
Let’s label this one as the `'Grade Distribution'`. This would be the actual grade distribution, based on the final score.
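The histogram step might look something like this. The scores are randomly generated stand-ins, and `final_df` here is a tiny hypothetical DataFrame rather than the real one.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical final scores between 0 and 1 for 150 students.
rng = np.random.default_rng(0)
final_df = pd.DataFrame(
    {"Final Score": rng.normal(0.75, 0.08, 150).clip(0, 1)}
)

fig, ax = plt.subplots()

# 20 bins, with a label that a legend can pick up later.
final_df["Final Score"].plot.hist(bins=20, label="Grade Distribution", ax=ax)
```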

**02:20**
Let’s compare this with the normal distribution, with the mean and the standard deviation coming from the actual data. Let’s compute the actual grade average, or mean, from the final data.

**02:37**
We’ve got the `'Final Score'` and then we can use the `.mean()` method, which will give us the average. Then we’ll also compute the standard deviation,

**02:50**
and so we’ll use the `.std()` method. Then we want to use these values with a normal distribution so that we can see how close the actual grade distribution is to being normal.
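Here is a small sketch of those two calls on made-up scores. The three values are chosen only so that the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical scores; the real values come from final_df['Final Score'].
final_df = pd.DataFrame({"Final Score": [0.6, 0.7, 0.8]})

grade_mean = final_df["Final Score"].mean()

# pandas computes the sample standard deviation (ddof=1) by default.
grade_std = final_df["Final Score"].std()

print(grade_mean, grade_std)
```

Note that `.std()` in pandas divides by `n - 1` by default, whereas NumPy's `np.std()` divides by `n` unless told otherwise.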

**03:04**
We’re going to load the `scipy.stats` module, which contains a function that will give us the values of the normal distribution. We’re going to obtain the values and then we’re going to plot them, so we want to import the `pyplot` module.

**03:21**
So let’s import `matplotlib`,

**03:25**
the `pyplot` module as `plt`. What we want to do is first generate a set of `x` points, and we want these to start from the mean of the actual grades and extend five standard deviations away, so that we’re getting a range.

**03:45**
So we want to go from the `grade_mean`, `5` standard deviations to the right, and the standard deviation should be `grade_std`. And we want to use `200` points. For the `y` values, we want to get the values of the normal distribution with this mean and this standard deviation, and that’s where the `scipy.stats` module comes in.

**04:10**
It has a function called `.pdf()`, which gives us the actual values of the normal distribution at these `x` values, using the actual `grade_mean` as the mean. We also want to pass the standard deviation, which is the `scale` keyword argument.
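A sketch of how the `x` and `y` values could be generated. Several things here are assumptions: the mean and standard deviation are made-up numbers, the range runs five standard deviations to both sides of the mean, the points come from `np.linspace()`, and the density is taken from `scipy.stats.norm`, whose `loc` argument is the mean and whose `scale` argument is the standard deviation.

```python
import numpy as np
import scipy.stats

# Hypothetical values; in practice these come from .mean() and .std().
grade_mean = 0.75
grade_std = 0.08

# 200 evenly spaced points covering five standard deviations on each side.
x = np.linspace(grade_mean - 5 * grade_std, grade_mean + 5 * grade_std, 200)

# Normal probability density at each x: loc is the mean, scale the std.
y = scipy.stats.norm.pdf(x, loc=grade_mean, scale=grade_std)
```

Five standard deviations is wide enough that the curve is essentially flat at both ends of the range.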

**04:30**
We want to use the standard deviation there. We want to plot these `x` and `y` values together, superimposed on top of the histogram, and so we’ll plot `x` and `y`. We’ll label this as the `'Normal Distribution'`,

**04:50**
and maybe we use a `linewidth` of `3`. Let’s go ahead and run that. So, there we go. We’ve got the actual grade distribution, and superimposed we’ve got the normal distribution. And we should probably put in a legend here.
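Putting the histogram and the normal curve on the same axes could look roughly like this. The scores are randomly generated stand-ins, and `density=True` is an assumption added here so that the bar heights and the curve share a vertical scale. It may not match what was typed in the video.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats

# Hypothetical final scores standing in for final_df['Final Score'].
rng = np.random.default_rng(0)
scores = pd.Series(rng.normal(0.75, 0.08, 150).clip(0, 1), name="Final Score")
grade_mean = scores.mean()
grade_std = scores.std()

fig, ax = plt.subplots()

# Assumption: density=True rescales the bars so they are comparable to a pdf.
scores.plot.hist(bins=20, density=True, label="Grade Distribution", ax=ax)

# Normal curve with the same mean and standard deviation as the data.
x = np.linspace(grade_mean - 5 * grade_std, grade_mean + 5 * grade_std, 200)
y = scipy.stats.norm.pdf(x, loc=grade_mean, scale=grade_std)
ax.plot(x, y, label="Normal Distribution", linewidth=3)

# The legend uses the label given to each plot call.
legend = ax.legend()
```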

**05:11**
Okay. So there, we’ve got the normal distribution and the actual grade distribution with the histogram. Then we could also obtain an estimate of the actual distribution.

**05:23**
For this, we’ll use the `.density()` method in the `.plot` attribute of a pandas `Series`. Let’s come over here and let’s add another plot

**05:37**
using the `'Final Score'` column. This will be the density plot. Here we’re obtaining what’s called the kernel density estimate, which is an estimate of the actual distribution of the grades as a continuous distribution.

**05:53**
We’ll use a `linewidth` of `3` as well.

**05:57**
We’ll call this the `'Kernel Density Estimate'`.
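A sketch of the kernel density estimate drawn over a histogram. The scores are again randomly generated, and `density=True` on the histogram is an assumption made so that both layers share a scale. `Series.plot.density()` relies on SciPy being installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical final scores standing in for the real column.
rng = np.random.default_rng(0)
final_df = pd.DataFrame(
    {"Final Score": rng.normal(0.75, 0.08, 150).clip(0, 1)}
)

fig, ax = plt.subplots()

# Assumption: a normalized histogram, so the bars and the KDE are comparable.
final_df["Final Score"].plot.hist(
    bins=20, density=True, label="Grade Distribution", ax=ax
)

# Smooth, continuous estimate of the same distribution.
final_df["Final Score"].plot.density(
    linewidth=3, label="Kernel Density Estimate", ax=ax
)

ax.legend()
```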

**06:08**
We see that both the normal distribution and the kernel density estimate do a pretty good job of estimating the grades, and so we could probably say this is a fairly average class in terms of the grade distribution.

**06:24**
So, these are just a couple of things that you may want to do to take a look at some of the statistical analysis of the grade distribution, but overall, the kernel density estimate and the normal distribution do a pretty good job of matching the data. All right, let’s wrap things up with the summary.
