Plotting and Analyzing the Data
00:00
One of the first things we may want to do is check the letter grade distribution. This is fairly straightforward. In the 'Final Grade' column containing all the letter grades, we may want to do a simple count of how many A’s, how many B’s, how many C’s, and so on. We can use the .value_counts() method
00:22
in a Series, and this will just give us a count. By default, the counts go from highest to lowest. Now, this is where it will pay off that we set the final grade Series as categorical data,
00:41
and it’ll know that 'A' is the highest grade, 'B' is the second highest, and so on. And so now, things will be sorted by grade, from lowest to highest.
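As a rough sketch of the counting step just described: the grades below are made up, and the call to .sort_index() is an assumption about how the grade ordering is applied, since the transcript does not spell out that call.

```python
import pandas as pd

# Hypothetical stand-in for the course's final DataFrame; the grades are made up.
letter_grades = ["C", "B", "C", "D", "B", "C"]
final_df = pd.DataFrame(
    {
        "Final Grade": pd.Categorical(
            letter_grades,
            categories=["F", "D", "C", "B", "A"],  # lowest to highest
            ordered=True,
        )
    }
)

# Default ordering: largest count first.
by_count = final_df["Final Grade"].value_counts()

# Assumption: sorting by the index follows the category order instead (F up to A).
grade_counts = final_df["Final Grade"].value_counts().sort_index()
print(grade_counts)
```

Because the column is categorical, grades that nobody received still appear in the result with a count of zero.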
00:52
Then we can, say, plot this as a basic bar chart. Why don’t we save this as, say, grade_counts and then just call the .plot() method on grade_counts.
01:07
The type of plot that we want is a bar graph. This gives us just a basic letter grade distribution, and we see that nobody got an 'A', nobody got an 'F', and the majority of the grades were a 'C'.
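A minimal sketch of the bar graph, assuming made-up counts in place of the real grade_counts. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend; drop this in a notebook
import pandas as pd

# Hypothetical counts, indexed from lowest to highest letter grade.
grade_counts = pd.Series([0, 1, 3, 2, 0], index=["F", "D", "C", "B", "A"])

# kind="bar" asks the .plot() method for a bar graph, one bar per letter grade.
ax = grade_counts.plot(kind="bar")
ax.set_ylabel("Number of students")
```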
01:23
Now, to get a better picture of the grade distribution, let’s use the 'Final Score' column instead. And in this case, we’ll use a histogram to plot the grade distribution.
01:34
So, from the final_df (final DataFrame), we’ve got the 'Final Score'. This was the column containing the final scores as fractions rather than whole-number percentages, so the values are less than 1.
01:46
We want to plot the data and we want to use a histogram with, let’s say, 20 bins. We’re also going to want to compare the grade distribution with, say, the normal distribution, and we may also want an estimate of the distribution, so we’re going to superimpose a couple of figures.
02:06
Let’s label this one as the 'Grade Distribution'. This would be the actual grade distribution, based on the final score.
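The histogram step might look something like the sketch below. The scores are randomly generated stand-ins, not the course data, and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend; drop this in a notebook
import numpy as np
import pandas as pd

# Hypothetical final scores between 0 and 1 (made-up mean and spread).
rng = np.random.default_rng(0)
final_df = pd.DataFrame(
    {"Final Score": rng.normal(0.72, 0.08, 150).clip(0, 1)}
)

# Histogram of the scores with 20 bins, labeled for a later legend.
ax = final_df["Final Score"].plot.hist(bins=20, label="Grade Distribution")
```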
02:20
Let’s compare this with the normal distribution, with the mean and the standard deviation coming from the actual data. Let’s compute the actual grade average, or mean, from the final data.
02:37
We’ve got the 'Final Score' and then we can use the .mean() method, which will give us the average. Then we’ll also compute the standard deviation,
02:50
and so we’ll use the .std() method. Then we want to use these values with a normal distribution so that we can see how close the actual grade distribution is to being normal.
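For illustration, here is how those two values could be computed on a tiny made-up column. The names grade_mean and grade_std follow the transcript, and the three scores are hypothetical.

```python
import pandas as pd

# Hypothetical scores, chosen so the results are easy to check by hand.
final_df = pd.DataFrame({"Final Score": [0.6, 0.7, 0.8]})

grade_mean = final_df["Final Score"].mean()
# pandas computes the sample standard deviation (ddof=1) by default.
grade_std = final_df["Final Score"].std()

print(grade_mean, grade_std)
```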
03:04
We’re going to load the scipy.stats module, which contains a function that will give us the values of the normal distribution. We’re going to obtain the values and then we’re going to plot them, so we want to import the pyplot module.
03:21
So let’s import matplotlib,
03:25
the pyplot module as plt. What we want to do first is generate a set of x points, starting from the mean of the actual grades and reaching five standard deviations away, so that we’re getting a range.
03:45
So we want to go from the grade_mean, 5 standard deviations to the right, and this should be grade_std. And we want to use 200 points. For the y values, we want the values of the normal distribution with this mean and this standard deviation, and that’s where the scipy.stats module comes in.
04:10
It has a function called .pdf(), which gives us the actual values of the normal distribution at these x values. We use the actual grade_mean as the mean, and the standard deviation goes in the scale keyword argument.
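A sketch of generating the x and y values, under two assumptions: the mean and standard deviation below are made-up numbers, and the range is taken to be symmetric (five standard deviations on each side of the mean), which the transcript does not state outright. The mean is passed as loc, which is the scipy.stats counterpart to scale.

```python
import numpy as np
import scipy.stats

# Hypothetical values standing in for the computed mean and standard deviation.
grade_mean = 0.72
grade_std = 0.08

# Assumption: 200 evenly spaced points, five standard deviations either side.
x = np.linspace(grade_mean - 5 * grade_std, grade_mean + 5 * grade_std, 200)

# Normal density at each x: loc is the mean, scale is the standard deviation.
y = scipy.stats.norm.pdf(x, loc=grade_mean, scale=grade_std)
```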
04:30
We want to use the standard deviation there. We want to plot these x and y values together, superimposed on top of the histogram, and so we’ll plot x and y. We’ll label this as the 'Normal Distribution',
04:50
and maybe we use a linewidth of 3. Let’s go ahead and run that. So, there we go. We’ve got the actual grade distribution, and superimposed we’ve got the normal distribution. And we should probably put in a legend here.
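Putting the pieces together, the superimposed figure could be sketched as follows. The scores are randomly generated stand-ins, the symmetric x range is an assumption, and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend; drop this in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats

# Hypothetical final scores between 0 and 1.
rng = np.random.default_rng(0)
scores = pd.Series(rng.normal(0.72, 0.08, 150).clip(0, 1), name="Final Score")

grade_mean = scores.mean()
grade_std = scores.std()

# Assumption: the x range is symmetric around the mean.
x = np.linspace(grade_mean - 5 * grade_std, grade_mean + 5 * grade_std, 200)
y = scipy.stats.norm.pdf(x, loc=grade_mean, scale=grade_std)

# Histogram first, then the normal curve drawn on the same axes.
scores.plot.hist(bins=20, label="Grade Distribution")
plt.plot(x, y, label="Normal Distribution", linewidth=3)
plt.legend()
```

One thing to watch in a sketch like this: the histogram shows raw counts while the curve is a density, so the two may sit on different vertical scales. Passing density=True to the histogram is one possible way to line them up, though the transcript does not say whether that is done here.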
05:11
Okay. So there, we’ve got the normal distribution and the actual grade distribution with the histogram. Then we could also obtain an estimate of the actual distribution.
05:23
For this, we’ll use the .density() function in the .plot attribute of a pandas Series. Let’s come over here and let’s add another plot
05:37
using the 'Final Score' column. This will be the density plot. Here we’re obtaining what’s called the kernel density estimate, which is an estimate of the actual distribution of the grades as a continuous distribution.
05:53
We’ll use a linewidth of 3 as well.
05:57
We’ll call this the 'Kernel Density Estimate'.
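A minimal sketch of the density plot on its own, again with randomly generated stand-in scores. Note that pandas relies on SciPy to compute the kernel density estimate, and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend; drop this in a notebook
import numpy as np
import pandas as pd

# Hypothetical final scores between 0 and 1.
rng = np.random.default_rng(0)
final_df = pd.DataFrame(
    {"Final Score": rng.normal(0.72, 0.08, 150).clip(0, 1)}
)

# Kernel density estimate: a smooth, continuous estimate of the distribution.
ax = final_df["Final Score"].plot.density(
    linewidth=3, label="Kernel Density Estimate"
)
ax.legend()
```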
06:08
We see that both the normal distribution and the kernel density estimate do a pretty good job of estimating the grades, and so we could probably say this is a fairly average class in terms of the grade distribution.
06:24
So, these are just a couple of things that you may want to do for a statistical look at the grade distribution. Overall, the kernel density estimate and the normal distribution do a pretty good job of matching the data. All right, let’s wrap things up with the summary.