Merging the Grade DataFrames
To start with the merging, we’re going to work with the roster DataFrame and the homework exam grades DataFrame. Both of these, their indices—so for the roster, it’s
NetID and for the homework exam grades, it’s
SID—they point to the same data, so it’s simply two different names for the same data that they’re storing, the same strings. So this
SID, it’s going to uniquely identify each student, and so we’re going to be merging these two DataFrames on the index.
Let’s go ahead and do that. Maybe what we’ll do is, let’s call it, maybe,
final_data. Or maybe we’ll call it, say,
merge() function in pandas is the function that does the merging. There’s other ways to do it and we’ll talk about that in a second.
We want to merge the
roster and we want to merge the
hw_exam_grades to create a new DataFrame. Now, the default behavior of
merge() is to find a common column on both of the two DataFrames that you want to merge and use the column as the key to line up the rows in each of the DataFrames to do the merging.
So essentially, it’s going to be a concatenation of the two DataFrames to create a new DataFrame. Now, both of these DataFrames—they don’t have a common column. If you recall, the
roster only has the email address and the section of the student, whereas the
hw_exam_grades has the first name, last name, and then all of the homework scores and all of the exam scores.
01:44 So when I run this, we’re going to get an error, but I just want you to see what happens.
01:49 The problem is, like we said—if I scroll down here at the bottom—there’s no common column to perform the merge on, right? So these DataFrames don’t have a common column.
So we need to specify the key that should be used to line up or to match up the data from the two DataFrames. Now, if they have more than one common column, then we need to specify either a
left_on column name for the left DataFrame—so that would be the first one in the arguments to the
merge() function—and then the right column would be the second one that we’re passing into the
merge() function and then we would specify what column.
02:30 So the idea there is that in the left DataFrame, we have some column that matches up with the right DataFrame on some other column, and they don’t have to even be the same name but just point to the same data so that we can do the matching.
02:44 We can also use the indices. We can specify that we want the keys to be either left index—so the index of the left DataFrame—or the right index of the right DataFrame. And we can mix and match, so for example, we can match on the left column and say the right index and so on.
Now, in our case, they do have the same index, so if we merge the two DataFrames on the indices—so
hw_exam_grades—on the left one, we’ve got the
left_index and this is a flag variable and the default is
False. We want to use that as the key for the left one, so we’ll pass in
True. And then same thing for the right DataFrame, we want to use the index as the key.
This will essentially take the rows of each of the two DataFrames and match them based on the same index value, and then simply concatenate and create a new DataFrame. All right, so let’s call this, say,
final_df like we had up there. And, you know, we should probably do this right—let’s get rid of this. And let’s just go back up here and do it right up here.
right_index=True. All right.
04:11 That ran nicely. Let’s take a look at the first five rows.
There, we’ve got
Email Address—that came from the roster and so did the
Section, and the indices are the same as they were. We just matched the two indices with the left and the right, and now we have all the data that came to us from the
hw_exam_grades DataFrame: the first name and the last name of the student and then all of the homework and all of the exam grades.
04:40 Because it’s pretty common to match on the indices,
there is also a
.join() method that we can use on a DataFrame with a DataFrame. So if we went
roster.join() and we want to join with the
hw_exam_grades, and then we don’t need any of this because that’s for the merge. All right, so this accomplishes the same thing.
We are joining to the
roster DataFrame the
hw_exam_grades, and the default behavior here is to do it on the indices, which they do have the same.
Let’s run that and let’s run this again, and we get the exact same thing. Okay, so that is the merge for the
roster with the
Now let’s take this
final_df, or final DataFrame, and let’s also now merge the quiz grades.
Let’s quickly just remind ourselves the
quiz_grades DataFrame. The columns are just simply the five quiz grades and the index is the email address, right?
quiz_grades index, email address—we know that in the final DataFrame that we’re just constructing now, we have a column that has all the email addresses, so that’s going to be our key to merge these two.
We’ll use the
It’s going to be that
final_df and we’re going to be merging the
quiz_grades, and for the left DataFrame, which is the
final_df, that’s going to be the key. So
left_on, we’re passing in what column to use as the key for the left DataFrame, the
final_df, it should be the
"Email Address", right?
Let me come back up here. That is the name,
"Email Address". And then for the right one, it’s the index. We want the key for the right one to be the right index, so we’ll pass that in as a
True value for that flag.
And that is it! So we’ll rename, or we’ll reconstruct, this
final_df DataFrame and we’ll run that, and that looks good. And then let’s just take a look at our final DataFrame.
07:09 All right, and so there we go.
07:12 This will then contain, in addition to the homework, the exams and the quizzes. All right, so now we’ve got all of the assessments for each student all in one DataFrame, but you may have noticed that we do have some missing values.
The default behavior there for pandas is to put in the
NaN value, so Not a Number. It could be that this particular student did not, say, submit the Homework 1, or for whatever reason just did not get a grade for that homework.
07:46 Most likely, they didn’t submit the homework. What we’ll need to do is, because we’re going to be doing numerical computations using these columns, we should set a value, a numerical value, for that.
We’ll set a
0 to all of the missing values in the DataFrame. To do that, we’re going to redefine a final DataFrame, and then the running DataFrame, call the
This will fill any of the
NaN values, or Not a Number values, with a specified value. In our case, we’d simply want to put a
All right, so let’s run that, and now let’s see the first five rows again. That value that was not a number is now
0, and so now we are, you know, making sure that we’ve got all numerical values and we’re ready to start calculating grades.
Hi @taylorchris121, you’ll get that error if you haven’t defined the
roster variable. Check back to the earlier lesson on Loading the Exam and Homework Data, where Cesar shows you how to load the content of the
roster.csv file into a DataFrame called
The link takes you right to the point in the video where he’s assigning the variable you’re missing here. Hope that helps!
Become a Member to join the conversation.
taylorchris121 on July 14, 2021
Everything was working fine upto here