Merging the Grade DataFrames
To start with the merging, we’re going to work with the roster DataFrame and the homework exam grades DataFrame. Both of these, their indices—for the roster, it’s
NetID and for the homework exam grades, it’s
SID—they point to the same data, so it’s simply two different names for the same data that they’re storing, the same strings. So this
SID, it’s going to uniquely identify each student, and so we’re going to be merging these two DataFrames on the index.
We want to merge the
roster and we want to merge the
hw_exam_grades to create a new DataFrame. Now, the default behavior of
merge() is to find a common column on both of the two DataFrames that you want to merge and use the column as the key to line up the rows in each of the DataFrames to do the merging.
So essentially, it’s going to be a concatenation of the two DataFrames to create a new DataFrame. Now, both of these DataFrames—they don’t have a common column. If you recall, the
roster only has the email address and the section of the student, whereas the
hw_exam_grades has the first name, last name, and then all of the homework scores and all of the exam scores.
The problem is, like we said—if I scroll down here the bottom—there’s no common column to perform the
merge() on, right? These DataFrames don’t have a common column, so we need to specify the key that should be used to line up or to match up the data from the two DataFrames. Now, if they have more than one common column, then we need to specify either a
left_on column name for the left DataFrame—so that would be the first one in the arguments to the
merge() function—and then the right column would be the second one that we’re passing into the
merge() function and then we would specify what column.
02:30 So the idea there is that in the left DataFrame, we have some column that matches up with the right DataFrame on some other column, and they don’t have to even be the same name, but just point to the same data so that we can do the matching.
02:44 We can also use the indices. We can specify that we want the keys to be either left index—so the index of the left DataFrame—or the right index of the right DataFrame. And we can mix and match, so for example, we can on the left column and say the right index and so on.
Now, in our case, they do have the same index, so if we merge the two DataFrames on the indices—so
hw_exam_grades—on the left one, we’ve got the
left_index and this is a flag variable and the default is
False. We want to use that as the key for the left one, so we’ll pass in
True. And then same thing for the right DataFrame, we want to use the index as the key.
This will essentially take the rows of each of the two DataFrames and match them based on the same index value, and then simply concatenate and create a new DataFrame. All right, so let’s call this, say,
final_df like we had up there. And, you know, we should probably do this, right?
There, we’ve got
Email Address—that came from the roster and so did the
Section, and the indices are the same as they were. We just matched the two indices with the left and the right, and now we have all the data that came to us from the
hw_exam_grades DataFrame: the first name and the last name of the student and then all of the homework and all of the exam grades.
there is also a
.join() method that we can use on a DataFrame with a DataFrame. So if we went
roster.join() and we want to join with the
hw_exam_grades, and then we don’t need any of this because that’s for the merge. All right, so this accomplishes the same thing.
quiz_grades index, email address—we know that in the final DataFrame that we’re just constructing now, we have a column that has all the email addresses, so that’s going to be our key to merge these two.
It’s going to be that
final_df and we’re going to be merging the
quiz_grades, and for the left DataFrame, which is the
final_df, that’s going to be the key. So
left_on, we’re passing in what column to use as the key for the left DataFrame, the
final_df, it should be the
"Email Address", right?
Let me come back up here. That is the name,
"Email Address". And then for the right one, it’s the index. We want the key for the right one to be the right index, so we’ll pass that in as a
True value for that flag.
07:12 This will then contain, in addition to the homework, the exams and the quizzes. All right, so now we’ve got all of the assessments for each student all in one DataFrame, but you may have noticed that we do have some missing values.
The default behavior there for pandas is to put in the
NaN value, so Not a Number. It could be that this particular student did not, say, submit the Homework 1, or for whatever reason just did not get a grade for that homework.
07:46 Most likely, they didn’t submit the homework. What we’ll need to do is, because we’re going to be doing numerical computations using these columns, we should set a value, a numerical value, for that.
All right, so let’s run that, and now let’s see the first five rows again. That value that was not a number is now
0, and so now we are, you know, making sure that we’ve got all numerical values and we’re ready to start calculating grades.
Become a Member to join the conversation.