Join us and get access to hundreds of tutorials and a community of expert Pythonistas.

Unlock This Lesson

This lesson is for members only. Join us and get access to hundreds of tutorials and a community of expert Pythonistas.

Unlock This Lesson

Hint: You can adjust the default video playback speed in your account settings.
Hint: You can set the default subtitles language in your account settings.
Sorry! Looks like there’s an issue with video playback 🙁 This might be due to a temporary outage or because of a configuration issue with your browser. Please see our video player troubleshooting guide to resolve the issue.

Merging the Grade DataFrames

00:00 To start with the merging, we’re going to work with the roster DataFrame and the homework exam grades DataFrame. Both of these, their indices—for the roster, it’s NetID and for the homework exam grades, it’s SID—they point to the same data, so it’s simply two different names for the same data that they’re storing, the same strings. So this NetID or SID, it’s going to uniquely identify each student, and so we’re going to be merging these two DataFrames on the index.

00:35 Let’s go ahead and do that. Maybe what we’ll do is, let’s call it maybe final_dataframe or final_data, or maybe we’ll call it, say, final_df.

00:48 The built-in merge() function in pandas is the function that does the merging. There’s other ways to do it and we’ll talk about that in a second.

00:58 We want to merge the roster and we want to merge the hw_exam_grades to create a new DataFrame. Now, the default behavior of merge() is to find a common column on both of the two DataFrames that you want to merge and use the column as the key to line up the rows in each of the DataFrames to do the merging.

01:22 So essentially, it’s going to be a concatenation of the two DataFrames to create a new DataFrame. Now, both of these DataFrames—they don’t have a common column. If you recall, the roster only has the email address and the section of the student, whereas the hw_exam_grades has the first name, last name, and then all of the homework scores and all of the exam scores.

01:44 So when I run this, we’re going to get an error, but I just want you to see what happens.

01:49 The problem is, like we said—if I scroll down here the bottom—there’s no common column to perform the merge() on, right? These DataFrames don’t have a common column, so we need to specify the key that should be used to line up or to match up the data from the two DataFrames. Now, if they have more than one common column, then we need to specify either a left_on column name for the left DataFrame—so that would be the first one in the arguments to the merge() function—and then the right column would be the second one that we’re passing into the merge() function and then we would specify what column.

02:30 So the idea there is that in the left DataFrame, we have some column that matches up with the right DataFrame on some other column, and they don’t have to even be the same name, but just point to the same data so that we can do the matching.

02:44 We can also use the indices. We can specify that we want the keys to be either left index—so the index of the left DataFrame—or the right index of the right DataFrame. And we can mix and match, so for example, we can on the left column and say the right index and so on.

03:04 Now, in our case, they do have the same index, so if we merge the two DataFrames on the indices—so hw_exam_gradeson the left one, we’ve got the left_index and this is a flag variable and the default is False. We want to use that as the key for the left one, so we’ll pass in True. And then same thing for the right DataFrame, we want to use the index as the key.

03:35 This will essentially take the rows of each of the two DataFrames and match them based on the same index value, and then simply concatenate and create a new DataFrame. All right, so let’s call this, say, final_df like we had up there. And, you know, we should probably do this, right?

03:56 Let’s get rid of this. And let’s just go back up here and do it right up here. We’ll say left_index=True and right_index=True. All right.

04:11 That ran nicely. Let’s take a look at the first five rows.

04:16 There, we’ve got Email Address—that came from the roster and so did the Section, and the indices are the same as they were. We just matched the two indices with the left and the right, and now we have all the data that came to us from the hw_exam_grades DataFrame: the first name and the last name of the student and then all of the homework and all of the exam grades.

04:40 Because it’s pretty common to match on the indices,

04:44 there is also a .join() method that we can use on a DataFrame with a DataFrame. So if we went roster.join() and we want to join with the hw_exam_grades, and then we don’t need any of this because that’s for the merge. All right, so this accomplishes the same thing.

05:03 We are joining to the roster DataFrame the hw_exam_grades, and the default behavior here is to do it on the indices, which they do have the same.

05:13 Let’s run that and let’s run this again, and we get the exact same thing. Okay, so that is the merge for the roster with the hw_exam_grades.

05:26 Now let’s take this final_df, or final DataFrame, and let’s also now merge the quiz grades.

05:35 Let’s quickly just remind ourselves the quiz_grades DataFrame. The columns are just simply the five quiz grades and the index is the email address, right?

05:51 So the quiz_grades index, email address—we know that in the final DataFrame that we’re just constructing now, we have a column that has all the email addresses, so that’s going to be our key to merge these two.

06:05 We’ll use the merge() function.

06:09 It’s going to be that final_df and we’re going to be merging the quiz_grades, and for the left DataFrame, which is the final_df, that’s going to be the key. So left_on, we’re passing in what column to use as the key for the left DataFrame, the final_df, it should be the "Email Address", right?

06:34 Let me come back up here. That is the name, "Email Address". And then for the right one, it’s the index. We want the key for the right one to be the right index, so we’ll pass that in as a True value for that flag.

06:53 And that is it! So we’ll rename, or we’ll reconstruct, this final_df DataFrame and we’ll run that, and that looks good. And then let’s just take a look at our final DataFrame.

07:09 Okay. All right, and so there we go.

07:12 This will then contain, in addition to the homework, the exams and the quizzes. All right, so now we’ve got all of the assessments for each student all in one DataFrame, but you may have noticed that we do have some missing values.

07:29 The default behavior there for pandas is to put in the NaN value, so Not a Number. It could be that this particular student did not, say, submit the Homework 1, or for whatever reason just did not get a grade for that homework.

07:46 Most likely, they didn’t submit the homework. What we’ll need to do is, because we’re going to be doing numerical computations using these columns, we should set a value, a numerical value, for that.

07:57 We’ll set a 0 to all of the missing values in the DataFrame. To do that, we’re going to redefine a final DataFrame and then the running DataFrame called the .fillna() method.

08:14 This will fill any of the NaN values, or Not a Number values, with a specified value. In our case, we’d simply want to put a 0.

08:25 All right, so let’s run that, and now let’s see the first five rows again. That value that was not a number is now 0, and so now we are, you know, making sure that we’ve got all numerical values and we’re ready to start calculating grades.

taylorchris121 on July 14, 2021

Hi

Everything was working fine upto here

final_df = pd.merge(roster, hw_exam_grades, left_index=True, right_index=True)

Traceback (most recent call last)
/var/folders/3m/gxnzlxlj6vx_yzyfk3y36lkh0000gn/T/ipykernel_9194/622835086.py in <module>
----> 1 final_df = pd.merge(roster, hw_exam_grades, left_index=True, right_index=True)

NameError: name 'roster' is not defined

Martin Breuss RP Team on July 15, 2021

Hi @taylorchris121, you’ll get that error if you haven’t defined the roster variable. Check back to the earlier lesson on Loading the Exam and Homework Data, where Cesar shows you how to load the content of the roster.csv file into a DataFrame called roster.

The link takes you right to the point in the video where he’s assigning the variable you’re missing here. Hope that helps!

Become a Member to join the conversation.