Loading the Exam and Homework Data

00:00 All right, we’re all set to load the data. We’re going to be using the Path class from the pathlib module, and this is just going to make it easy for us to create paths to the different files that we need to load up.

00:14 If you’re not familiar with the Path class, maybe this’ll be a good introduction to its use and you’ll see that it’s pretty easy just to use it to load files by creating Path objects to them.

00:25 We’ve got the Path class that we’re going to be using and then, of course, we’re going to be using pandas, so let’s load that in. We’re also going to be using the NumPy module.

00:36 We’re not going to be using NumPy right away but we might as well load it now. All right, so we’ve got those few things loaded. So from the Path class, we can get a Path object.

00:48 This will be creating a Path object to the current working directory. Remember that you should have the structure to your projects so that you’ve got a data/ folder containing all of the CSV files.

01:01 My current working directory where I have my Jupyter Notebook is contained in the project directory. I’m going to create this variable, CURRENT_DIR (current directory), that contains this Path object to my working project directory, and then the DATA_DIR (data directory), which contains all of our CSV files. That’s simply the data/ folder.

01:25 All right, so these are two Path objects that you can sort of almost treat like strings. And we’ve got these operators, like, for instance, this slash (/) operator that will create a path, and we’ll see how we use that later. All right, so we’ve got these Path objects, we’re all ready to load the data.
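The two Path objects described above can be sketched as follows. This is a minimal sketch, assuming `Path.cwd()` is how the current working directory is obtained; the `roster_path` name is only there to illustrate the slash operator and is not from the lesson.

```python
from pathlib import Path

# Path object pointing at the current working directory
# (assumption: the notebook lives in the project directory).
CURRENT_DIR = Path.cwd()

# The slash operator joins path segments and returns a new Path object.
DATA_DIR = CURRENT_DIR / "data"

# Hypothetical name, shown only to illustrate chaining the operator.
roster_path = DATA_DIR / "roster.csv"

print(roster_path.name)  # file name portion of the path
```

Because the result of `/` is itself a Path object, it can be passed straight to functions that accept paths, with no string concatenation needed.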

01:43 I’m going to create just a little bit of Markdown here to tell me that this is going to be the spot where I’m loading in the roster. Now to change a cell from a code cell to a Markdown cell, you just press Escape and then hit the M button and then type in your Markdown and then just run it like a regular cell. You can hit Shift + Enter and then you’ll get a new cell—that creates a new code cell.

02:08 Let’s go ahead and run that. Now, we know that the pandas module contains a read_csv() function. Let’s go ahead and read the roster file.

02:18 We’ve got the DATA_DIR Path object and we just simply need to add in the "roster.csv" to the path so we can point to the file that we want to read in.

02:30 So if I run this, we get our roster DataFrame. This creates a DataFrame. And what we’re going to do is we’re going to start normalizing some of the data.
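As a rough illustration of that first read, the sketch below builds a throwaway `data/` folder in a temporary directory so it can run anywhere. The single row of roster data is made up, and the column layout is an assumption based on the fields mentioned in this lesson.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the project's data/ folder (temporary, for illustration).
    DATA_DIR = Path(tmp) / "data"
    DATA_DIR.mkdir()

    # Made-up sample row; the real roster.csv has many students.
    (DATA_DIR / "roster.csv").write_text(
        "ID,Name,NetID,Email Address,Section\n"
        '1234567,"Doe, Jane",JXD12345,JANE.DOE@UNIV.EDU,1\n'
    )

    # read_csv() accepts a Path object directly.
    roster = pd.read_csv(DATA_DIR / "roster.csv")

print(roster.columns.tolist())
```

With no extra arguments, every column is loaded as-is, including the uppercase NetID and email values that get normalized next.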

02:41 So if you recall, the NetID and the SID point to the same data. Here it’s all uppercase for the NetID, whereas the SID was all lowercase. And then also the email address—in this case, they’re all uppercase. We’ll also want to sort of normalize it to, say, lowercase.

03:00 And this ID field is some sort of internal ID that we’re not really going to use. It might be a database ID or just some other ID that really won’t necessarily be useful for us to identify a student, so we won’t have to worry about that ID field.

03:19 Then we saw that we’ve got the different values of a section for the students.

03:25 So, why don’t we go ahead and do some conversion for some of the fields that are coming in. We can pass to the converters keyword argument in the read_csv() function, a dictionary that contains the keys, which are going to be the column names that we want to affect, or that we want to change as we read in.

03:49 And what we want to do is just call the .lower() method from the str (string) class to convert all of the NetIDs to lowercase letters.

03:58 Then we want to do the exact same thing for the "Email Address" field, so let’s go ahead and do that, and this will be all lowercase.

04:07 Then another thing that we want to do is we don’t care much about that ID column, so let’s be explicit here about what columns to use when we read in the CSV file.

04:18 We certainly want the "Section" and the "Email Address"

04:24 and the NetID. For the moment, we’re not going to worry about the Name. We’ll get the Name later on when we read in the different quizzes or the different homework files. Then, lastly, we want to define the index of this DataFrame.

04:39 We want it to be the "NetID". And so the final DataFrame that we construct is going to have, as the row index, the NetID. This is going to be the unique identifier that we’re going to be using to uniquely determine each student. All right, so that’s going to be the roster DataFrame. All right, let’s run that.
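Putting the three keyword arguments together, the roster read might look like the sketch below. It reads from an in-memory string rather than the real file so that it runs on its own; the sample rows are invented, and the column layout is an assumption based on the fields named in this lesson.

```python
import io

import pandas as pd

# Invented sample data standing in for data/roster.csv.
sample_csv = io.StringIO(
    "ID,Name,NetID,Email Address,Section\n"
    '1234567,"Doe, Jane",JXD12345,JANE.DOE@UNIV.EDU,1\n'
    '2345678,"Roe, Richard",RXR23456,RICHARD.ROE@UNIV.EDU,2\n'
)

roster = pd.read_csv(
    sample_csv,
    # Lowercase these two columns as they are read in.
    converters={"NetID": str.lower, "Email Address": str.lower},
    # Keep only the columns we care about; ID and Name are dropped.
    usecols=["Section", "Email Address", "NetID"],
    # Use the NetID as the row index.
    index_col="NetID",
)

print(roster.head(10))
```

In the notebook, the first argument would be the path built from `DATA_DIR` instead of the in-memory sample.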

05:01 And let’s just take a look at the first, say, 10 student names.

05:09 All right, so that looks pretty good. Let’s now load the homework assignments. I’m going to go ahead and put a little bit of Markdown there. We’re going to go ahead and load the homework and the exam grades file.

05:27 This is going to be similar to the roster file. We’re going to read in

05:35 using the DATA_DIR object and we’re going to go ahead and read in the hw_exam_grades.csv file.

05:46 Now, if you remember, the only thing that we were concerned about here was that the SID was in lowercase, and although we know that all of those SIDs were in lowercase, it’s good practice just to make sure that you are guaranteed that everything is going to be converted to lowercase because that’s what we’re doing with the NetID. So we’re going to go ahead and do that. You know, unless we go ahead and view each row by row visually and manually, we’re not guaranteed that these all come in lowercase, so there’s no harm in going ahead and making that conversion now. Now, if you remember the homework exam grades file, it contained the timestamps of when either the homework or the exams were submitted. This isn’t really important to us, and so what we can do with the usecols (use columns) keyword argument is that we can pass in a callable and it’s going to keep any of the field names, or any of the columns, where the callable returns True.

06:49 The columns that we want to omit are the columns that contain the submission timestamp. And so what I’m going to do is use a lambda function that will only return True if in the field title the word "Submission" is not in the title. Okay, so just to make it clear, what this is doing is we’ve got title, right, the field title, and as long as that field title does not contain the word "Submission", those are the columns that we want to keep. Again, the timestamp when these homework assignments or exams were submitted is really not all that important. As before, what we want is the index to be the "SID", which was the same data as the NetID.
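The callable form of usecols can be sketched like this. The sample is an in-memory stand-in for hw_exam_grades.csv with made-up rows, and the exact column titles are assumptions; only the "SID" column and the word "Submission" come from the lesson.

```python
import io

import pandas as pd

# Invented sample data; the column titles are assumed for illustration.
sample_csv = io.StringIO(
    "SID,Homework 1,Homework 1 - Max Points,Homework 1 - Submission Time\n"
    "JXD12345,55,80,2019-08-29 08:56:02\n"
    "RXR23456,62,80,2019-08-29 09:10:45\n"
)

hw_exam_grades = pd.read_csv(
    sample_csv,
    # Guarantee lowercase SIDs so they match the lowercased NetIDs.
    converters={"SID": str.lower},
    # Called once per column title; a column is kept when this returns True.
    usecols=lambda title: "Submission" not in title,
    # Use the SID as the row index.
    index_col="SID",
)

print(hw_exam_grades.columns.tolist())
```

The lambda receives each column title as a string, so any title containing "Submission" is skipped before the data is parsed.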

07:39 All right, so go ahead and run that. Let’s call this the hw_exam_grades (homework exam grades),

07:47 and let’s take a look at the first few or so

07:51 just to make sure. All right, so this is, again, the largest DataFrame that we’re going to be constructing from the files. We’re going to have a much larger DataFrame, but this is the largest file. And so, again, we’ve got all these 10 different homework assignments, then we’ve got the exams and they have, again, max points and so on, and same thing for exam number 2.

08:16 And we got rid of all of the submission columns because we’re not concerned about when those different assessments were submitted. Okay. So we’ve got the grades for the homework and then the very last one that we want to load in are the quizzes.

opabrown on Aug. 21, 2021

How do I get access to the input CSV files?

Martin Breuss RP Team on Aug. 23, 2021

Hi @opabrown, you can download the CSV data from the code opt-in that you can find in the linked article in Supporting Material.

Here’s a link to the GitHub repository of the tutorial, and specifically the data/ directory that contains all the CSV files.

Valerii on Dec. 12, 2021


For some reason I cannot set index_col to "NetID". Getting KeyError "NetID" even though calling df['NetID'] works just fine.

Felipe Sebben on Jan. 25, 2022

Wow, what a great class! Thank you for teaching how to use Path, it has really made my life easier! Do you have any projects like this one that you’d recommend? I am interested in data analysis! Thank you once again!
