Loading the Exam and Homework Data
All right, we’re all set to load the data. We’re going to be using the Path class from the pathlib module, which makes it easy to create paths to the different files that we need to load.
If you’re not familiar with the Path class, this should be a good introduction to it. You’ll see that it’s pretty easy to load files by creating Path objects that point to them.
We’ve got the Path class that we’re going to be using and then, of course, we’re going to be using pandas, so let’s load that in. We’re also going to be using the NumPy module.
We’re not going to be using NumPy right away, but we might as well load it now. All right, so we’ve got those few things loaded. From the Path class, we can get a Path object to the current working directory. Remember that you should structure your project so that you’ve got a data/ folder containing all of the CSV files.
My current working directory, where I have my Jupyter Notebook, is the project directory. I’m going to create this variable, CURRENT_DIR (current directory), that contains the Path object to my working project directory, and then DATA_DIR (data directory), which points to the data/ folder that contains all of our CSV files.
All right, so these are two Path objects that you can treat almost like strings. We’ve also got operators, for instance the slash (/) operator, which joins path segments into a new path, and we’ll see how we use that later. All right, so we’ve got these Path objects, and we’re all ready to load the data.
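The setup described so far can be sketched roughly as follows. The transcript doesn't show the exact calls, so Path.cwd() and the "data" folder name are assumptions based on the narration. The pandas and NumPy imports are left out so that the snippet runs with the standard library alone.

```python
from pathlib import Path

# Assumption: the notebook runs from the project directory,
# so the current working directory is the project root.
CURRENT_DIR = Path.cwd()

# Assumption: the CSV files live in a "data" subfolder of the project.
DATA_DIR = CURRENT_DIR / "data"

# The / operator joins path segments into a new Path object.
roster_path = DATA_DIR / "roster.csv"

print(roster_path.name)                # the file name component of the path
print(roster_path.parent == DATA_DIR)  # the parent is the data directory
```

Building the path this way doesn't touch the file system, so it works even before the data/ folder exists.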
01:43 I’m going to create a little bit of Markdown here to mark the spot where I’m loading in the roster. To change a cell from a code cell to a Markdown cell, you press Escape and then the M key, type in your Markdown, and then run it like a regular cell. If you hit Shift + Enter, the cell runs and you get a new code cell.
Let’s go ahead and run that. Now, we know that the pandas module contains a read_csv() function. Let’s go ahead and read the roster file.
We’ve got the Path object, and we simply need to add "roster.csv" to the path so that it points to the file that we want to read in.
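As a minimal sketch of this step, the snippet below writes a tiny stand-in roster.csv into a temporary folder and reads it back through a Path object. The sample rows and the column order are made up for illustration; only the column names come from the lesson.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample rows; the real roster.csv is not shown in the transcript.
sample = (
    "ID,Name,NetID,Email Address,Section\n"
    "1,Ann Lee,ABC123,ANN.LEE@UNIV.EDU,1\n"
    "2,Bo Kim,XYZ789,BO.KIM@UNIV.EDU,2\n"
)

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)  # stands in for DATA_DIR
    (data_dir / "roster.csv").write_text(sample)

    # read_csv() accepts a Path object directly, no str() conversion needed.
    roster = pd.read_csv(data_dir / "roster.csv")

print(roster)
```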
02:30 So if I run this, we get our roster DataFrame. What we’re going to do now is start normalizing some of the data.
If you recall, the NetID and the SID refer to the same data. Here the NetID is all uppercase, whereas the SID was all lowercase. The email addresses are also all uppercase in this file, so we’ll want to normalize those to lowercase as well.
The ID field is some sort of internal ID that we’re not really going to use. It might be a database ID or some other ID that won’t be useful for identifying a student, so we don’t have to worry about that ID field.
03:19 Then we saw that we’ve got the different values of a section for the students.
So, why don’t we go ahead and do some conversion for some of the fields that are coming in. To the converters keyword argument of the read_csv() function, we can pass a dictionary whose keys are the names of the columns that we want to change as we read them in.
What we want to do is call the .lower() method from the str (string) class to convert all of the NetIDs to lowercase letters.
Then we want to do the exact same thing for the "Email Address" field, so let’s go ahead and do that, and this will be all lowercase.
Then another thing that we want to do: we don’t care much about that ID column, so let’s be explicit here about which columns to use when we read in the CSV file.
We certainly want the "Section" and the NetID. For the moment, we’re not going to worry about the Name. We’ll get the Name later on when we read in the different quizzes or the different homework files. Then, lastly, we want to define the index of this DataFrame.
We want it to be the "NetID". So the final DataFrame that we construct is going to have the NetID as its row index. This is the unique identifier that we’re going to use to identify each student. All right, so that’s going to be the roster DataFrame. All right, let’s run that.
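Put together, the roster call might look roughly like the sketch below. The sample rows are invented, and including "Email Address" in usecols is an assumption made so that the lowercase converter has a column to act on.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real roster.csv is not shown in the transcript.
sample = io.StringIO(
    "ID,Name,NetID,Email Address,Section\n"
    "1,Ann Lee,ABC123,ANN.LEE@UNIV.EDU,1\n"
    "2,Bo Kim,XYZ789,BO.KIM@UNIV.EDU,2\n"
)

roster = pd.read_csv(
    sample,
    # Each converter receives the raw cell text and returns the stored value.
    converters={"NetID": str.lower, "Email Address": str.lower},
    # Only these columns are read; ID and Name are skipped.
    usecols=["Section", "Email Address", "NetID"],
    # The lowercased NetID becomes the row index.
    index_col="NetID",
)

print(roster.head(10))
```

Passing str.lower directly works because a converter only needs to be a callable that takes the cell's string and returns the converted value.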
05:01 And let’s just take a look at the first, say, 10 student names.
05:09 All right, so that looks pretty good. Let’s now load the homework assignments. I’m going to go ahead and put a little bit of Markdown there. We’re going to go ahead and load the homework and the exam grades file.
05:27 This is going to be similar to the roster file. We’re going to use the DATA_DIR object again and read in the homework and exam grades file.
Now, if you remember, the only thing that we were concerned about here was that the SID was in lowercase. Although we know that all of those SIDs were in lowercase, it’s good practice to guarantee that everything is converted to lowercase, because that’s what we’re doing with the NetID. So we’re going to go ahead and do that. Unless we inspect the file row by row manually, we’re not guaranteed that these all come in lowercase, so there’s no harm in making that conversion now.
If you remember the homework and exam grades file, it contained the timestamps of when the homework or the exams were submitted. This isn’t really important to us, so what we can do with the usecols (use columns) keyword argument is pass in a callable, and it’s going to keep any of the columns where the callable returns True.
The columns that we want to omit are the columns that contain the submission timestamp. So what I’m going to do is use a lambda function that only returns True if the word "Submission" is not in the field title. Okay, so just to make it clear, what this is doing is we’ve got title, the field title, and as long as that title does not contain the word "Submission", that’s a column that we want to keep. Again, the timestamp when these homework assignments or exams were submitted is really not all that important. As before, what we want is the index to be the "SID", which was the same data as the NetID.
All right, so go ahead and run that. Let’s call this hw_exam_grades (homework exam grades),
07:47 and let’s take a look at the first few or so
07:51 just to make sure. All right, so this is the largest DataFrame that we’re going to construct directly from one of the files. We’ll end up with a much larger DataFrame later, but this is the largest file. We’ve got all 10 homework assignments, then we’ve got the exams with their max points and so on, and the same thing for exam number 2.
08:16 And we got rid of all of the submission columns because we’re not concerned about when those different assessments were submitted. Okay. So we’ve got the grades for the homework and then the very last one that we want to load in are the quizzes.
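The homework and exam grades step can be sketched like this. The column titles and rows below are hypothetical stand-ins for the real file, which the transcript doesn't show. What matters is that some titles contain the word "Submission" and the callable passed to usecols filters those out.

```python
import io

import pandas as pd

# Hypothetical column titles and rows, made up for illustration.
sample = io.StringIO(
    "SID,Homework 1,Homework 1 - Max Points,Homework 1 - Submission Time,"
    "Exam 1,Exam 1 - Max Points,Exam 1 - Submission Time\n"
    "ABC123,55,80,2019-08-29 08:56:02,86,100,2019-10-11 12:00:00\n"
    "xyz789,72,80,2019-08-29 09:10:44,91,100,2019-10-11 12:05:30\n"
)

hw_exam_grades = pd.read_csv(
    sample,
    # Guarantee lowercase SIDs so they line up with the lowercased NetIDs.
    converters={"SID": str.lower},
    # The callable is called with each column title; True means "keep it".
    usecols=lambda title: "Submission" not in title,
    index_col="SID",
)

print(hw_exam_grades.head())
```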
Hi @opabrown, you can download the CSV data from the code opt-in that you can find in the linked article in Supporting Material.
Here’s a link to the GitHub repository of the tutorial, and specifically the
data/ directory that contains all the CSV files.
For some reason I cannot set
KeyError "NetID" even though calling
df['NetID'] works just fine.
Wow, what a great class! Thank you for teaching how to use Path, it has really made my life easier! Do you have any projects like this one that you’d recommend? I am interested in data analysis! Thank you once again!
opabrown on Aug. 21, 2021
How do I get access to the input CSV files?