Loading the Exam and Homework Data
All right, we’re all set to load the data. We’re going to be using the Path class from the pathlib module, which just makes it easy for us to create paths to the different files that we need to load. This creates a Path object for the current working directory. Remember that you should structure your project so that you’ve got a data/ folder containing all of the CSV files.
My current working directory, where I have my Jupyter Notebook, is the project directory. I’m going to create this variable, CURRENT_DIR (current directory), that contains this Path object to my working project directory, and then DATA_DIR (data directory), which points to the data/ folder that contains all of our CSV files.
All right, so these are two Path objects that you can treat almost like strings. They also support operators, like, for instance, the slash (/) operator, which joins path segments to create a new path, and we’ll see how we use that later. All right, so we’ve got these Path objects, and we’re all ready to load the data.
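As a rough sketch of the two variables described above: this assumes the working directory is obtained with Path.cwd(), and the file name roster.csv is hypothetical, used only to show a second join with the slash operator.

```python
from pathlib import Path

# Path object for the current working directory (assumed here to be the project directory).
CURRENT_DIR = Path.cwd()

# The slash operator joins path segments; "data" is the folder holding the CSV files.
DATA_DIR = CURRENT_DIR / "data"

# Hypothetical file name, used only to illustrate joining one more segment.
roster_path = DATA_DIR / "roster.csv"

print(DATA_DIR)
print(roster_path.name)
```

Nothing here touches the disk: the paths are built whether or not the folder exists, so a mistyped folder name only shows up when a file is actually read.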
01:43 I’m going to create a little bit of Markdown here to tell me that this is the spot where I’m loading in the roster. To change a cell from a code cell to a Markdown cell, press Escape and then hit the M key, type in your Markdown, and run it like a regular cell. If you hit Shift + Enter, the cell runs and you get a new code cell below it.
So if you recall, the NetID and the SID point to the same data. Here the NetID is all uppercase, whereas the SID was all lowercase. The email addresses are also all uppercase in this case, so we’ll want to normalize those to, say, lowercase as well. The ID field is some sort of internal ID that we’re not really going to use. It might be a database ID or some other ID that won’t be useful for identifying a student, so we won’t have to worry about that ID field.
So, why don’t we go ahead and do some conversion for some of the fields that are coming in. We can pass to the converters keyword argument of the read_csv() function a dictionary whose keys are the names of the columns that we want to change as we read them in, and whose values are the functions that do the conversion.
NetID. For the moment, we’re not going to worry about the
Name. We’ll get the
Name later on when we read in the different quizzes or the different homework files. Then, lastly, we want to define the index of this DataFrame.
We want it to be the "NetID". And so the final DataFrame that we construct is going to have the NetID as the row index. This is the unique identifier that we’re going to be using for each student. All right, so that’s going to be the roster DataFrame. All right, let’s run that.
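Here is a minimal sketch of a roster load along those lines. The rows are made up and read from an in-memory string rather than from the real file, and the column names "Email Address" and "Section", as well as the exact list of columns kept, are assumptions.

```python
import io

import pandas as pd

# Made-up roster rows for illustration only; the real file is not reproduced here.
roster_csv = io.StringIO(
    "ID,Name,NetID,Email Address,Section\n"
    "1001,Jane Doe,JXD12345,JANE.DOE@UNIV.EDU,1\n"
    "1002,Rick Roe,RXR67890,RICK.ROE@UNIV.EDU,2\n"
)

roster = pd.read_csv(
    roster_csv,
    # Lowercase these two columns as they are read in.
    converters={"NetID": str.lower, "Email Address": str.lower},
    # Assumed selection: leave out the internal ID and the Name.
    usecols=["Section", "Email Address", "NetID"],
    # Use the (now lowercase) NetID as the row index.
    index_col="NetID",
)

print(roster)
```

Each converter is called on the raw string value of every row in its column, so a plain function like str.lower is enough to normalize the text during the read.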
05:09 All right, so that looks pretty good. Let’s now load the homework assignments. I’m going to go ahead and put a little bit of Markdown there. We’re going to go ahead and load the homework and the exam grades file.
Now, if you remember, the only thing that we were concerned about here was that the SID was in lowercase. Although it looked like all of those SIDs were already in lowercase, it’s good practice to guarantee that everything is converted to lowercase, because that’s what we’re doing with the NetID. So we’re going to go ahead and do that. Unless we check every row visually and manually, we’re not guaranteed that these all come in lowercase, so there’s no harm in making that conversion now. Now, if you remember the homework and exam grades file, it contained the timestamps of when either the homework or the exams were submitted. This isn’t really important to us, and so what we can do with the
usecols (use columns) keyword argument is pass in a callable, and it’s going to keep any of the field names, or any of the columns, where the callable returns True. The columns that we want to omit are the columns that contain the submission timestamp. And so what I’m going to do is use a lambda function that returns True only if the word "Submission" is not in the field title. Okay, so just to make it clear: as long as the field title does not contain the word "Submission", that’s a column that we want to keep. Again, the timestamp when these homework assignments or exams were submitted is really not all that important. As before, what we want is the index to be the "SID", which was the same data as the NetID.
07:51 just to make sure. All right, so this is, again, the largest file that we’re going to be loading. We’re going to end up with a much larger DataFrame, but this is the largest file. And so, again, we’ve got all 10 homework assignments, then we’ve got the exams, and they have, again, max points and so on, and the same thing for exam number 2.
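A sketch of the homework and exam load described above might look like this. The rows are made up and read from an in-memory string, and the column names other than SID are assumptions chosen so that the timestamp columns contain the word "Submission".

```python
import io

import pandas as pd

# Made-up grades rows for illustration only; column names besides SID are assumed.
grades_csv = io.StringIO(
    "SID,Homework 1,Homework 1 - Max Points,Homework 1 - Submission Time,"
    "Exam 1,Exam 1 - Max Points,Exam 1 - Submission Time\n"
    "jxd12345,55,80,2019-08-29 08:56:02,86,100,2019-10-11 09:15:00\n"
    "RXR67890,62,80,2019-08-29 09:10:41,91,100,2019-10-11 09:20:13\n"
)

hw_exam_grades = pd.read_csv(
    grades_csv,
    # Guard against any SID that is not already lowercase.
    converters={"SID": str.lower},
    # Keep a column only when its title does not contain "Submission".
    usecols=lambda title: "Submission" not in title,
    # Index by SID so the rows line up with the NetID index of the roster.
    index_col="SID",
)

print(hw_exam_grades.columns.tolist())
```

The callable receives each column title as a string and is evaluated once per column, so the timestamp columns are never parsed at all.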
08:16 And we got rid of all of the submission columns because we’re not concerned about when those different assessments were submitted. Okay. So we’ve got the grades for the homework and the exams, and the very last files that we want to load in are the quizzes.