Loading the Quiz Files

Using pandas to Make a Gradebook in Python Cesar Aguilar 08:44

Transcript
Discussion (4)

00:00 All right, so now we’re ready to load up all of the quiz files. Let’s go ahead and put a little bit of mark up here and we’ll just say Load quiz grades. And if you remember, these files were called quiz_ and then the number of the quiz and then grades, and then we’ve got a CSV file, and there were five of these quizzes, so five of these files. Okay, so we’re going to load those up. Now, the way we’re going to do that is we’re going to create one DataFrame to contain all of the grades, all of the quiz grades.

00:39 What we’ll do is we’ll need to loop over each of the quiz files, load them up as DataFrames, and then simply concatenate them to a running DataFrame that we’re going to create as the final one that contains all the grades.

00:53 Let’s create an empty DataFrame,

00:59 and we’ll call it quiz_grades. Now, the DATA_DIR (data directory),

01:06 this is a Path object and it’s a directory, and so it contains this, we can use the .glob() method that will match a pattern and that will return a generator. All right, and each of the elements that the generator returns is a path to the file matching the pattern. So in this case, the pattern is "quiz", underscore ("_"), and that wildcard character (*) is what’s changing in the different file names, and they’re all named "_grades.csv".

01:38 So, this will return a generator and we want to loop over the generator. It’s going to be returning a file path that matches the pattern. What do we want to do with this file? Well, we’re going to want to read it in, obviously, get the CSV file.

01:59 And what we want to do is the DataFrame that we’re building here, it’s going to contain five columns for each of the different quiz grade files, and what we want to do is we want to call each of those field names, just either Quiz 1, Quiz 2, Quiz 3, depending on what file we’re reading in.

02:20 Then for the index, or the indices, of the quiz_grades DataFrame, we’re going to be using the email address. If you remember these quiz files, they all contain the email address for the student.

02:33 Here’s what we’ll do. Let’s load up, individually, each of the quiz files. This is similar to what we did before. We’re going to read in the CSV file and this is going to be, of course, the file path, right?

02:51 As we’re looping over the objects returned to us by the .glob() generator, these are just simply Path file_path objects, and so that’s the file_path.

03:02 Then we want to do the converter conversion again that converts the "Email" field, all of the elements in the "Email" field, into lowercase letters.

03:14 We’re going to use the index to be the email address. If you remember in all these quiz files, we had just the first name and the last name of the student, email address, and then just simply the grade.

03:27 We’re going to use the email as the index of this DataFrame.

03:32 And then the only columns that we care about are going to be the email address and the grade of the quiz.

03:41 Then we are going to take each individual quiz and we’re just simply going to concatenate it to this running quiz_grades DataFrame. So we’ll concatenate running quiz_grades DataFrame with the current quiz DataFrame.

04:00 And we want to do this along the second axis. We don’t want to stack these DataFrames, we want to concatenate them along columns. In other words, we’re adding a new series, or a new column, containing the current quiz grade to this quiz_grades DataFrame.

04:21 Let’s go ahead and run that, see what we get, see if this makes sense. Let’s take a look at the first five.

04:31 Okay, so this looks good. The only thing is that, of course, because all of the fields or all of the columns in the individual quiz files were all Grade, then that’s what they come in as, right? When we concatenate them into the quiz_grades DataFrame, they’re all the same name. Now, of course, we know that if these were put in where this one was the first quiz, the second quiz, the third quiz, and the fourth and the fifth—assuming of course that the .glob() object generated them in that order, and there’s no guarantee that that’s going to happen.

05:07 So what we want to do is before we concatenate the current quiz, let’s rename the column grades. We need to pass in a dictionary and the column, or the field, that we want to rename is the "Grade" one.

05:25 What we want here is basically we need to pull out this information from the file_path object, because that’s going to contain the quiz and then the number. And so here, what we’ll do is before we pass that into the .rename() of the columns, let’s create a quiz_name variable.

05:49 We’re going to take this file_path. It’s like a string, but it’s basically a Path object, and this Path object contains a .stem variable.

05:59 This is a string that will contain only the portion of the Path object that’s at the very end without the suffix .csv. All right?

06:09 So essentially, we’re going to be getting "quiz_", the number, and then "_grades", and that’s what’s in the stem. Then we’re going to take that string and we’re just going to capitalize the first letter of the string, which is "quiz", and then we just simply want to split by underscore ("_").

06:27 And we only want, essentially, this number, but we might as well just keep the "quiz" as well. This will create a list. Right now, this will be a list containing the elements "quiz_", number, "_grades".

06:42 And we only want the first two elements, so we’re going to go from the first to the second, so we go all the way up to 2. That’ll be a list, and then all we need to do is join using a blank space (" ") so that we then get just the word "Quiz", capitalized, and then with a space and then the number, right?

07:05 So this is essentially going to give us "Quiz 1", "Quiz 2", "Quiz 3", and so on. That is what we want to rename the "Grade" field when we create this DataFrame, and so here we’ll pass in quiz_name.

07:21 All right, so let’s run this again.

07:25 Oh, we got an error. All right, so it looks like I just forgot to write down that I want to join that list of elements—only the first two—just with a blank space. All right, so let’s run that again.

07:42 Okay, that looks good. And then if I run this again, there we go. We’ve got the field names where the column names are, just whatever quiz we loaded at that time.

07:53 Notice that the .glob() method doesn’t guarantee that there’s any type of sorting being done. It’s just sort of whatever the generator returns to us is the order in which we’re going to be creating these fields,, or these columns, and that’s fine.

08:09 All right, so that loads up all of the quizzes. We’ve got them all in the quiz_grades DataFrame

08:15 and we’ve got all of the homework and exam grades in this DataFrame called hw_exam_grades (homework exam grades). And then lastly, of course, we’ve got the roster, the roster contains just the NetIDs, the email address, and then the section number of the student.

08:33 Now we want to take all of this data and merge it and create one DataFrame that contains all the data. We’ll do that next.

Karel Dongen on June 25, 2021

Why not create a list with dataframes, and then concat them? Something like, as that can lead to a 50% speedup:

import pathlib2 as pl2
ps = pl2.Path('data/sp3')
dfs = (
    pd.read_csv(p, encoding='utf8') for p in ps.glob('*.csv')
)
res = pd.concat(dfs)

Martin Breuss RP Team on June 28, 2021

@Karel Dongen, that’s correct. If you need to concatenate a lot of dataframes, then using the comprehension you’re showing to first collect all the dataframes, and then .concat() them in one go will be quicker.

Here’s the official note from the pandas documentation on concatenating objects:

It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

You can find this note towards the bottom of the linked section in the docs.

Felipe Sebben on Jan. 25, 2022

I’m having trouble concatenating the .csv files of the quizzes. The codes seem to be fine, but it seems that pandas only concatenates quiz 1 (and names it as “Quiz 5):

Email Quiz 5 rachel.dennis@univ.edu 11 courtney.wolf@univ.edu 9 aaron.lester@univ.edu 8 ashley.martin@univ.edu 11 jessica.garza@univ.edu 10

Felipe Sebben on Jan. 25, 2022

Sorry about last comment. Here it goes:

Email                   Quiz 5  
rachel.dennis@univ.edu  11
courtney.wolf@univ.edu  9
aaron.lester@univ.edu   8
ashley.martin@univ.edu  11
jessica.garza@univ.edu  10

Become a Member to join the conversation.