Exploring the Data
00:00 All right. Let’s explore the data a little bit. We’ll do a lot more exploring in more detail when we actually start coding, but it’s a good idea just to get a general overview of what the data is like.
Here’s a suggested project folder structure that you may adopt. You can create some sort of root folder. You can give it a name, something like gradebook_project, and then we’re going to have a main script file. Because we’re going to be using Jupyter, we’re going to have a
That folder contains all of our CSV files. There’s a roster.csv file, then a file that contains the homework and exam grades of all the students, and then five quiz files that contain the grades for the quizzes. The quiz files are all structured the exact same way.
Then we’ve got an identifier. This is a NetID. It’s all uppercase. Then we’ve got an email address for the students, and again, these are all uppercase. And then, because this is a large university course, we’ve got 150 students, and what is usually done in a university course is that we subdivide the course into sections, just to make them more manageable for finding rooms in the university and things like that and also just for grading. All right, so this is the
Here we’ve got a column with a different name that refers to the same data as the NetID column in the roster.csv file, so that’ll obviously be an important thing to keep in mind when we go ahead and start merging the data.
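Here’s a minimal sketch of that kind of merge in plain Python, just to make the idea concrete. Only the key column names come from the files (NetID in the roster, and SID, which is what the grades file calls the same identifier). The sample rows, the Section and Exam 1 values, and the merge_on_id helper are all made up for illustration, and the sketch assumes the identifier values already match exactly.

```python
import csv
import io

# Made-up sample data. Only the key column names (NetID, SID) come from the
# files described above; everything else is a placeholder.
ROSTER_CSV = """NetID,Section
ABC123,1
XYZ789,2
"""

GRADES_CSV = """SID,Exam 1
XYZ789,81
ABC123,92
"""


def merge_on_id(roster_rows, grade_rows):
    """Attach each student's grade columns to their roster row."""
    # Index the grades by SID so the row order of either file doesn't matter.
    grades_by_id = {row["SID"]: row for row in grade_rows}
    merged = []
    for student in roster_rows:
        combined = dict(student)
        match = grades_by_id.get(student["NetID"], {})
        # Copy every grade column except the duplicate key column.
        combined.update({k: v for k, v in match.items() if k != "SID"})
        merged.append(combined)
    return merged


roster = list(csv.DictReader(io.StringIO(ROSTER_CSV)))
grades = list(csv.DictReader(io.StringIO(GRADES_CSV)))
merged = merge_on_id(roster, grades)
print(merged[0])
```

Looking rows up by key, rather than pairing them by position, also means it doesn’t matter that the two files list the students in a different order.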
02:26 Then we’ve got a sequence of grades and information about the homework assignments and the exam grades. There are ten homework assignments, and for each one we’ve got three fields.
02:42 There’s the actual grade for that particular homework assignment and then the maximum number of points for that homework assignment, so for a given homework assignment, this will be the exact same number for every student.
02:52 So this data here is redundant, but that’s okay. That’s how we get the data. And then we have a column that has a timestamp of when that homework assignment was submitted. And so this pattern of three fields repeats for the other nine homework assignments.
Then we have this exact same pattern repeating for the exams. We’ll have an 'Exam 1' field with the grade for that exam, and then the 'Exam 1 - Max Points' field for the maximum points for exam number 1, and then a timestamp as well, and then we’ll have this for exam number 2 and exam number 3.
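To see how the grade and maximum-points fields might be used together, here’s a small sketch that turns each exam grade into a fraction of its maximum. The 'Exam 1' and 'Exam 1 - Max Points' names come from the file. It’s an assumption that exams 2 and 3 follow the same naming, and the row values and the score_fraction helper are hypothetical.

```python
def score_fraction(row, name):
    """Return the grade divided by the maximum points for one exam."""
    grade = float(row[name])
    max_points = float(row[f"{name} - Max Points"])
    return grade / max_points


# One made-up student row; the timestamp columns are left out for brevity.
row = {
    "Exam 1": "86", "Exam 1 - Max Points": "100",
    "Exam 2": "45", "Exam 2 - Max Points": "50",
    "Exam 3": "30", "Exam 3 - Max Points": "40",
}

# Assumed naming: Exam 1, Exam 2, Exam 3.
exam_names = [f"Exam {n}" for n in range(1, 4)]
exam_scores = {name: score_fraction(row, name) for name in exam_names}
print(exam_scores)
```

Because the field names follow a fixed pattern, one helper can cover every exam, and the same idea would carry over to the homework columns if they’re named the same way.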
Then we’ve got the quiz grades CSV files, and these are all structured in the exact same way. We’ll have a 'Last Name' field, the 'First Name' field, and then email address, and then the actual grade that the students received in that particular quiz.
03:45 The maximum number of points for each particular quiz is not included in the file, and so what we’ll do is when we’re actually coding, we’ll just simply either define a list or a dictionary that contains this information that we can use when we compute the average of the quizzes.
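As a rough idea of what that could look like, here’s a sketch that keeps the maximum points in a dictionary and averages the per-quiz fractions. The point values, the "Quiz 1" style keys, and the quiz_average helper are all hypothetical, and averaging the fractions is just one reasonable way to define the quiz average.

```python
# Hypothetical maximum points; the real values aren't in the quiz files.
QUIZ_MAX_POINTS = {
    "Quiz 1": 10,
    "Quiz 2": 20,
    "Quiz 3": 10,
    "Quiz 4": 20,
    "Quiz 5": 10,
}


def quiz_average(quiz_grades, max_points=QUIZ_MAX_POINTS):
    """Average the fraction of points earned across all quizzes."""
    fractions = [quiz_grades[quiz] / max_points[quiz] for quiz in max_points]
    return sum(fractions) / len(fractions)


# Made-up grades for one student.
student_quizzes = {"Quiz 1": 5, "Quiz 2": 15, "Quiz 3": 10, "Quiz 4": 10, "Quiz 5": 5}
average = quiz_average(student_quizzes)
print(average)
```

A dictionary keeps each maximum next to the quiz it belongs to, which is a little harder to get wrong than a list that depends on position.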
04:03 Then there are four other quiz files, so five in total, and they all have the exact same structure. Now, notice one thing: here in the email address field, all of the emails are in lowercase, whereas in the roster file, all the email addresses were in uppercase.
04:19 So these are just some things that we have to keep in mind when we load the data and we merge the data so that we’re, in essence, normalizing the data in some way so that it all looks the same and we don’t have any risk of either missing any data or duplicating any data.
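One simple way to do that normalizing for the email addresses is to lowercase them before using them as a lookup key. This is only a sketch: the addresses, the grades, and the normalize_email helper are made up.

```python
def normalize_email(email):
    """Trim whitespace and lowercase so both files use the same key."""
    return email.strip().lower()


# Made-up data: uppercase emails as in the roster, lowercase as in a quiz file.
roster_emails = ["JANE.DOE@EXAMPLE.EDU", "JOHN.SMITH2@EXAMPLE.EDU"]
quiz_grades = {"john.smith2@example.edu": 8, "jane.doe@example.edu": 10}

matched = {}
for email in roster_emails:
    key = normalize_email(email)
    # .get() returns None instead of failing if a student has no quiz row.
    matched[key] = quiz_grades.get(key)

print(matched)
```

Without the normalizing step, none of the uppercase roster emails would match a lowercase quiz email, and every student would look like they were missing a grade.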
All right, so let’s recap what we saw with the data. Well, each table has a different representation of the students’ names. So, for example, in the roster.csv file, the last name and first name were all in one field, and it was just simply a string separated by a comma.
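If we ever need the two parts separately, a combined name like that can be split on the comma. Here’s a small sketch, assuming the format is always last name, comma, first name. The sample name and the split_roster_name helper are hypothetical.

```python
def split_roster_name(full_name):
    """Split a 'Last, First' string into (first, last)."""
    # partition() splits on the first comma only.
    last, _, first = full_name.partition(",")
    return first.strip(), last.strip()


first_name, last_name = split_roster_name("Doe, Jane")
print(first_name, last_name)
```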
05:06 And then the students’ email addresses don’t all follow exactly the same pattern. Now, there is a standard pattern, which is the first name and then the last name, but if this is not a unique string, then the university system would go ahead and add, maybe, an extra digit somewhere to make the email address unique.
We also saw that some of the columns, in particular the NetID and the SID in the two different files, refer to the same data, but the field names are different. And so, again, this is a key thing that we’ll have to be aware of when we’re merging the data.
And then, each of the tables sorts the data differently. Pretty much all of it is randomly sorted, and so, again, we can’t rely on some sort of predefined sorted structure when we’re merging or loading in the data.
And then, lastly, some of the tables have missing values. So again, we’ll have to make sure that we do something when we load and work with the data to deal with these missing values when we’re doing the numerical computations of the final grades. So, that’s a quick overview of what the data looks like.
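For the missing values, one option is to convert each cell to a number and fall back to a default when the cell is empty. Here’s a sketch of that idea. Counting a missing grade as zero is an assumption for illustration, not something the data dictates, and the to_points helper is hypothetical.

```python
def to_points(cell, default=0.0):
    """Convert a CSV cell to a float, using a default for missing values."""
    # A missing value may show up as an empty string or as None.
    text = (cell or "").strip()
    return float(text) if text else default


# Made-up raw cells, as they might come out of a CSV reader.
raw_grades = ["8", "", "7.5", None]
points = [to_points(cell) for cell in raw_grades]
print(points)  # [8.0, 0.0, 7.5, 0.0]
```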