Exploring the Data
00:00 All right. Let’s explore the data a little bit. We’ll explore it in much more detail when we actually start coding, but it’s a good idea to get a general overview of what the data is like first.
00:11
Here’s a suggested project folder structure that you may adopt. You can create some sort of root folder and give it a name, something like gradebook_project, and then we’re going to have a main script file. Because we’re going to be using Jupyter, we’re going to have a .ipynb file.
00:28
And if you’re using just a regular Python script, you’re going to have a .py file. Then, when we start coding, we’re going to assume that there’s a data/ folder.
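As a sketch, the suggested layout might look like this (roster.csv and hw_exam_grades.csv are the names used in this course; the quiz file names are placeholders for illustration):

```
gradebook_project/
├── gradebook.ipynb        # main Jupyter notebook (or gradebook.py for a plain script)
└── data/
    ├── roster.csv
    ├── hw_exam_grades.csv
    ├── quiz_1_grades.csv
    ├── ...
    └── quiz_5_grades.csv
```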
00:38
That folder contains all of our CSV files. There’s a roster.csv file, a file that contains the homework and the exam grades of all the students, and then five quiz files that contain the grades for the quizzes, all structured in the exact same way.
00:58
Let’s take a look at these individually just to get an idea of what’s in them. The roster.csv file is pretty straightforward: it contains the basic information about each student.
01:09
We’ve got the name of the student. Notice that the format is <last name>, <first name>.
01:15
Then we’ve got an identifier, the NetID, which is all uppercase. Next, we’ve got an email address for each student, and again, these are all uppercase. And because this is a large university course with 150 students, the course is subdivided into sections, as is usually done, both to make it more manageable for finding rooms in the university and for grading. All right, so that’s the roster.csv file.
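As a quick sketch of what loading this file might look like, here’s a tiny inline sample in the shape just described. The header names other than NetID, and all the names and IDs, are invented for illustration:

```python
import io

import pandas as pd

# A couple of made-up rows in the roster.csv shape: a combined
# "<last>, <first>" name, an uppercase NetID, an uppercase email,
# and a section number.
roster_csv = io.StringIO(
    "Name,NetID,Email Address,Section\n"
    '"Doe, John",JXD12345,JOHN.DOE@UNIV.EDU,1\n'
    '"Smith, Jane",JXS98765,JANE.SMITH@UNIV.EDU,2\n'
)

roster = pd.read_csv(roster_csv)
print(roster["NetID"].tolist())  # identifiers are all uppercase
```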
01:48
Then we’ve got the homework and exam grades .csv file. This is by far the largest file. It contains an SID field; the NetID and the SID refer to the same data.
02:01
Here we’ve got a different name for a column that refers to the same data as the NetID in the roster.csv file, so that’s obviously an important thing that we need to worry about when we go ahead and start merging the data.
02:17
Then we’ve got the first name and last name of the student in two different fields, different from what was done in the roster.csv file, where it was all contained in one field.
02:26 Then we’ve got a sequence of grades and information about the homework assignments and the exam grades. There are ten homework assignments, and each one has three fields associated with it.
02:42 There’s the actual grade for that particular homework assignment, and then the maximum number of points for that assignment, which is the exact same number in every row.
02:52 So this data is redundant, but that’s okay; that’s how we get the data. Then there’s a column with a timestamp of when that homework assignment was submitted. This pattern of three fields repeats for the other nine homework assignments.
03:10
Then we have this exact same pattern repeating for the exams. We’ll have an 'Exam 1' field with the grade for that exam, then the 'Exam 1 - Max Points' field for the maximum points for exam number 1, and then a timestamp as well, and we’ll have the same for exam number 2 and exam number 3.
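Assuming the columns follow this repeating three-field naming pattern (the exact submission-time header is a guess, and all values here are invented), a small sketch shows how the pure score columns can be picked out later, skipping the redundant max-points and timestamp columns:

```python
import io

import pandas as pd

# Tiny made-up sample following the score / max-points / timestamp
# pattern described above, for one homework and one exam.
grades_csv = io.StringIO(
    "SID,First Name,Last Name,Homework 1,Homework 1 - Max Points,"
    "Homework 1 - Submission Time,Exam 1,Exam 1 - Max Points\n"
    "jxd12345,John,Doe,80,100,2023-09-01 10:00,90,100\n"
)
grades = pd.read_csv(grades_csv)

# A regex anchored at both ends matches "Homework 1" but not
# "Homework 1 - Max Points" or the submission-time column.
homework_scores = grades.filter(regex=r"^Homework \d+$")
print(homework_scores.columns.tolist())  # ['Homework 1']
```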
03:30
Then we’ve got the quiz grades CSV files, and these are all structured in the exact same way. Each has a 'Last Name' field, a 'First Name' field, an email address, and the actual grade that the student received on that particular quiz.
03:45 The maximum number of points for each particular quiz is not included in the file, so when we’re actually coding, we’ll simply define a list or a dictionary that contains this information, which we can use when we compute the average of the quizzes.
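The idea from the paragraph above can be sketched like this. The point values and scores here are made-up placeholders, since the real maximums aren’t given in the transcript:

```python
# The quiz files don't carry max points, so hold them in a dict
# keyed by quiz number. These values are invented for illustration.
quiz_max_points = {1: 11, 2: 15, 3: 17, 4: 14, 5: 12}

# Hypothetical raw scores for one student, keyed the same way.
quiz_scores = {1: 8, 2: 12, 3: 15, 4: 10, 5: 11}

# Average of the per-quiz percentages.
average = sum(
    quiz_scores[q] / quiz_max_points[q] for q in quiz_max_points
) / len(quiz_max_points)
print(round(average, 3))  # 0.808
```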
04:03 Then there are four other quiz files (five in total), and they all have the exact same structure. Now, notice one thing: here in the email address field, all of the emails are in lowercase, whereas in the roster file, all the email addresses were in uppercase.
04:19 So these are just some things that we have to keep in mind when we load and merge the data: we need to normalize the data in some way so that it all looks the same and we don’t run the risk of either missing any data or duplicating any data.
04:36 We’ll deal with this when we actually load and merge the data.
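As a minimal sketch of that normalization, with invented addresses: lowercasing both sides makes the same email compare equal regardless of which file it came from.

```python
import pandas as pd

# Made-up addresses: uppercase as in the roster file,
# lowercase as in the quiz files.
roster_emails = pd.Series(["JOHN.DOE@UNIV.EDU", "JANE.SMITH@UNIV.EDU"])
quiz_emails = pd.Series(["john.doe@univ.edu", "jane.smith@univ.edu"])

print(roster_emails.equals(quiz_emails))              # False: raw values differ
print(roster_emails.str.lower().equals(quiz_emails))  # True: normalized, they match
```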
04:41
All right, so let’s recap what we saw in the data. Each table has a different representation of the students’ names. For example, in the roster.csv file, the last name and first name were combined in one field, as a single string separated by a comma.
04:58
Whereas in the hw_exam_grades.csv file, we had two separate fields: one for the last name and one for the first name.
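That mismatch can be bridged with a string split, sketched here with invented names:

```python
import pandas as pd

# roster.csv stores "<last>, <first>" in one field; split it into
# the two separate columns used by hw_exam_grades.csv.
names = pd.Series(["Doe, John", "Smith, Jane"])
split = names.str.split(", ", expand=True)
split.columns = ["Last Name", "First Name"]
print(split["First Name"].tolist())  # ['John', 'Jane']
```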
05:06 The students’ email addresses also don’t all follow the same pattern. There is a standard pattern, the first name followed by the last name, but if that string is not unique, the university system adds an extra digit somewhere to make the address unique.
05:28
We also saw that some of the columns, in particular the NetID and the SID in the two different files, refer to the same data even though the field names are different. Again, this is a key thing that we’ll have to be aware of when we merge the data. Each of the tables also sorts the data differently; pretty much all of it is randomly sorted, so we can’t rely on any predefined sort order when we’re loading or merging the data. And lastly, some of the tables have missing values, so we’ll have to make sure we deal with those when we load and work with the data, before doing the numerical computation of the final grades. So, that’s a quick overview of what the data looks like.
06:18 Why don’t we go ahead and jump in and start loading the data and then start computing!