Pandas Project: Make a Gradebook With Pandas

Pandas Project: Make a Gradebook With Python & Pandas

by Bryan Weber Jul 15, 2020 advanced data-science projects

One of the jobs that all teachers have in common is evaluating students. Whether you use exams, homework assignments, quizzes, or projects, you usually have to turn students’ scores into a letter grade at the end of the term. This often involves a bunch of calculations that you might do in a spreadsheet. Instead, you can consider using Python and pandas.

One problem with using a spreadsheet is that it can be hard to see when you make a mistake in a formula. Maybe you selected the wrong column and put quizzes where exams should go. Maybe you found the maximum of two incorrect values. To solve this problem, you can use Python and pandas to do all your calculations and find and fix those mistakes much faster.

In this tutorial, you’ll learn how to:

  • Load and merge data from multiple sources with pandas
  • Filter and group data in a pandas DataFrame
  • Calculate and plot grades in a pandas DataFrame

Click the link below to download the code for this pandas project and follow along as you build your gradebook script:

Demo: What You’ll Build#

In this pandas project, you’re going to create a Python script that loads your grade data and calculates letter grades for your students. Check out this video for a demonstration of the script in action:

Your script will run from the command line or your IDE and will produce CSV output files so you can paste the grades into your school’s grading system. You’ll also produce a few plots to take a look at how your grades are distributed.

Project Overview#

This pandas project involves four main steps:

  1. Explore the data you’ll use in the project to determine which format and data you’ll need to calculate your final grades.
  2. Load the data into pandas DataFrames, making sure to connect the grades for the same student across all your data sources.
  3. Calculate the final grades and save them as CSV files.
  4. Plot the grade distribution and explore how the grades vary among your students.

Once you complete these steps, you’ll have a working Python script that can calculate your grades. Your grades will be in a format that you should be able to upload to your school’s student administration system.

Background Reading#

You’ll get the most out of this pandas project if you have a little bit of experience working with pandas. If you need a refresher, then these tutorials and courses will get you up to speed for this project:

Don’t worry too much about memorizing all the details in those tutorials. You’ll see a practical application of the topics in this pandas project. Now let’s take a look at the data you’ll be using in this project!

Exploring the Data for This Pandas Project#

Like most teachers, you probably used a variety of services to manage your class this term, including:

  • The school’s student administration system
  • A service to manage assigning and grading homework and exams
  • A service to manage assigning and grading quizzes

For the purposes of this project, you’ll use sample data that represents what you might get out of these systems. The data is in comma-separated values (CSV) files. Some samples of the data are shown here. First, there’s a file that contains the roster information for the class. This would come from your student administration system:

ID Name NetID Email Address Section
1234567 “Barrera Jr., Woody” WXB12345 WOODY.BARRERA_JR@UNIV.EDU 1
2345678 “Lambert, Malaika” MXL12345 MALAIKA.LAMBERT@UNIV.EDU 2
3456789 “Joyce, Traci” TXJ12345 TRACI.JOYCE@UNIV.EDU 1
4567890 “Flower, John Gregg” JGF12345 JOHN.G.2.FLOWER@UNIV.EDU 3

This table indicates each student’s ID number, name, NetID, and email address as well as the section of the class that they belong to. In this term, you taught one class that met at different times, and each class time has a different section number.

Next, you have a file that contains homework and exam scores. This one is from the homework and exam grading service and has a slightly different arrangement of columns than the roster:

SID First Name Last Name Homework 1 Homework 1 - Max Points Homework 1 - Submission Time
jgf12345 Gregg Flower 69 80 2019-08-29 08:56:02-07:00
mxl12345 Malaika Lambert 63 80 2019-08-29 08:56:02-07:00
txj12345 Traci Joyce 80 2019-08-29 08:56:02-07:00
wxb12345 Woody Barrera 55 80 2019-08-29 08:56:02-07:00

In this table, each student has an SID, first name, and last name. In addition, there are three values reported for each homework assignment and exam you gave:

  1. The score the student received
  2. The maximum score for that assignment
  3. The time the student submitted the assignment

Last, you have files that contain information for quiz grades. These files are separated so that one quiz is stored in each data file, and the information in these files is different from the roster and the homework files:

Last Name First Name Email Grade
Barrera Woody woody.barrera_jr@univ.edu 4
Flower John john.g.2.flower@univ.edu 8
Joyce Traci traci.joyce@univ.edu 8
Lambert Malaika malaika.lambert@univ.edu 8

In the quiz table, each student has a last name, first name, email, and quiz grade. Notice that the maximum possible quiz score isn’t stored in this table. You’ll see how to supply that information later on.

Inspecting this data, you might notice several features:

  • Each table has different representations of the students’ names. For instance, in the roster table the names are in the form "Last Name, First Name" with quotes so that a CSV parser doesn’t interpret the comma as a new column. However, in the homework table, first names and last names each get their own column.

  • Each student might use a different name in different data sources. For instance, the quiz tables don’t include the suffix Jr. in Woody Barrera’s name. Another example is that John Flower prefers to be called by his middle name, Gregg, so he adjusted the display in the homework table.

  • Each student’s email address doesn’t have the same elements. The basic email address for a student is first.last@univ.edu. However, if an email of that form is already owned by another student, then the email address is modified to be unique. This means you can’t predict a student’s email address just from their name.

  • Each column may use a unique name even if it has the same data. For instance, all the students have an identifier of the form abc12345. The roster table calls this their NetID, while the homework table calls this their SID. The quiz tables don’t have this information at all. Similarly, some tables use the column header Email address, while others just use Email.

  • Each table sorts the data differently. In the roster table, the data are sorted by the ID column. In the homework table, the data are sorted by the first letter of the first name. In the quiz tables, the data are sorted in a random order.

  • Each of the rows or columns in the tables may have missing data. For instance, Traci Joyce didn’t submit her work for Homework 1, so her row is blank in the homework table.

All of these features and more are present in data that you’ll see in the real world. In the rest of this pandas project, you’ll see how you can address each of these features and make sure they don’t disrupt your analysis.

Deciding on the Final Format for the Data#

Now that you’ve seen the raw data formats, you can think about the final format of the data. In the end, you’ll need to calculate a letter grade for each student from their raw scores. Each row in your final data table will contain all the data for a single student. The number of rows will then be equal to the number of students in your class.

The columns will represent each homework score, quiz score, and exam score. You’ll also store some information about each student, including their name and unique identifier. Finally, you’ll store each of your calculations and the final letter grade in separate columns.

Here’s a sample of your final table:

Identifier Name Homework Quizzes Exams Final Score Final Grade
Student 1 Last, First # # # # A–F
Student 2 Last, First # # # # A–F

Each row in the table stores all the data for a single student. The first column has the student’s unique identifier and the second column has the student’s name. Then a series of columns stores the homework, quiz, exam, and final scores. Last is a column for the final grade.

Now that you’ve seen what the final shape of the data will be, you can get started working with the data. The first step is to load the data!

Loading the Data With Pandas#

One of the best packages for working with tabular data in Python is pandas! You’re going to take advantage of a lot of the functionality in pandas, especially for merging datasets and performing mathematical operations with the data.

The code samples shown in this section are collected in the 01-loading-the-data.py file. You can download the source code by clicking the link below:

Create a Python script called gradebook.py. You’ll also need to create a folder called data that will store the input data files for your gradebook script.

Then, in gradebook.py, start by adding a module-level docstring that explains the purpose of the file. You can also import a few libraries right now:

"""Calculate student grades by combining data from many sources.

Using pandas, this script combines data from the:

* Roster
* Homework & Exam grades
* Quiz grades

to calculate final grades for a class.
"""
from pathlib import Path
import pandas as pd

In this code, you include a docstring that describes the purpose of the script. Then you import pathlib.Path and pandas.

Loading the Roster File#

Now you’re ready to load the data, beginning with the roster:

HERE = Path(__file__).parent
DATA_FOLDER = HERE / "data"

roster = pd.read_csv(
    DATA_FOLDER / "roster.csv",
    converters={"NetID": str.lower, "Email Address": str.lower},
    usecols=["Section", "Email Address", "NetID"],
    index_col="NetID",
)

In this code, you create two constants, HERE and DATA_FOLDER, to keep track of the location of the currently executing file as well as the folder where the data is stored. These constants use the pathlib module to make it easy to refer to different folders.

Then you read the roster file using pd.read_csv(). To help process the data later, you set an index using index_col and include only the useful columns with usecols.

To make sure you can compare strings later, you also pass the converters argument to convert columns to lowercase. This will simplify the string comparisons you’ll do later on.

You can see some of the data in the roster DataFrame below:

NetID Email Address Section
wxb12345 woody.barrera_jr@univ.edu 1
mxl12345 malaika.lambert@univ.edu 2
txj12345 traci.joyce@univ.edu 1
jgf12345 john.g.2.flower@univ.edu 3

These are the first four rows from roster, and they match the rows from the roster table you looked at in the previous section. However, the NetID and Email Address columns have both been converted to lowercase strings because you passed str.lower to converters for those two columns. You’ve also omitted the Name and ID columns.

Loading the Homework and Exam File#

Next, you can load the homework and exam grades CSV file. Remember that this file includes first and last names and the SID column in addition to all the grades. You want to ignore the columns with the submission times:

hw_exam_grades = pd.read_csv(
    DATA_FOLDER / "hw_exam_grades.csv",
    converters={"SID": str.lower},
    usecols=lambda x: "Submission" not in x,
    index_col="SID",
)

In this code, you again use the converters argument to convert the data in the SID and Email Address columns to lowercase. Although the data in these columns appear to be lowercase on first inspection, the best practice is to make sure that everything is consistent. You also need to specify SID as the index column to match the roster DataFrame.

In this CSV file, there are a number of columns containing assignment submission times that you won’t use in any further analysis. However, there are so many other columns that you want to keep that it would be tedious to list all of them explicitly.

To get around this, usecols also accepts functions that are called with one argument, the column name. If the function returns True, then the column is included. Otherwise, the column is excluded. With the lambda function you pass here, if the string "Submission" appears in the column name, then the column will be excluded.

Here’s a sample of the hw_exam_grades DataFrame to give you a sense of what the data looks like after it’s been loaded:

SID Homework 1 Homework 1 - Max Points Homework 2
jgf12345 69 80 52
mxl12345 63 80 57
txj12345 nan 80 77
wxb12345 55 80 62

These are the rows for the example students in the homework and exam grades CSV file you saw in the previous section. Notice that the missing data for Traci Joyce (SID txj12345) in the Homework 1 column was read as a nan, or Not a Number, value. You’ll see how to handle this kind of data in a later section. The ellipses (...) indicate columns of data that aren’t shown in the sample here but are loaded from the real data.

Loading the Quiz Files#

Last, you need to load the data from the quizzes. There are five quizzes that you need to read, and the most useful form of this data is a single DataFrame rather than five separate DataFrames. The final data format will look like this:

Email Quiz 5 Quiz 2 Quiz 4 Quiz 1 Quiz 3
woody.barrera_jr@univ.edu 10 10 7 4 11
john.g.2.flower@univ.edu 5 8 13 8 8
traci.joyce@univ.edu 4 6 9 8 14
malaika.lambert@univ.edu 6 10 13 8 10

This DataFrame has the Email column as the index, and each quiz is in a separate column. Notice that the quizzes are out of order, but you’ll see when you calculate the final grades that the order doesn’t matter. You can use this code to load the quiz files:

quiz_grades = pd.DataFrame()
for file_path in DATA_FOLDER.glob("quiz_*_grades.csv"):
    quiz_name = " ".join(file_path.stem.title().split("_")[:2])
    quiz = pd.read_csv(
        file_path,
        converters={"Email": str.lower},
        index_col=["Email"],
        usecols=["Email", "Grade"],
    ).rename(columns={"Grade": quiz_name})
    quiz_grades = pd.concat([quiz_grades, quiz], axis=1)

In this code, you create an empty DataFrame called quiz_grades. You need the empty DataFrame for the same reason that you need to create an empty list before using list.append().

You use Path.glob() to find all the quiz CSV files and load them with pandas, making sure to convert the email addresses to lowercase. You also set the index column for each quiz to the students’ email addresses, which pd.concat() uses to align data for each student.

Notice that you pass axis=1 to pd.concat(). This causes pandas to concatenate columns rather than rows, adding each new quiz into a new column in the combined DataFrame.

Finally, you use DataFrame.rename() to change the name of the grade column from Grade to something specific to each quiz.

Merging the Grade DataFrames#

Now that you have all your data loaded, you can combine the data from your three DataFrames, roster, hw_exam_grades, and quiz_grades. This lets you use one DataFrame for all your calculations and save a complete grade book to another format at the end.

All the modifications to gradebook.py made in this section are collected in the 02-merging-dataframes.py file. You can download the source code by clicking the link below:

You’ll merge the data together in two steps:

  1. Merge roster and hw_exam_grades together into a new DataFrame called final_data.
  2. Merge final_data and quiz_grades together.

You’ll use different columns in each DataFrame as the merge key, which is how pandas determines which rows to keep together. This process is necessary because each data source uses a different unique identifier for each student.

Merging the Roster and Homework Grades#

In roster and hw_exam_grades, you have the NetID or SID column as a unique identifier for a given student. When merging or joining DataFrames in pandas, it’s most useful to have an index. You already saw how useful this was when you were loading the quiz files.

Remember that you passed the index_col argument to pd.read_csv() when you loaded the roster and the homework grades. Now you can merge these two DataFrames together:

final_data = pd.merge(
    roster, hw_exam_grades, left_index=True, right_index=True,
)

In this code, you use pd.merge() to combine the roster and hw_exam_grades DataFrames.

Here’s a sample of the merged DataFrame for the four example students:

NetID Email Address Homework 1
wxb12345 woody.barrera_jr@univ.edu 55
mxl12345 malaika.lambert@univ.edu 63
txj12345 traci.joyce@univ.edu nan
jgf12345 john.g.2.flower@univ.edu 69

Like you saw before, the ellipses indicate columns that aren’t shown in the sample here but are present in the actual DataFrame. The sample table shows that students with the same NetID or SID have been merged together, so their email addresses and Homework 1 grades match the tables you saw previously.

Merging the Quiz Grades#

When you loaded the data for the quiz_grades, you used the email address as a unique identifier for each student. This is different from hw_exam_grades and roster, which used the NetID and SID, respectively.

To merge quiz_grades into final_data, you can use the index from quiz_grades and the Email Address column from final_data:

final_data = pd.merge(
    final_data, quiz_grades, left_on="Email Address", right_index=True
)

In this code, you use the left_on argument to pd.merge() to tell pandas to use the Email Address column in final_data in the merge. You also use right_index to tell pandas to use the index from quiz_grades in the merge.

Here’s a sample of the merged DataFrame showing the four example students:

NetID Email Address Homework 1 Quiz 3
wxb12345 woody.barrera_jr@univ.edu 55 11
mxl12345 malaika.lambert@univ.edu 63 10
txj12345 traci.joyce@univ.edu nan 14
jgf12345 john.g.2.flower@univ.edu 69 8

Remember that ellipses mean that columns are missing in the sample table here but will be present in the merged DataFrame. You can double-check the previous tables to verify that the numbers are aligned for the correct students.

Filling in nan Values#

Now all your data is merged into one DataFrame. Before you can move on to calculating the grades, you need to do one more bit of data cleaning.

You can see in the table above that Traci Joyce still has a nan value for her Homework 1 assignment. You can’t use nan values in calculations because, well, they’re not a number! You can use DataFrame.fillna() to assign a number to all the nan values in final_data:

final_data = final_data.fillna(0)

In this code, you use DataFrame.fillna() to fill all nan values in final_data with the value 0. This is an appropriate resolution because the nan value in Traci Joyce’s Homework 1 column indicates that the score is missing, meaning she probably didn’t hand in the assignment.

Here’s a sample of the modified DataFrame showing the four example students:

NetID First Name Last Name Homework 1
wxb12345 Woody Barrera 55
mxl12345 Malaika Lambert 63
txj12345 Traci Joyce 0
jgf12345 John Flower 69

As you can see in this table, Traci Joyce’s Homework 1 score is now 0 instead of nan, but the grades for the other students haven’t changed.

Calculating Grades With Pandas DataFrames#

There are three categories of assignments that you had in your class:

  1. Exams
  2. Homework
  3. Quizzes

Each of these categories is assigned a weight toward the students’ final score. For your class this term, you assigned the following weights:

Category Percent of Final Grade Weight
Exam 1 Score 5 0.05
Exam 2 Score 10 0.10
Exam 3 Score 15 0.15
Quiz Score 30 0.30
Homework Score 40 0.40

The final score can be calculated by multiplying the weight by the total score from each category and summing all these values. The final score will then be converted to a final letter grade.

All the modifications to gradebook.py made in this section are collected in the 03-calculating-grades.py file. You can download the source code by clicking the link below:

This means that you have to calculate the total from each category. The total from each category is a floating-point number from 0 to 1 that represents how many points a student earned relative to the maximum possible score. You’ll handle each assignment category in turn.

Calculating the Exam Total Score#

You’ll calculate grades for the exams first. Since each exam has a unique weight, you can calculate the total score for each exam individually. It makes the most sense to use a for loop, which you can see in this code:

n_exams = 3
for n in range(1, n_exams + 1):
    final_data[f"Exam {n} Score"] = (
        final_data[f"Exam {n}"] / final_data[f"Exam {n} - Max Points"]
    )

In this code, you set n_exams equal to 3 because you had three exams during the term. Then you loop through each exam to calculate the score by dividing the raw score by the max points for that exam.

Here’s a sample of the exam data for the four example students:

NetID Exam 1 Score Exam 2 Score Exam 3 Score
wxb12345 0.86 0.62 0.90
mxl12345 0.60 0.91 0.93
txj12345 1.00 0.84 0.64
jgf12345 0.72 0.83 0.77

In this table, each student scored between 0.0 and 1.0 on each of the exams. At the end of your script, you’ll multiply these scores by the weight to determine the proportion of the final grade.

Calculating the Homework Scores#

Next, you need to calculate the homework scores. The max points for each homework assignment varies from 50 to 100. This means that there are two ways to calculate the homework score:

  1. By total score: Sum the raw scores and maximum points independently, then take the ratio.
  2. By average score: Divide each raw score by its respective maximum points, then take the sum of these ratios and divide the total by the number of assignments.

The first method gives a higher score to students who performed consistently, while the second method favors students who did well on assignments that were worth more points. To help students, you’ll give them the maximum of these two scores.

Computing these scores will take a few steps:

  1. Collect the columns with homework data.
  2. Compute the total score.
  3. Compute the average score.
  4. Determine which score is larger and will be used in the final score calculation.

First, you need to collect all the columns with homework data. You can use DataFrame.filter() to do this:

homework_scores = final_data.filter(regex=r"^Homework \d\d?$", axis=1)
homework_max_points = final_data.filter(regex=r"^Homework \d\d? -", axis=1)

In this code, you use a regular expression (regex) to filter final_data. If a column name doesn’t match the regex, then the column won’t be included in the resulting DataFrame.

The other argument you pass to DataFrame.filter() is axis. Many methods of a DataFrame can operate either row-wise or column-wise, and you can switch between the two approaches using the axis argument. With the default argument axis=0, pandas would look for rows in the index that match the regex you passed. Since you want to find all the columns that match the regex instead, you pass axis=1.

Now that you’ve collected the columns you need from the DataFrame, you can do the calculations with them. First, you sum the two values independently and then divide them to compute the total homework score:

sum_of_hw_scores = homework_scores.sum(axis=1)
sum_of_hw_max = homework_max_points.sum(axis=1)
final_data["Total Homework"] = sum_of_hw_scores / sum_of_hw_max

In this code, you use DataFrame.sum() and pass the axis argument. By default, .sum() will add up the values for all the rows in each column. However, you want the sum of all the columns for each row because each row represents one student. The axis=1 argument tells pandas to do just that.

Then you assign a new column in final_data called Total Homework to the ratio of the two sums.

Here’s a sample of the calculation results for the four example students:

NetID Sum of Homework Scores Sum of Max Scores Total Homework
wxb12345 598 740 0.808108
mxl12345 612 740 0.827027
txj12345 581 740 0.785135
jgf12345 570 740 0.770270

In this table, you can see the sum of the homework scores, the sum of the max scores, and the total homework score for each student.

The other calculation method is to divide each homework score by its maximum score, add up these values, and divide the total by the number of assignments. To do this, you could use a for loop and go through each column. However, pandas allows you to be more efficient because it will match column and index labels and perform mathematical operations only on matching labels.

To make this work, you need to change the column names for homework_max_points to match the names in homework_scores. You can do this using DataFrame.set_axis():

hw_max_renamed = homework_max_points.set_axis(homework_scores.columns, axis=1)

In this code, you create a new DataFrame, hw_max_renamed, and you set the columns axis to have the same names as the columns in homework_scores. Now you can use this DataFrame for more calculations:

average_hw_scores = (homework_scores / hw_max_renamed).sum(axis=1)

In this code, you calculate the average_hw_scores by dividing each homework score by its respective maximum points. Then you add the ratios together for all the homework assignments in each row with DataFrame.sum() and the argument axis=1.

Since the maximum value on each individual assignment is 1.0, the maximum value that this sum could take would equal the total number of homework assignments. However, you need a number that’s scaled from 0 to 1 to factor into the final grade.

This means you need to divide average_hw_scores by the number of assignments, which you can do with this code:

final_data["Average Homework"] = average_hw_scores / homework_scores.shape[1]

In this code, you use DataFrame.shape to get the number of assignments from homework_scores. Like a NumPy array, DataFrame.shape returns a tuple of (n_rows, n_columns). Taking the second value from the tuple gives you the number of columns in homework_scores, which is equal to the number of assignments.

Then you assign the result of the division to a new column in final_data called Average Homework.

Here’s a sample calculation result for the four example students:

NetID Sum of Average Homework Scores Average Homework
wxb12345 7.99405 0.799405
mxl12345 8.18944 0.818944
txj12345 7.85940 0.785940
jgf12345 7.65710 0.765710

In this table, notice that the Sum of Average Homework Scores can vary from 0 to 10, but the Average Homework column varies from 0 to 1. The second column will be used to compare to Total Homework next.

Now that you have your two homework scores calculated, you can take the maximum value to be used in the final grade calculation:

final_data["Homework Score"] = final_data[
    ["Total Homework", "Average Homework"]
].max(axis=1)

In this code, you select the two columns you just created, Total Homework and Average Homework, and assign the maximum value to a new column called Homework Score. Notice that you take the maximum for each student with axis=1.

Here’s a sample of the calculated results for the four example students:

NetID Total Homework Average Homework Homework Score
wxb12345 0.808108 0.799405 0.808108
mxl12345 0.827027 0.818944 0.827027
txj12345 0.785135 0.785940 0.785940
jgf12345 0.770270 0.765710 0.770270

In this table, you can compare the Total Homework, Average Homework, and final Homework Score columns. You can see that the Homework Score always reflects the larger of Total Homework or Average Homework.

Calculating the Quiz Score#

Next, you need to calculate the quiz score. The quizzes also have different numbers of maximum points, so you need to do the same procedure you did for the homework. The only difference is that the maximum grade on each quiz isn’t specified in the quiz data tables, so you need to create a pandas Series to hold that information:

quiz_scores = final_data.filter(regex=r"^Quiz \d$", axis=1)
quiz_max_points = pd.Series(
    {"Quiz 1": 11, "Quiz 2": 15, "Quiz 3": 17, "Quiz 4": 14, "Quiz 5": 12}
)

sum_of_quiz_scores = quiz_scores.sum(axis=1)
sum_of_quiz_max = quiz_max_points.sum()
final_data["Total Quizzes"] = sum_of_quiz_scores / sum_of_quiz_max

average_quiz_scores = (quiz_scores / quiz_max_points).sum(axis=1)
final_data["Average Quizzes"] = average_quiz_scores / quiz_scores.shape[1]

final_data["Quiz Score"] = final_data[
    ["Total Quizzes", "Average Quizzes"]
].max(axis=1)

Most of this code is quite similar to the homework code from the last section. The main difference from the homework case is that you created a pandas Series for quiz_max_points using a dictionary as input. The keys of the dictionary become index labels and the dictionary values become the Series values.

Since the index labels in quiz_max_points have the same names as quiz_scores, you don’t need to use DataFrame.set_axis() for the quizzes. pandas also broadcasts the shape of a Series so that it matches the DataFrame.

Here’s a sample of the result of this calculation for the quizzes:

NetID Total Quizzes Average Quizzes Quiz Score
wxb12345 0.608696 0.602139 0.608696
mxl12345 0.681159 0.682149 0.682149
txj12345 0.594203 0.585399 0.594203
jgf12345 0.608696 0.615286 0.615286

In this table, the Quiz Score is always the larger of Total Quizzes or Average Quizzes, as expected.

Calculating the Letter Grade#

Now you’ve completed all the required calculations for the final grade. You have scores for the exams, homework, and quizzes that are all scaled between 0 and 1. Next, you need to multiply each score by its weighting to determine the final grade. Then you can map that value onto a scale for letter grades, A through F.

Similar to the maximum quiz scores, you’ll use a pandas Series to store the weightings. That way, you can multiply by the correct columns from final_data automatically. Create your weightings with this code:

weightings = pd.Series(
    {
        "Exam 1 Score": 0.05,
        "Exam 2 Score": 0.1,
        "Exam 3 Score": 0.15,
        "Quiz Score": 0.30,
        "Homework Score": 0.4,
    }
)

In this code, you give a weighting to each component of the class. As you saw earlier, Exam 1 is worth 5 percent, Exam 2 is worth 10 percent, Exam 3 is worth 15 percent, quizzes are worth 30 percent, and Homework is worth 40 percent of the overall grade.

Next, you can combine these percentages with the scores you calculated previously to determine the final score:

final_data["Final Score"] = (final_data[weightings.index] * weightings).sum(
    axis=1
)
final_data["Ceiling Score"] = np.ceil(final_data["Final Score"] * 100)

In this code, you select the columns of final_data that have the same names as the index in weightings. You need to do this because some of the other columns in final_data have type str, so pandas will raise a TypeError if you try to multiply weightings by all of final_data.

Next, you take the sum of these columns for each student with DataFrame.sum(axis=1) and you assign the result of this to a new column called Final Score. The value in this column for each student is a floating-point number between 0 and 1.

Finally, being the really nice teacher that you are, you’re going to round each student’s grade up. You multiply each student’s Final Score by 100 to put it on a scale from 0 to 100, then you use numpy.ceil() to round each score to the next highest integer. You assign this value to a new column called Ceiling Score.

Here’s a sample calculation result for these columns for the four example students:

NetID Final Score Ceiling Score
wxb12345 0.745852 75
mxl12345 0.795956 80
txj12345 0.722637 73
jgf12345 0.727194 73

The last thing to do is to map each student’s ceiling score onto a letter grade. At your school, you might use these letter grades:

  • A: Score of 90 or higher
  • B: Score between 80 and 90
  • C: Score between 70 and 80
  • D: Score between 60 and 70
  • F: Score below 60

Since each letter grade has to map to a range of scores, you can’t easily use just a dictionary for the mapping. Fortunately, pandas has Series.map(), which allows you to apply an arbitrary function to the values in a Series. You could do something similar if you used a different grading scale than letter grades.

You can write an appropriate function this way:

grades = {
    90: "A",
    80: "B",
    70: "C",
    60: "D",
    0: "F",
}

def grade_mapping(value):
    for key, letter in grades.items():
        if value >= key:
            return letter

In this code, you create a dictionary that stores the mapping between the lower limit of each letter grade and the letter. Then you define grade_mapping(), which takes as an argument the value of a row from the ceiling score Series. You loop over the items in grades, comparing value to the key from the dictionary. If value is greater than key, then the student falls in that bracket and you return the appropriate letter grade.

With grade_mapping() defined, you can use Series.map() to find the letter grades:

letter_grades = final_data["Ceiling Score"].map(grade_mapping)
final_data["Final Grade"] = pd.Categorical(
    letter_grades, categories=grades.values(), ordered=True
)

In this code, you create a new Series called letter_grades by mapping grade_mapping() onto the Ceiling Score column from final_data. Since there are five choices for a letter grade, it makes sense for this to be a categorical data type. Once you’ve mapped the scores to letters, you can create a categorical column with the pandas Categorical class.

To create the categorical column, you pass the letter grades as well as two keyword arguments:

  1. categories is passed the values from grades. The values in grades are the possible letter grades in the class.
  2. ordered is passed True to tell pandas that the categories are ordered. This will help later if you want to sort this column.

The categorical column that you create is assigned to a new column in final_data called Final Grade.

Here are the final grades for the four example students:

NetID Final Score Ceiling Score Final Grade
wxb12345 0.745852 75 C
mxl12345 0.795956 80 B
txj12345 0.722637 73 C
jgf12345 0.727194 73 C

Among the four example students, one person got a B and three people got Cs, matching their ceiling scores and the letter grade mapping you created.

Grouping the Data#

Now that you’ve calculated the grades for each student, you probably need to put them into the student administration system. This term, you’re teaching several sections of the same class, as indicated by the Section column in the roster table.

All of the modifications to gradebook.py made in this section are collected in the 04-grouping-the-data.py file. You can download the source code by clicking the link below:

To put the grades into your student administration system, you need to separate the students into each section and sort them by their last name. Fortunately, pandas has you covered here as well.

pandas has powerful abilities to group and sort data in DataFrames. You need to group your data by the students’ section number and sort the grouped result by their name. You can do that with this code:

for section, table in final_data.groupby("Section"):
    section_file = DATA_FOLDER / f"Section {section} Grades.csv"
    num_students = table.shape[0]
    print(
        f"In Section {section} there are {num_students} students saved to "
        f"file {section_file}."
    )
    table.sort_values(by=["Last Name", "First Name"]).to_csv(section_file)

In this code, you use DataFrame.groupby() on final_data to group by the Section column and DataFrame.sort_values() to sort the grouped results. Last, you save the sorted data to a CSV file for upload to the student administration system. With that, you’re done with your grades for the term and you can relax for the break!

Plotting Summary Statistics#

Before you hang up the whiteboard marker for the summer, though, you might like to see a little bit more about how the class did overall. Using pandas and Matplotlib, you can plot some summary statistics for the class.

All of the modifications made to gradebook.py in this section are collected in the 05-plotting-summary-statistics.py file. You can download the source code by clicking the link below:

First, you might want to see a distribution of the letter grades in the class. You can do that with this code:

grade_counts = final_data["Final Grade"].value_counts().sort_index()
grade_counts.plot.bar()
plt.show()

In this code, you use Series.value_counts() on the Final Grade column in final_data to calculate how many of each of the letters appear. By default, the value counts are sorted from most to fewest, but it would be more useful to see them in letter-grade order. You use Series.sort_index() to sort the grades into the order that you specified when you defined the Categorical column.

You then leverage pandas’s ability to use Matplotlib and produce a bar plot of the grade counts with DataFrame.plot.bar(). Since this is a script, you need to tell Matplotlib to show you the plot with plt.show(), which opens an interactive figure window.

Your figure should look similar to the figure below:

Histogram of Letter Grades

The height of the bars in this figure represents the number of students who received each letter grade shown on the horizontal axis. The majority of your students got a C letter grade.

Next, you might want to see a histogram of the numerical scores of the students. pandas can use Matplotlib with DataFrame.plot.hist() to do that automatically:

final_data["Final Score"].plot.hist(bins=20, label="Histogram")

In this code, you use DataFrame.plot.hist() to plot a histogram of the final scores. Any keyword arguments are passed through to Matplotlib when the plotting is done.

A histogram is one way to estimate the distribution of the data, but you might be interested in more sophisticated methods as well. pandas has the ability to use the SciPy library to calculate a kernel density estimate with DataFrame.plot.density(). You can also guess that the data will be normally distributed and manually calculate a normal distribution with the mean and standard deviation from your data. You can try this code to see how it works:

final_data["Final Score"].plot.density(
    linewidth=4, label="Kernel Density Estimate"
)

final_mean = final_data["Final Score"].mean()
final_std = final_data["Final Score"].std()
x = np.linspace(final_mean - 5 * final_std, final_mean + 5 * final_std, 200)
normal_dist = scipy.stats.norm.pdf(x, loc=final_mean, scale=final_std)
plt.plot(x, normal_dist, label="Normal Distribution", linewidth=4)
plt.legend()
plt.show()

In this code, you first use DataFrame.plot.density() to plot the kernel density estimate for your data. You adjust the line width and label for the plot to make it easier to see.

Next, you calculate the mean and standard deviation of your Final Score data using DataFrame.mean() and DataFrame.std(). You use np.linspace() to generate a set of x-values from -5 to +5 standard deviations away from the mean. Then you calculate the normal distribution in normal_dist by plugging into the formula for the standard normal distribution.

Finally, you plot x vs normal_dist and adjust the line width and add a label. Once you show the plot, you should get a result that looks like this:

Numerical Grade Histogram With PDF Estimates

In this figure, the vertical axis shows the density of the grades in a particular bin. The peak occurs near a grade of 0.78. Both the kernel density estimate and the normal distribution do a pretty good job of matching the data.

Conclusion#

You now know how to build a gradebook script with pandas so you can stop using spreadsheet software. This will help you avoid errors and calculate your final grades more quickly in the future.

In this tutorial, you learned:

  • How to load, clean, and merge data into pandas DataFrames
  • How to calculate with DataFrames and Series
  • How to map values from one set to another
  • How to plot summary statistics using pandas and Matplotlib

In addition, you saw how to group data and save files to upload to your student administration system. Now you’re ready to create your pandas gradebook for next term!

Click the link below to download the code for this pandas project and learn how to build a gradebook without spreadsheets:

🐍 Python Tricks 💌

Get a short & sweet Python Trick delivered to your inbox every couple of days. No spam ever. Unsubscribe any time. Curated by the Real Python team.

Python Tricks Dictionary Merge

About Bryan Weber

Bryan Weber Bryan Weber

Bryan is a mechanical engineering professor and a core developer of Cantera, the open-source platform for thermodynamics, chemical kinetics, and transport.

» More about Bryan

Each tutorial at Real Python is created by a team of developers so that it meets our high quality standards. The team members who worked on this tutorial are:

Master Real-World Python Skills With Unlimited Access to Real Python

Join us and get access to hundreds of tutorials, hands-on video courses, and a community of expert Pythonistas:

Level Up Your Python Skills »

Master Real-World Python Skills
With Unlimited Access to Real Python

Join us and get access to hundreds of tutorials, hands-on video courses, and a community of expert Pythonistas:

Level Up Your Python Skills »

What Do You Think?

Real Python Comment Policy: The most useful comments are those written with the goal of learning from or helping out other readers—after reading the whole article and all the earlier comments. Complaints and insults generally won’t make the cut here.

What’s your #1 takeaway or favorite thing you learned? How are you going to put your newfound skills to use? Leave a comment below and let us know.

Keep Learning

Related Tutorial Categories: advanced data-science projects