Python Practice Problems: Parsing CSV Files

Are you a developer looking for some practice with comma-separated values (CSV) files before an upcoming interview? This tutorial will lead you through a series of Python CSV practice problems to help you get ready.

This tutorial is aimed at intermediate Python developers. It assumes a basic knowledge of Python and working with CSV files. Like other practice problem tutorials, each of the problems listed here shows the problem description. You’ll see the problem statement first and then have a chance to develop your own solution.

In this tutorial, you’ll explore:

Writing code for working with CSV files
Doing test-driven development with pytest
Discussing your solutions and possible enhancements
The trade-offs between the built-in CSV module and pandas

You can get skeleton code with failing unit tests for each of the problems you’ll see in this tutorial by clicking the link below:

Get the Source Code: Click here to get the source code you’ll use to practice parsing CSV files in this tutorial.

Python CSV Parsing: Football Scores

Your first problem deals with English Premier League team standings. You don’t need any special football knowledge to solve this, just Python!

As you work through the problem, try to write more unit tests for each bit of functionality and then write the functionality to make the tests pass. This is known as test-driven development, and it can be a great way to show off not only your coding but also your testing chops!

Remove ads

Problem Description

For this round of the problem, stick to the standard library csv module. You’ll get another shot at it using pandas later. Here’s your first problem:

Find the Minimum Goal Differential

Write a program that takes a filename on the command line and processes the contents of a CSV file. The contents will be the end-of-season football standings for the English Premier League. Your program should determine which team had the smallest goal differential that season.

The first line of the CSV file will be column headers, with each subsequent line showing the data for one team:
CSV
Team,Games,Wins,Losses,Draws,Goals For,Goals Against
Arsenal,38,26,9,3,79,36
The columns labeled Goals For and Goals Against contain the total number of goals scored for and against each team in that season. (So Arsenal scored 79 goals and had 36 goals scored against them.)

Write a program to read the file, then print the name of the team with the smallest difference in Goals For and Goals Against. Create unit tests with pytest to test your program.

There is a single unit test supplied in the skeleton code that tests the problem statement that you’ll see later. You can add more as you write your solution. There are also two pytest fixtures given:

Python
      
    
# test_football_v1.py
import pytest
import football_v1 as fb

@pytest.fixture
def mock_csv_data():
    return [
        "Team,Games,Wins,Losses,Draws,Goals For,Goals Against",
        "Liverpool FC, 38, 32, 3, 3, 85, 33",
        "Norwich City FC, 38, 5, 27, 6, 26, 75",
    ]

@pytest.fixture
def mock_csv_file(tmp_path, mock_csv_data):
    datafile = tmp_path / "football.csv"
    datafile.write_text("\n".join(mock_csv_data))
    return str(datafile)

The first fixture supplies a list of strings that mocks real CSV data, and the second supplies a filename backed by that test data. Each string in the list of strings represents a line of the test file.

Note: The solutions here will have a non-exhaustive set of tests for them that will prove out basic functionality only. For a real system, you would likely want a more complete test suite, possibly making use of parametrization.

Remember that the provided fixtures are only a start. Add unit tests that use them as you design each part of the solution!

Problem Solution

Here’s a discussion of the solution that the Real Python team arrived at and how the team got there.

Note: Remember, don’t open the collapsed section below until you’re ready to look at the answers for each of the Python practice problems!

How did the score parsing go? Are you ready to see the answers the Real Python team came up with?

While solving this problem, the team came up with a couple of solutions by writing and then rewriting the code a few times. In an interview situation, you’ll frequently get only one shot. One technique you can use to address this in a live coding situation is to take a moment and talk about other implementation options you might use now that you see a solution.

Solution 1

You’ll examine two different solutions to this problem. The first solution you’ll walk through works well but has some room for improvement. You’re going to use a test-driven development (TDD) model here, so you won’t look at the full solution first but rather just the overall plan of what the solution looks like.

Breaking the solution into parts allows you to write unit tests for each piece before you write the code. Here’s the general outline of this solution:

Read and parse each line of the CSV file in a generator.
Compute the team name and score differential for a given line.
Find the minimum score differential.

Let’s start with the first part, reading and parsing the file one line at a time. You’ll first build tests for that operation.

Read and Parse

Given the description of the problem, you know what the columns are ahead of time, so you don’t need the first line of labels in your output. You also know that each line of data will have seven fields, so you can test that your parsing function returns a list of lines, each of which is a list with seven entries:

Python
      
    
# test_football_v1.py
import pytest
import football_v1 as fb

# ...

def test_parse_next_line(mock_csv_data):
    all_lines = [line for line in fb.parse_next_line(mock_csv_data)]
    assert len(all_lines) == 2
    for line in all_lines:
        assert len(line) == 7

You can see that this test is using your first pytest fixture, which supplies a list of CSV lines. This test takes advantage of the fact that the CSV module can parse either a list object or a file object. This is quite handy for your tests as you don’t have to worry about managing file objects just yet.

The test uses a list comprehension to read all the lines from parse_next_line(), which will be a generator. It then asserts a few properties on this list:

There are two entries in the list.
Each entry is itself a list of seven items.

Now that you have a test, you can run it to confirm that it runs and that it fails as expected:

Shell
      
        
      
    
$ pytest test_football_v1.py
============================= test session starts ==============================
platform linux -- Python 3.7.1, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/jima/coding/realPython/articles/jima-csv
collected 1 item

test_football_v1.py F                                                   [100%]

=================================== FAILURES ===================================
_______________________________ test_parse_next_line ___________________________

mock_csv_data = ['Team,Games,Wins,Losses,Draws,Goals For,Goals Against', ....

    def test_parse_next_line(mock_csv_data ):
>       all_lines = [line for line in fb.parse_next_line(mock_csv_data)]
E       AttributeError: module 'football_v1' has no attribute 'parse_next_line'

test_football_csv.py:30: AttributeError
=========================== short test summary info ============================
FAILED test_football_v1.py::test_parse_next_line - AttributeError: module 'fo...
============================== 1 failed in 0.02s ===============================

The test fails because parse_next_line() is undefined, which makes sense given that you haven’t written it yet. Running the tests when you know they’ll fail gives you confidence that, when the tests finally pass, the changes you’ve made are what fixed them.

Note: The pytest output above assumes you have a file named football_v1.py, but that it does not contain a function parse_next_line(). If you don’t have this file, you’ll likely get an error stating ModuleNotFoundError: No module named 'football_v1'.

Next you’ll write the missing parse_next_line(). This function will be a generator returning a parsed version of each line of the file. You’ll need to add some code to skip over the header:

Python
      
    
# football_v1.py
import csv

def parse_next_line(csv_file):
    for line in csv.DictReader(csv_file):
        yield line

The function starts by creating a csv.DictReader(), which is an iterator over the CSV file. DictReader uses the header line as keys into the dictionary it creates. Each subsequent line of the file then builds a dictionary with those keys and the corresponding values. This dictionary is what gets yielded to create your generator.

Now try it out with your unit test:

Shell
      
$ pytest test_football_v1.py
============================= test session starts ==============================
platform linux -- Python 3.7.1, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/jima/coding/realPython/articles/jima-csv
collected 1 item

test_football_v1.py .                                                    [100%]

============================== 1 passed in 0.01s ===============================

Excellent! Your first block of functionality is working. You know that the code you added is what makes the test pass. Now you can move on to the next step, computing the score differential for a given line.

Compute Differential

This function will take a list of values parsed by parse_next_line() and compute the score differential Goals For - Goals Against. This is where those test fixtures with a small amount of representative data will help. You can hand-compute the differences of the two lines in your test data to get differentials of 52 for Liverpool FC and 49 for Norwich City FC.

The test for this will use the generator function you just completed to pull each line from the test data:

Python
      
    
# test_football_v1.py
import pytest
import football_v1 as fb

# ...

def test_get_score_difference(mock_csv_data):
    reader = fb.parse_next_line(mock_csv_data)
    assert fb.get_name_and_diff(next(reader)) == ("Liverpool FC", 52)
    assert fb.get_name_and_diff(next(reader)) == ("Norwich City FC", 49)

You start by creating the generator you just tested and then use next() to walk through the two lines of the test data. The assert statements test that each of the hand-computed values is correct.

As before, once you have the test, you can run it to ensure it fails:

Shell
      
        
      
    
$ pytest test_football_v1.py
============================= test session starts ==============================
platform linux -- Python 3.7.1, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/jima/coding/realPython/articles/jima-csv
collected 2 items

test_football_v1.py .F                                                   [100%]

=================================== FAILURES ===================================
__________________________ test_get_score_difference ___________________________

mock_csv_data = ['Team,Games,Wins,Losses,Draws,Goals For,Goals Against', ...

    def test_get_score_difference(mock_csv_data):
        reader = fb.parse_next_line(mock_csv_data)
>       team, diff = fb.get_name_and_diff(next(reader))
E       AttributeError: module 'football_v1' has no attribute 'get_name_and ...

test_football_v1.py:38: AttributeError
=========================== short test summary info ============================
FAILED test_football_v1.py::test_get_score_difference - AttributeError: modul...
========================= 1 failed, 1 passed in 0.03s ==========================

Now that the test is in place, take a look at the implementation of get_name_and_diff(). Since the DictReader put the CSV values into dictionaries for you, you can retrieve the team name and compute the goal differential from each dictionary:

Python
      
    
# football_v1.py
def get_name_and_diff(team_stats):
    diff = int(team_stats["Goals For"]) - int(team_stats["Goals Against"])
    return team_stats["Team"], diff

You could write this as a one-liner, but it might improve readability to split it out into clear fields. It can also make debugging this code easier. These are excellent points to raise if you’re live-coding this during an interview. Showing that you give some thought to readability can make a difference.

Now that you have an implementation for this functionality, you can rerun your tests:

Shell
      
        
      
    
$ pytest test_football_v1.py
============================= test session starts ==============================
platform linux -- Python 3.7.1, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/jima/coding/realPython/articles/jima-csv
collected 2 items

test_football_v1.py .F                                                   [100%]

=================================== FAILURES ===================================
__________________________ test_get_score_difference ___________________________

mock_csv_data = ['Team,Games,Wins,Losses,Draws,Goals For,Goals Against', ...

    def test_get_score_difference(mock_csv_data):
        reader = fb.parse_next_line(mock_csv_data)
        assert fb.get_name_and_diff(next(reader)) == ("Liverpool FC", 52)
>       assert fb.get_name_and_diff(next(reader)) == ("Norwich City FC", 49)
E       AssertionError: assert ('Norwich City FC', -49) == ('Norwich City FC'...
E         At index 1 diff: -49 != 49
E         Use -v to get the full diff

test_football_v1.py:40: AssertionError
=========================== short test summary info ============================
FAILED test_football_v1.py::test_get_score_difference - AssertionError: asser...
========================= 1 failed, 1 passed in 0.07s ==========================

Whoops! That’s not right. The returned difference from the function should not be negative. Good thing you wrote tests!

You can correct that by using abs() on the returned value:

Python
      
    
# football_v1.py
def get_name_and_diff(team_stats):
    diff = int(team_stats["Goals For"]) - int(team_stats["Goals Against"])
    return team_stats["Team"], abs(diff)

You can see in the final line of the function that it now calls abs(diff) so you won’t get negative numbers for results. Now try this version with your test and see if it passes:

Shell
      
$ pytest test_football_v1.py
============================= test session starts ==============================
platform linux -- Python 3.7.1, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/jima/coding/realPython/articles/jima-csv
collected 2 items

test_football_v1.py ..                                                   [100%]

============================== 2 passed in 0.01s ===============================

That’s much better. You want the absolute value of the difference if you want to find the team with the smallest goal differential.

Find Minimum

For your final piece of the puzzle, you need a function that uses your generator to get each line of the CSV file and your function to return the team name and score differential for each line and then finds the minimum of these differentials. The test for this is the overall test given in the skeleton code:

Python
      
    
# test_football_v1.py
import pytest
import football_v1 as fb

# ...

def test_get_min_score(mock_csv_file):
    assert fb.get_min_score_difference(mock_csv_file) == (
        "Norwich City FC",
        49,
    )

Once again, you use the provided pytest fixtures, but this time you use the mock_csv_file fixture to get the filename of a file containing the same set of test data you’ve been using so far. This test calls your final function and asserts the correct answer you computed by hand: Norwich City FC had the smallest score differential at 49 goals.

By this point, you’ve seen tests failing before the tested function is implemented, so you can skip that step and jump to your solution:

Python
      
    
# football_v1.py
def get_min_score_difference(filename):
    with open(filename, "r", newline="") as csv_file:
        min_diff = 10000
        min_team = None
        for line in parse_next_line(csv_file):
            team, diff = get_name_and_diff(line)
            if diff < min_diff:
                min_diff = diff
                min_team = team
    return min_team, min_diff

This function uses a context manager to open the given CSV file for reading. It then sets up the min_diff and min_team variables you’ll use to keep track of the minimum you’ve found as you walk through the list. You start the minimum differential at 10000, which seems safe for football scores.

The function then walks through each line, getting the team name and differential, and finds the minimum value for the differential.

When you run this code against the test, it passes:

Shell
      
 $ pytest test_football_v1.py
============================= test session starts ==============================
platform linux -- Python 3.7.1, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/jima/coding/realPython/articles/jima-csv
collected 3 items

test_football_v1.py ...                                                  [100%]

============================== 3 passed in 0.03s ===============================

Congratulations! You’ve achieved a solution to the stated problem!

Once you’ve done that, especially in an interview situation, it’s time to examine your solution and see if you can identify changes that will make the code more readable, more robust, or more Pythonic. That’s what you’ll do in the next section.

Solution 2: Refactoring Solution 1

Take a look at your first solution for this problem as a whole:

Python
      
    
# football_v1.py
import csv

def parse_next_line(csv_file):
    for line in csv.DictReader(csv_file):
        yield line

def get_name_and_diff(team_stats):
    diff = int(team_stats["Goals For"]) - int(team_stats["Goals Against"])
    return team_stats["Team"], abs(diff)

def get_min_score_difference(filename):
    with open(filename, "r", newline="") as csv_file:
        min_diff = 10000
        min_team = None
        for line in parse_next_line(csv_file):
            team, diff = get_name_and_diff(line)
            if diff < min_diff:
                min_diff = diff
                min_team = team
    return min_team, min_diff

Looking at this code as a whole, there are a few things to notice. One of them is that get_name_and_diff() doesn’t do all that much. It only pulls three fields out of a dictionary and subtracts. The first function, parse_next_line() is also fairly short, and it seems possible you could combine these two, having the generator only return the team name and score differential.

You can refactor those two functions into a new one called get_next_name_and_diff(). If you’re following along with this tutorial, now is a good time to copy football_v1.py to football_v2.py and do similarly with the test files. Sticking with your TDD process, you’ll reuse the tests you had for your first solution:

Python
      
    
# test_football_v2.py
import pytest
import football_v2 as fb

 # ...

def test_get_min_score(mock_csv_file):
    assert fb.get_min_score_difference(mock_csv_file) == (
        "Norwich City FC",
        49,
    )

def test_get_score_difference(mock_csv_data):
    reader = fb.get_next_name_and_diff(mock_csv_data)
    assert next(reader) == ("Liverpool FC", 52)
    assert next(reader) == ("Norwich City FC", 49)
    with pytest.raises(StopIteration):
        next(reader)

The first test, test_get_min_score(), remains the same since it tests the top level of functionality, which is unchanged.

The other two test functions get merged into a single function, combining the tests for the number of items returned and the values returned into a single test. It uses the generator returned from get_next_name_and_diff() directly with the help of Python’s built-in next().

Here’s what it looks like when you slam those two non-test functions together:

Python
      
    
# football_v2.py
import csv

def get_next_name_and_diff(csv_file):
    for team_stats in csv.DictReader(csv_file):
        diff = int(team_stats["Goals For"]) - int(team_stats["Goals Against"])
        yield team_stats["Team"], abs(diff)

This function does look like the previous functions jammed together. It uses csv.DictReader() and, instead of yielding the dict created from each line, yields only the team name and the computed differential.

While that’s not a huge improvement in terms of readability, it’ll allow you to do some other simplification in the remaining function.

The remaining function, get_min_score_difference(), has some room for improvement as well. Manually walking through a list to find a minimum is something that the standard library provides. Fortunately, this is the top-level functionality, so your tests don’t need to change.

As mentioned above, you can use min() from the standard library to find the smallest item in a list or iterable. The “or iterable” part is significant. Your get_next_name_and_diff() generator qualifies as an iterable, so min() will run through the generator and find the minimal result.

One issue is that get_next_name_and_diff() yields (team_name, score_differential) tuples, and you want to minimize the differential value. To facilitate this use case, min() has a keyword parameter, key. You can supply a function, or in your case a lambda, to indicate which values it’ll use to search for the minimum value:

Python
      
    
# football_v2.py
def get_min_score_difference(filename):
    with open(filename, "r", newline="") as csv_data:
        return min(get_next_name_and_diff(csv_data), key=lambda item: item[1])

This change collapses the code down to a much smaller, more Pythonic function. The lambda used for key allows min() to find the minimum value of the score differentials. Running pytest against the new code shows that it still solves the stated problem:

Shell
      
$ pytest test_football_v2.py
============================= test session starts ==============================
platform linux -- Python 3.7.1, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/jima/coding/realPython/articles/jima-csv
collected 3 items

test_football_v2.py ...                                                  [100%]

============================== 3 passed in 0.01s ===============================

Taking the time to examine and refactor code in this manner is a great practice to use in day-to-day coding, but it may or may not be practical in an interview setting. Even if you don’t feel you have the time or energy during an interview to completely refactor your solution, it’s worth spending a little bit of time showing the interviewer that you’re thinking along these lines.

When you’re in an interview, taking a minute to point out, “These functions are small—I could merge them,” or, “If I pushed on this explicit loop, then I could use the min() functionality” will show the interviewer that you know these things. No one comes up with the optimal solution on the first attempt.

Another topic that would be worth discussing in an interview is corner cases. Does the solution handle bad data lines? Topics like this make for good tests and can catch lots of problems early. Sometimes discussing them in an interview situation is sufficient, and sometimes it would pay to go back and refactor your tests and code to handle these issues.

You also might want to discuss the problem definition. In particular, this problem has an ambiguous specification. What should the solution be if two teams share the same differential? The solutions you’ve seen here select the first one, but it’s possible to return all of them, or the last one, or some other decision.

This type of ambiguity is quite common in real projects, so recognizing it and bringing it up as a topic can be a sign that you’re thinking beyond just the code solution.

Now that you’ve used the Python csv module to solve one problem, try again with a similar problem.

Python CSV Parsing: Weather Data

Your second problem will look fairly similar to your first. It might be a good idea to use a similar structure to solve it. Once you’ve walked through a solution for this problem, you’ll read about some ideas for refactoring to reuse code, so keep that in mind as you work.

Problem Description

This problem involves parsing weather data in a CSV file:

Highest Average Temperature

Write a program that takes a filename on the command line and processes the contents of a CSV file. The contents will be a month of weather data, one day per line.

Your program should determine which day had the highest average temperature, where the average temperature is the average of the day’s high and low temperatures. This is not normally how average temperature is computed, but it’ll work for this demonstration.

The first line of the CSV file will be column headers:
CSV
Day,MaxT,MinT,AvDP,1HrP TPcn,PDir,AvSp,Dir,MxS,SkyC,MxR,Mn,R AvSLP
1,88,59,74,53.8,0,280,9.6,270,17,1.6,93,23,1004.5
The day number, max temperature, and min temperature are the first three columns.

Write unit tests with pytest to test your program.

As with the football scores problem, there are unit tests supplied in the skeleton code that test the problem statement:

Python
      
    
# test_weather_v1.py
import pytest
import weather_v1 as wthr

@pytest.fixture
def mock_csv_data():
    return [
        "Day,MxT,MnT,AvT,AvDP,1HrP TPcn,PDir,AvSp,Dir,MxS,SkyC,MxR,Mn,R AvSLP",
        "1,88,59,74,53.8,0,280,9.6,270,17,1.6,93,23,1004.5",
        "2,79,63,71,46.5,0,330,8.7,340,23,3.3,70,28,1004.5",
    ]

@pytest.fixture
def mock_csv_file(tmp_path, mock_csv_data):
    datafile = tmp_path / "weather.csv"
    datafile.write_text("\n".join(mock_csv_data))
    return str(datafile)

Again, note that there are two fixtures given. The first supplies a list of strings that mocks real CSV data, and the second supplies a filename backed by that test data. Each string in the list of strings represents a line of the test file.

Remember that the provided fixtures are only a start. Add tests as you design each part of the solution!

Remove ads

Problem Solution

Here’s a discussion of what the Real Python team arrived at.

Note: Remember, don’t open the collapsed section below until you’re ready to look at the answers for this Python practice problem!

The solution you’ll look at here is quite similar to the previous solution. You saw the slightly different set of test data above. The two test functions are basically the same as the football solution:

Python
      
    
# test_weather_v1.py
import pytest
import weather_v1 as wthr

 # ...

def test_get_max_avg(mock_csv_file):
    assert wthr.get_max_avg(mock_csv_file) == (1, 73.5)

def test_get_next_day_and_avg(mock_csv_data):
    reader = wthr.get_next_day_and_avg(mock_csv_data)
    assert next(reader) == (1, 73.5)
    assert next(reader) == (2, 71)
    with pytest.raises(StopIteration):
        next(reader)

While those tests are good, as you think about the problem more and find bugs in your solution, it’s good to add new tests. Here are some new tests to cover some of the corner cases you thought about at the end of the last problem:

Python
      
    
# test_weather_v1.py
import pytest
import weather_v1 as wthr

 # ...

def test_no_lines():
    no_data = []
    for _ in wthr.get_next_day_and_avg(no_data):
        assert False

def test_trailing_blank_lines(mock_csv_data):
    mock_csv_data.append("")
    all_lines = [x for x in wthr.get_next_day_and_avg(mock_csv_data)]
    assert len(all_lines) == 2
    for line in all_lines:
        assert len(line) == 2

def test_mid_blank_lines(mock_csv_data):
    mock_csv_data.insert(1, "")
    all_lines = [x for x in wthr.get_next_day_and_avg(mock_csv_data)]
    assert len(all_lines) == 2
    for line in all_lines:
        assert len(line) == 2

These tests cover cases where an empty file is passed in and also where there are blank lines in the middle or at the end of the CSV file. The case where there’s bad data in the first line of the file is a bit more challenging. If the first line doesn’t contain the labels, does the data still satisfy the requirements of the problem? The Real Python solution assumes this is invalid and doesn’t test for it.

The code itself doesn’t need to change much for this problem. As before, if you’re working through these solutions on your machine, now is a good time to copy football_v2.py to weather_v1.py.

If you start with the football solution, then the generator function is renamed to get_next_day_and_avg() and the function which calls it is now get_max_avg():

Python
      
    
# weather_v1.py
import csv

def get_next_day_and_avg(csv_file):
    for day_stats in csv.DictReader(csv_file):
        day_number = int(day_stats["Day"])
        avg = (int(day_stats["MxT"]) + int(day_stats["MnT"])) / 2
        yield day_number, avg

def get_max_avg(filename):
    with open(filename, "r", newline="") as csv_file:
        return max(get_next_day_and_avg(csv_file), key=lambda item: item[1])

In this case, you change get_next_day_and_avg() slightly. Instead of the team name and score differential, you’re now getting an integer representing the day number and computing the average temperature.

The function that calls get_next_day_and_avg() has changed to use max() instead of min() but also maintains the same structure.

Running the new tests against this code shows the advantage of using the tools from the standard library:

Shell
      
$ pytest test_weather_v1.py
============================= test session starts ==============================
platform linux -- Python 3.7.1, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/jima/coding/realPython/articles/jima-csv
collected 5 items

test_weather_v1.py .....                                                 [100%]

============================== 5 passed in 0.05s ===============================

The new function passes the new blank line tests you added. The csv.DictReader() takes care of those cases for you. Your tests are running without errors and you have a great solution!

In an interview, it might be good to discuss the performance of your solution. With the small data files provided with the skeleton code here, the performance, both in terms of speed and memory usage, is not important. But what if the weather data was a daily report for the last century. Would this solution run into memory problems? Are there ways around those problems by redesigning the solution?

The two solutions so far have similar structures. In the next section, you’ll look at refactoring these solutions and how you might share code between them.

Python CSV Parsing: Refactoring

The two problems you’ve looked at so far are quite similar, and the programs to solve them have been quite similar as well. An interesting interview question might be to ask you to refactor these two solutions to find a way to share code and make them more maintainable.

Problem Description

This problem is a little different from the previous two. For this section, take your solutions from the previous problems and refactor them to reuse common code and structures. In a real-world situation, these solutions are small enough that the refactoring effort here is probably not worthwhile, but it does make for a good thought exercise.

Problem Solution

Here’s the refactoring that the Real Python team arrived at.

Note: Remember, don’t open the collapsed section below until you’re ready to look at the answers for this Python practice problem!

Start by looking at just the solution code for the two problems. The football solution, not counting the tests, was two functions long:

Python
      
    
# football_v2.py
import csv

def get_next_name_and_diff(csv_file):
    for team_stats in csv.DictReader(csv_file):
        diff = int(team_stats["Goals For"]) - int(team_stats["Goals Against"])
        yield team_stats["Team"], abs(diff)

def get_min_score_difference(filename):
    with open(filename, "r", newline="") as csv_data:
        return min(get_next_name_and_diff(csv_data), key=lambda item: item[1])

Similarly, the average temperature solution consisted of two functions. The similar structure points to areas for refactoring:

Python
      
    
# weather_v1.py
import csv

def get_next_day_and_avg(csv_file):
    for day_stats in csv.DictReader(csv_file):
        day_number = int(day_stats["Day"])
        avg = (int(day_stats["MxT"]) + int(day_stats["MnT"])) / 2
        yield day_number, avg

def get_max_avg(filename):
    with open(filename, "r", newline="") as csv_file:
        return max(get_next_day_and_avg(csv_file), key=lambda item: item[1])

When comparing code, it’s sometimes useful to use a diff tool to compare the text of each one. You might need to remove extra code from the files to get an accurate picture, though. In this case, the docstrings were removed. When you diff the two solutions, you can see that they’re quite similar:

File Changes (diff)
      
    
--- football_v2.py   2021-02-09 19:22:05.653628190 -0700
+++ weather_v1.py 2021-02-09 19:22:16.769811115 -0700
@@ -1,9 +1,10 @@
-def get_next_name_and_diff(csv_file):
-    for team_stats in csv.DictReader(csv_file):
-        diff = int(team_stats["Goals For"]) - int(team_stats["Goals Against"])
-        yield team_stats["Team"], abs(diff)
+def get_next_day_and_avg(csv_file):
+    for day_stats in csv.DictReader(csv_file):
+        day_number = int(day_stats["Day"])
+        avg = (int(day_stats["MxT"]) + int(day_stats["MnT"])) / 2
+        yield day_number, avg


-def get_min_score_difference(filename):
-    with open(filename, "r", newline="") as csv_data:
-        return min(get_next_name_and_diff(csv_data), key=lambda item: item[1])
+def get_max_avg(filename):
+    with open(filename, "r", newline="") as csv_file:
+        return max(get_next_day_and_avg(csv_file), key=lambda item: item[1])

Other than the names of the functions and variables, there are two main differences:

The football solution yields the difference of Goals For and Goals Against, while the weather solution yields the average of MxT and MnT.
The football solution finds the min() of the results, whereas the weather solution uses max().

The second difference might not be worth addressing, so let’s start with the first one.

The two generator functions are structurally the same. The part that’s different can generally be described as “Take a row of data and return two values from it,” which sounds like a function definition.

If you rewrite the football solution to have a function for that, it makes the program longer:

Python
      
    
# football_v3.py
import csv

def get_name_and_diff(team_stats):
    diff = int(team_stats["Goals For"]) - int(team_stats["Goals Against"])
    return team_stats["Team"], abs(diff)

def get_next_name_and_diff(csv_file):
    for team_stats in csv.DictReader(csv_file):
        yield get_name_and_diff(team_stats)

While that code is longer, it brings up a few interesting points worth discussing in an interview. Sometimes when you’re refactoring, making code more readable will result in longer code. That’s probably not the case here as it’s hard to argue that having this function split out makes the code more readable.

There is another point, however. Sometimes to refactor code, you have to make the code less readable or concise to make the common portions visible. That’s definitely where you’re headed here.

Finally, this is an opportunity to discuss the single-responsibility principle. At a high level, the single-responsibility principle states that you want each section of code, a class, a method, or a function, to do only one thing or have only one responsibility. In the above refactoring, you’re pulling the responsibility of extracting values from each row’s data out of the function responsible for iterating over the csv.DictReader().

If you look back at the refactoring you did between solution 1 and solution 2 of the football problem above, you’ll see that the initial refactoring merged parse_next_line() and get_name_and_diff() into a single function. In this refactoring, you’re pulling them back apart! This might seem contradictory at first, so it’s worth examining more closely.

In the first refactoring, merging the two functions could easily be called a violation of the single responsibility principle. In that case, there’s a readability trade-off between having two small functions that only work together and merging them into one still-small function. Merging them, in that case, seems to make the code more readable, although that’s subjective.

In this case, you’re splitting the two functions apart for a different reason. The splitting here isn’t the final goal, but rather it’s a step on the way to your goal. By splitting the function into two, you’re able to isolate and share common code between the two solutions.

For such a small example, this splitting might not be justified. However, it allows you more opportunities for sharing code as you’ll see below. This technique, pulling a block of functionality out of one function and into a separate one, is frequently referred to as the extract method technique. Some IDEs and code editors provide tools to help you with this operation.

At this point, you haven’t gained anything, and the next step will make this code slightly more complex. You’ll pass get_name_and_diff() into the generator. This might seem counterintuitive at first, but it’ll allow you to reuse the generator structure:

Python
      
    
# football_v4.py
import csv

def get_name_and_diff(team_stats):
    diff = int(team_stats["Goals For"]) - int(team_stats["Goals Against"])
    return team_stats["Team"], abs(diff)

def get_next_name_and_diff(csv_file, func):
    for team_stats in csv.DictReader(csv_file):
        yield func(team_stats)

def get_min_score_difference(filename):
    with open(filename, "r", newline="") as csv_data:
        return min(
            get_next_name_and_diff(csv_data, get_name_and_diff),
            key=lambda item: item[1],
        )

This might seem like a waste, but sometimes refactoring is a process of breaking solutions into pieces to isolate the parts that are different. Try making the same changes to the weather solution:

Python
      
    
# weather_v2.py
import csv

def get_day_and_avg(day_stats):
    day_number = int(day_stats["Day"])
    avg = (int(day_stats["MxT"]) + int(day_stats["MnT"])) / 2
    return day_number, avg

def get_next_day_and_avg(csv_file, func):
    for day_stats in csv.DictReader(csv_file):
        yield func(day_stats)

def get_max_avg(filename):
    with open(filename, "r", newline="") as csv_file:
        return max(
            get_next_day_and_avg(csv_file, get_day_and_avg),
            key=lambda item: item[1],
        )

That makes the two solutions look much more similar and, more importantly, really highlights the parts that are different between the two. Now the differences between the two solutions are largely contained in the functions that are passed in:

File Changes (diff)
      
    
--- football_v4.py   2021-02-20 16:05:53.775322250 -0700
+++ weather_v2.py 2021-02-20 16:06:04.771459061 -0700
@@ -1,19 +1,20 @@
 import csv


-def get_name_and_diff(team_stats):
-    diff = int(team_stats["Goals For"]) - int(team_stats["Goals Against"])
-    return team_stats["Team"], abs(diff)
+def get_day_and_avg(day_stats):
+    day_number = int(day_stats["Day"])
+    avg = (int(day_stats["MxT"]) + int(day_stats["MnT"])) / 2
+    return day_number, avg


-def get_next_name_and_diff(csv_file, func):
-    for team_stats in csv.DictReader(csv_file):
-        yield func(team_stats)
+def get_next_day_and_avg(csv_file, func):
+    for day_stats in csv.DictReader(csv_file):
+        yield func(day_stats)


-def get_min_score_difference(filename):
-    with open(filename, "r", newline="") as csv_data:
-        return min(
-            get_next_name_and_diff(csv_data, get_name_and_diff),
+def get_max_avg(filename):
+    with open(filename, "r", newline="") as csv_file:
+        return max(
+            get_next_day_and_avg(csv_file, get_day_and_avg),
             key=lambda item: item[1],
         )

Once you’re to this point, you can rename the generator function to be more generic. You can also move it into its own module so you can reuse that code for both solutions:

Python
      
    
# csv_parser.py
import csv

def get_next_result(csv_file, func):
    for stats in csv.DictReader(csv_file):
        yield func(stats)

Now you can refactor each of the solutions to use this common code. This is the refactored version of the football solution:

Python
      
    
# football_final.py
import csv_reader

def get_name_and_diff(team_stats):
    diff = int(team_stats["Goals For"]) - int(team_stats["Goals Against"])
    return team_stats["Team"], abs(diff)

def get_min_score_difference(filename):
    with open(filename, "r", newline="") as csv_data:
        return min(
            csv_reader.get_next_result(csv_data, get_name_and_diff),
            key=lambda item: item[1],
        )

The final version of the weather solution, while similar, is different in those places where the problems require it:

Python
      
    
# weather_final.py
import csv_parser

def get_name_and_avg(day_stats):
    day_number = int(day_stats["Day"])
    avg = (int(day_stats["MxT"]) + int(day_stats["MnT"])) / 2
    return day_number, avg

def get_max_avg(filename):
    with open(filename, "r", newline="") as csv_file:
        return max(
            csv_parser.get_next_result(csv_file, get_name_and_avg),
            key=lambda item: item[1],
        )

The unit tests you’ve written can be split so that they test each module separately.

While this particular refactoring results in less code, it’s good to think about—and, in an interview context, discuss—whether this is a good idea. For this particular set of solutions, it’s probably not. The amount of code that’s shared here is about ten lines, and those lines were used only two times. Also, the two problems are fairly unrelated overall, which makes the combined solution a little less sensible.

However, if you had to do forty operations that fit into this mold, then this type of refactoring could be beneficial. Or if the generator function you shared was complex and difficult to get correct, then it would also be a bigger win.

These are excellent topics to discuss during an interview. For a problem set like this, however, you might want to talk about the most frequently used package when dealing with CSV files: pandas. You’ll look at that now.

Python CSV Parsing: pandas

So far, you used the csv.DictReader class from the standard library for your solutions, and that has worked well for these relatively small problems.

For larger problems, the pandas package can provide great results with excellent speed. Your last challenge is to rewrite the football program above using pandas.

Problem Description

Here’s your final problem for this tutorial. For this problem, you’ll rewrite your solution to the football problem using pandas. The pandas solution will likely look different than the solution using just the standard library.

Problem Solution

Here’s a discussion of the solution the team arrived at and how they got there.

Note: Remember, don’t open the collapsed section below until you’re ready to look at the answers for each of the Python practice problems!

The structure of this pandas solution is different from the standard library solution. Rather than using a generator, you use pandas to parse the file and create a DataFrame.

Because of this difference, your tests will look similar but slightly different:

Python
      
    
# test_football_pandas.py
import pytest
import football_pandas as fb

@pytest.fixture
def mock_csv_file(tmp_path):
    mock_csv_data = [
        "Team,Games,Wins,Losses,Draws,Goals For,Goals Against",
        "Liverpool FC, 38, 32, 3, 3, 85, 33",
        "Norwich City FC, 38, 5, 27, 6, 26, 75",
    ]
    datafile = tmp_path / "football.csv"
    datafile.write_text("\n".join(mock_csv_data))
    return str(datafile)

def test_read_data(mock_csv_file):
    df = fb.read_data(mock_csv_file)
    rows, cols = df.shape
    assert rows == 2
    # The dataframe df has all seven of the cols in the original dataset plus
    # the goal_difference col added in read_data().
    assert cols == 8

def test_score_difference(mock_csv_file):
    df = fb.read_data(mock_csv_file)
    assert df.team_name[0] == "Liverpool FC"
    assert df.goal_difference[0] == 52
    assert df.team_name[1] == "Norwich City FC"
    assert df.goal_difference[1] == 49

def test_get_min_diff(mock_csv_file):
    df = fb.read_data(mock_csv_file)
    diff = fb.get_min_difference(df)
    assert diff == 49

def test_get_team_name(mock_csv_file):
    df = fb.read_data(mock_csv_file)
    assert fb.get_team(df, 49) == "Norwich City FC"
    assert fb.get_team(df, 52) == "Liverpool FC"

def test_get_min_score(mock_csv_file):
    assert fb.get_min_score_difference(mock_csv_file) == (
        "Norwich City FC",
        49,
    )

These tests cover three actions:

Reading the file and creating the DataFrame
Finding the minimum differential
Finding the team name corresponding to that minimum

The tests are fairly similar to the tests from the first problem, so rather than examining the tests in detail, you can focus on the solution code and see how that works. You’ll start with a function called read_data() for creating the DataFrame:

Python
      
    
# football_pandas.py
import pandas as pd

def read_data(csv_file):
    return (
        pd.read_csv(csv_file)
        .rename(
            columns={
                "Team": "team_name",
                "Goals For": "goals",
                "Goals Against": "goals_allowed",
            }
        )
        .assign(goal_difference=lambda df: abs(df.goals - df.goals_allowed))
    )

Wow! That’s a bunch of code for a one-line function. Chaining method calls together like this is called using a fluent interface and is fairly common when working with pandas. Each method on a DataFrame returns a DataFrame object, so you can chain the method calls together.

The key to understanding code like this is to work through it from left to right and top to bottom if it spans multiple lines.

In this case, you start on line 6 with pd.read_csv(), which reads the CSV file and returns your initial DataFrame object.

The next step, on line 7, is to call .rename() on that returned DataFrame. This will rename the columns of the DataFrame to names that will work as attributes. The three columns you care about are renamed to team_name, goals, and goals_allowed. You’ll see how you access those in a little bit.

The return value from .rename() is a new DataFrame, and on line 14, you call .assign() on it to add a new column. The column will be called goal_difference, and you provide a lambda function to compute it for each row. Again, .assign() returns the DataFrame object it was called on, which is used for the return value for this function.

Note: pandas provides attributes for each of the column names that you’ll be using in this solution. This produces nice, readable results. It does have a potential pitfall, however.

If the attribute name conflicts with a DataFrame method in pandas, the naming conflict could cause unexpected behavior. If you’re in doubt, you can always use .loc[] to access the column values.

The next function in your solution shows some of the magic pandas can provide. get_min_score_difference() takes advantage of pandas’ ability to address an entire column as an object and call methods on it. In this instance, you call .min() to find the minimum value for the column:

Python
      
    
# football_pandas.py
def get_min_difference(parsed_data):
    return parsed_data.goal_difference.min()

There are several functions like .min() provided by pandas that allow you to manipulate rows and columns quickly and efficiently.

The next piece of your solution is finding the team name that corresponds with the minimum score differences. get_team() again uses a fluent programming style to chain several calls on a single DataFrame together:

Python
      
    
# football_pandas.py
def get_team(parsed_data, min_score_difference):
    return (
        parsed_data.query(f"goal_difference == {min_score_difference}")
        .reset_index()
        .loc[0, "team_name"]
    )

In this function, you call .query(), specifying that you want rows where the goal_difference column is equal to the minimum value you found earlier. The return value from .query() is a new DataFrame with the same columns but only those rows that match the query.

Because of some of the internals of how pandas manages indices on queries, the next call, .reset_index(), is needed to facilitate access to the first row of this new DataFrame. Once the index is reset, you call .loc[] to get row 0 and the team_name column, which should return the name of the team from the first row that matches the minimum score difference.

Finally, you need a function to put this all together and return both the team name and the minimum difference. As with the other solutions for this problem, this function is called get_min_score_difference():

Python
      
    
# football_pandas.py
def get_min_score_difference(csv_file):
    df = read_data(csv_file)
    min_diff = get_min_difference(df)
    team = get_team(df, min_diff)
    return team, min_diff

This function uses the previous three functions to put together the team name and minimum difference.

That completes your pandas version of the football program. It looks different from the other two solutions:

Python
      
    
# football_pandas.py
import pandas as pd

def read_data(csv_file):
    return (
        pd.read_csv(csv_file)
        .rename(
            columns={
                "Team": "team_name",
                "Goals For": "goals",
                "Goals Against": "goals_allowed",
            }
        )
        .assign(goal_difference=lambda df: abs(df.goals - df.goals_allowed))
    )

def get_min_difference(parsed_data):
    return parsed_data.goal_difference.min()

def get_team(parsed_data, min_score_difference):
    return (
        parsed_data.query(f"goal_difference == {min_score_difference}")
        .reset_index()
        .loc[0, "team_name"]
    )

def get_min_score_difference(csv_file):
    df = read_data(csv_file)
    min_diff = get_min_difference(df)
    team = get_team(df, min_diff)
    return team, min_diff

Now that you’ve seen a pandas-based solution to this problem, it’s a good idea to think about what makes this solution better or worse than the other ones you’ve seen. This type of discussion can be a great thing to bring up in an interview.

The pandas solution here is a bit longer than that standard library version, but it certainly could be shortened if that was the goal. For a small problem like this, pandas might be overkill for the problem. On larger, more complex problems, however, taking the extra time and complexity to bring in pandas can save you significant amounts of coding effort and provide solutions more quickly than working with the CSV library directly.

Another angle to discuss here is whether or not the project you’re working on has or is allowed to take on external dependencies. There are some projects where pulling in an extra package like pandas might require significant political or technical work. In cases like that, the standard library solution would be preferable.

Remove ads

Conclusion

That’s the end of this set of Python CSV parsing practice problems! You’ve gotten some practice applying your Python skills to CSV files and also got to spend some time thinking about trade-offs you can discuss during an interview. You then looked at refactoring solutions, both in terms of a single problem and also refactoring common code out of two solutions.

In addition to solving these problems, you covered:

Writing code with the csv.DictReader() class
Using pandas to solve CSV problems
Discussing your solutions during the interview
Talking about design decisions and trade-offs

You’re now ready to face a Python CSV parsing problem and discuss it in an interview! Feel free to reach out in the comments section below with any questions you have or suggestions for other Python practice problems you’d like to see. Good luck with the interview!

Remember, you can download the skeleton code for these problems by clicking the link below:

Get the Source Code: Click here to get the source code you’ll use to practice parsing CSV files in this tutorial.

What Do You Think?

Rate this article:

What’s your #1 takeaway or favorite thing you learned? How are you going to put your newfound skills to use? Leave a comment below and let us know.

Commenting Tips: The most useful comments are those written with the goal of learning from or helping out other students. Get tips for asking good questions and get answers to common questions in our support portal.

Looking for a real-time conversation? Visit the Real Python Community Chat or join the next “Office Hours” Live Q&A Session. Happy Pythoning!