# Python Interview Problem – Parsing CSV Files

by Real Python May 31, 2014 data-science fundamentals python

So, a friend of mine recently interviewed for a back-end Python developer position, and the initial interview consisted of answering the following problem. He was given two hours.

## Problem

1. Football: The football.csv file contains the results from the English Premier League. The columns labeled ‘Goals’ and ‘Goals Allowed’ contain the total number of goals scored for and against each team in that season (so Arsenal scored 79 goals against opponents, and had 36 goals scored against them). Write a program to read the file, then print the name of the team with the smallest difference in ‘for’ and ‘against’ goals.
2. Weather: In weather.csv you’ll find daily weather data. Write a program to read the file, then output the day number (column one) with the smallest temperature spread (the maximum temperature is the second column, the minimum the third column).
3. See if you can write the same program to solve both questions.
4. Test Driven Development!

You can grab all the files here.

Try this on your own before you look at my answer below.

Note: I have not solved this problem before, so I will be going through a number of iterations. Hopefully, this will give you insight into the process/workflow I went through. Once finished, compare your workflow with mine. What did you do differently? Finally, I will be using a true TDD approach for this. In other words, after I write a test, I will only write the bare minimum amount of code to get it to pass. That said, there will most likely be plenty of refactoring throughout. Bare with me. Patience.

## Part 1: Football

Here’s the input “football.csv” file:

```Team,Games,Wins,Losses,Draws,Goals,Goals Allowed,Points
Arsenal,38,26,9,3,79,36,87
Liverpool,38,24,8,6,67,30,80
Manchester United,38,24,5,9,87,45,77
Newcastle,38,21,8,9,74,52,71
Leeds,38,18,12,8,53,37,66
Chelsea,38,17,13,8,66,38,64
West_Ham,38,15,8,15,48,57,53
Aston_Villa,38,12,14,12,46,47,50
Tottenham,38,14,8,16,49,53,50
Blackburn,38,12,10,16,55,51,46
Southampton,38,12,9,17,46,54,45
Middlesbrough,38,12,9,17,35,47,45
Fulham,38,10,14,14,36,44,44
Charlton,38,10,14,14,38,49,44
Everton,38,11,10,17,45,57,43
Bolton,38,9,13,16,44,62,40
Sunderland,38,10,10,18,29,51,40
Ipswich,38,9,9,20,41,64,36
Derby,38,8,6,24,33,63,30
Leicester,38,5,13,20,30,64,28
```

In true TDD style, we start by writing our tests that will fail. Since we must first read in the CSV data, let’s ensure that we can do that, by testing that the data read in is what we think it should be.

```import unittest

class ParseCSVTest(unittest.TestCase):

def setUp(self):
self.data = 'football.csv'

self.assertEqual(
['Team', 'Games', 'Wins', 'Losses', 'Draws', 'Goals', 'Goals Allowed', 'Points']
)

if __name__ == '__main__':
unittest.main()
```

Here, we are using a function imported from parse_csv.py called `read_data()`, which we haven’t written yet, to read in the data. Then we have three unit tests to check the values of the data read in.

```\$ python parse_csv_test.py
Traceback (most recent call last):
File "parse_csv_test.py", line 2, in <module>
```

### Write the `read_data()` function

Next, let’s write just enough code in parse_csv.py to get our tests to pass.

```import csv

with open(data, 'r') as f:
return data

# ---- run code ---- #

data = "football.csv"
```

Run the tests again:

```\$ python parse_csv_test.py -v

----------------------------------------------------------------------
Ran 3 tests in 0.001s

OK
```

### Test Redux

Back to the test file. This time, we want to find the “smallest difference in ‘for’ and ‘against’ goals”. So, add this import and function:

```from parse_csv import read_data, get_min_score_difference

# ...

def test_get_min_score_difference(self):
self.assertEqual(get_min_score_difference(parsed_data), "no idea")
```

Since, we don’t know off hand what the smallest difference will be, we can just add the string “no idea”. You could update that when you know what the answer is after we add the `get_min_score_difference()` function to parse_csv.py - but how do we know if that answer is correct?

Keep in mind, that since this is a small amount of data, you can get away with calculating them all by hand and finding the smallest difference. Or you could also use test data.:

```def test_get_min_score_difference(self):
parsed_data = [
['Team', 'Games', 'Wins', 'Losses', 'Draws', 'Goals', 'Goals Allowed', 'Points'],
['Arsenal', '38', '26', '9', '3', '79', '36', '87'],
['Liverpool', '38', '24', '8', '6', '67', '30', '80']
]
self.assertEqual(get_min_score_difference(parsed_data), '37')
```

Let’s go with the latter, since we know this is correct.

Run the test. Watch it fail.

### Write the `get_min_score_difference()` function

To simplify this, let’s use list comprehensions:

```def get_min_score_difference(parsed_data):
parsed_data.pop(0)
goals = [x[5] for x in parsed_data]
goals_allowed = [x[6]for x in parsed_data]
return min([float(x) - float(y) for x, y in zip(goals, goals_allowed)])
```

So, what’s going on:

1. First, we removed the header row since it will just get in the way.
2. Next, we created two new lists - one containing the goals score, the other containing the goals allowed
3. Finally, we simply created another new list containing the values from the first two lists subtracted, then we returned the smallest value.

Test:

```\$ python parse_csv_test.py -v
test_get_min_score_difference (__main__.ParseCSVTest) ... ok

----------------------------------------------------------------------
Ran 4 tests in 0.000s

OK
```

They pass. But remember - There is another step to the TDD process: Refactoring. (Hint, hint).

Ask yourself: Since we eventually want the name of the team with the smallest spread, does it help to get the minimum value? No. It would be much easier to find the minimum value and then return the index value so that we can plug that in to the next function to easily get the name of the team.

First, rewrite the test:

```def test_get_min_score_difference(self):
data = [
['Team', 'Games', 'Wins', 'Losses', 'Draws', 'Goals', 'Goals Allowed', 'Points'],
['Arsenal', '38', '26', '9', '3', '79', '36', '87'],
['Liverpool', '38', '24', '8', '6', '67', '30', '80']
]
self.assertEqual(get_min_score_difference(data), 1)
```

Now, let’s get it to pass:

```def get_min_score_difference(parsed_data):
parsed_data.pop(0)
goals = [x[5] for x in parsed_data]
goals_allowed = [x[6]for x in parsed_data]
values = [float(x) - float(y) for x, y in zip(goals, goals_allowed)]
return values.index(min(values))
```

Test again.

If you’re having trouble following the list comprehensions, try rewriting them using the regular list construct. For example:

```for x in parsed_data:
goals.append(x[5])
```

### Final Test

Again, let’s test to see which team has the smallest range between goals scored and goals allowed:

```def test_get_team(self):
data = [
['Team', 'Games', 'Wins', 'Losses', 'Draws', 'Goals', 'Goals Allowed', 'Points'],
['Arsenal', '38', '26', '9', '3', '79', '36', '87'],
['Liverpool', '38', '24', '8', '6', '67', '30', '80']
]
index_value = get_min_score_difference(data)
self.assertEqual(get_team(index_value), 'Liverpool')
```

Make sure to import in the `get_team` function. Run the tests file. This will fail. Success!

### Write the `get_team()` function

Simply use that index value from the previous function as an argument, then create a list of all the teams, and finally pass in that index value to that list of teams:

```def get_team(index_value, parsed_data):
teams = [x[0] for x in parsed_data]
return teams[index_value]
```

Tests pass.

### Refactor Tests

In our tests, let’s move our parsed data to the `setUp` so that we’re not repeating ourselves:

```import unittest
from parse_csv import read_data, get_min_score_difference, get_team

class ParseCSVTest(unittest.TestCase):

def setUp(self):
self.data = 'football.csv'
self.parsed_data = [
['Team', 'Games', 'Wins', 'Losses', 'Draws', 'Goals', 'Goals Allowed', 'Points'],
['Arsenal', '38', '26', '9', '3', '79', '36', '87'],
['Liverpool', '38', '24', '8', '6', '67', '30', '80']
]

self.assertEqual(
['Team', 'Games', 'Wins', 'Losses', 'Draws', 'Goals', 'Goals Allowed', 'Points']
)

def test_get_min_score_difference(self):
self.assertEqual(get_min_score_difference(self.parsed_data), 1)

def test_get_team(self):
index_value = get_min_score_difference(self.parsed_data)
self.assertEqual(get_team(index_value), 'Liverpool')

if __name__ == '__main__':
unittest.main()
```

### Refactor Code

So, per the instructions, we need to make this code work for both CSV files. Let’s refactor our code from using a procedural structure or an OOP structure, meant for code reuse:

```import csv

class ParseCSV(object):

def __init__(self, data):
self.data = data

with open(self.data, 'r') as f:
return parsed_data

def get_min_score_difference(self, parsed_data):
parsed_data.pop(0)
goals = [x[5] for x in parsed_data]
goals_allowed = [x[6]for x in parsed_data]
values = [float(x) - float(y) for x, y in zip(goals, goals_allowed)]
return values.index(min(values))

def get_team(self, index_value, parsed_data):
teams = [x[0] for x in parsed_data]
return teams[index_value]
```

Try running the tests; they will fail:

```\$ python parse_csv_test.py
Traceback (most recent call last):
File "parse_csv_test.py", line 2, in <module>
from parse_csv import read_data, get_min_score_difference, get_team
```

### Refactor Tests Redux

```import unittest
from parse_csv import ParseCSV

class ParseCSVTest(unittest.TestCase):

def setUp(self):
self.data = 'football.csv'
self.parsed_data = [
['Team', 'Games', 'Wins', 'Losses', 'Draws', 'Goals', 'Goals Allowed', 'Points'],
['Arsenal', '38', '26', '9', '3', '79', '36', '87'],
['Liverpool', '38', '24', '8', '6', '67', '30', '80']
]
self.football = ParseCSV(self.data)

self.assertEqual(
['Team', 'Games', 'Wins', 'Losses', 'Draws', 'Goals', 'Goals Allowed', 'Points']
)

def test_get_min_score_difference(self):
self.assertEqual(self.football.get_min_score_difference(self.parsed_data), 1)

def test_get_team(self):
index_value = self.football.get_min_score_difference(self.parsed_data)
self.assertEqual(self.football.get_team(index_value, self.parsed_data), 'Liverpool')

if __name__ == '__main__':
unittest.main()
```

There are minimal changes here. At this point, you could create a separate CSV file for testing. Instead, let’s go on to the next part and look at the CSV issue later.

Remember: Let’s not over optimize too early - just make the minimal changes necessary to get the tests to pass. We do have some naming issues and a few other issues to refactor. However, let’s work on the second part of the problem, incorporating the weather data into this code, then refactor at the end.

## Part 2: Weather

Here’s the input “weather.csv” file:

```Day,MxT,MnT,AvT,AvDP,1HrP TPcpn,PDir,AvSp,Dir,MxS,SkyC,MxR,Mn,R AvSLP
1,88,59,74,53.8,0,280,9.6,270,17,1.6,93,23,1004.5
2,79,63,71,46.5,0,330,8.7,340,23,3.3,70,28,1004.5
3,77,55,66,39.6,0,350,5,350,9,2.8,59,24,1016.8
4,77,59,68,51.1,0,110,9.1,130,12,8.6,62,40,1021.1
5,90,66,78,68.3,0,220,8.3,260,12,6.9,84,55,1014.4
6,81,61,71,63.7,0,30,6.2,30,13,9.7,93,60,1012.7
7,73,57,65,53,0,50,9.5,50,17,5.3,90,48,1021.8
8,75,54,65,50,0,160,4.2,150,10,2.6,93,41,1026.3
9,86,32,59,61.5,0,240,7.6,220,12,6,78,46,1018.6
10,84,64,74,57.5,0,210,6.6,50,9,3.4,84,40,1019
11,91,59,75,66.3,0,250,7.1,230,12,2.5,93,45,1012.6
12,88,73,81,68.7,0,250,8.1,270,21,7.9,94,51,1007
13,70,59,65,55,0,150,3,150,8,10,83,59,1012.6
14,61,59,60,55.9,0,60,6.7,80,9,10,93,87,1008.6
15,64,55,60,54.9,0,40,4.3,200,7,9.6,96,70,1006.1
16,79,59,69,56.7,0,250,7.6,240,21,7.8,87,44,1007
17,81,57,69,51.7,0,260,9.1,270,29,5.2,90,34,1012.5
18,82,52,67,52.6,0,230,4,190,12,5,93,34,1021.3
19,81,61,71,58.9,0,250,5.2,230,12,5.3,87,44,1028.5
20,84,57,71,58.9,0,150,6.3,160,13,3.6,90,43,1032.5
21,86,59,73,57.7,0,240,6.1,250,12,1,87,35,1030.7
22,90,64,77,61.1,0,250,6.4,230,9,0.2,78,38,1026.4
23,90,68,79,63.1,0,240,8.3,230,12,0.2,68,42,1021.3
24,90,77,84,67.5,0,350,8.5,10,14,6.9,74,48,1018.2
25,90,72,81,61.3,0,190,4.9,230,9,5.6,81,29,1019.6
26,97,64,81,70.4,0,50,5.1,200,12,4,107,45,1014.9
27,91,72,82,69.7,0,250,12.1,230,17,7.1,90,47,1009
28,84,68,76,65.6,0,280,7.6,340,16,7,100,51,1011
29,88,66,77,59.7,0,40,5.4,20,9,5.3,84,33,1020.6
30,90,45,68,63.6,0,240,6,220,17,4.8,200,41,1022.7
```

### Watch our tests Fail

Before we do anything, let’s use the new CSV file and watch our tests fail:

```\$ python parse_csv_test.py -v
test_get_min_score_difference (__main__.ParseCSVTest) ... ok
test_get_team (__main__.ParseCSVTest) ... ok
```

Notice how the only tests that pass are the tests that use `parsed_data` rather than the CSV. This is a good indication that we really should be using the actual data for testing. Let’s quickly refactor (again!):

```import unittest
from parse_csv import ParseCSV

class FootballParseCSVTest(unittest.TestCase):

def setUp(self):
self.data = 'football.csv'
self.football = ParseCSV(self.data)

self.assertEqual(
['Team', 'Games', 'Wins', 'Losses', 'Draws', 'Goals', 'Goals Allowed', 'Points']
)

def test_get_min_score_difference(self):
self.assertEqual(self.football.get_min_score_difference(parsed_data), 19)

def test_get_team(self):
index_value = self.football.get_min_score_difference(parsed_data)
self.assertEqual(self.football.get_team(index_value, parsed_data), 'Leicester')

if __name__ == '__main__':
unittest.main()
```

These will pass. Now change the data so that they run with weather.csv:

```\$ python parse_csv_test.py -v
test_get_min_score_difference (__main__.FootballParseCSVTest) ... ERROR
test_get_team (__main__.FootballParseCSVTest) ... ERROR
```

This is exactly what we want to see. Now we need to refactor both our code and tests to get our code to work with both data sets as well as simplify our tests to eliminate redundancy. (Notice a trend yet?)

### Refactor Tests and Code

#### Test

```def setUp(self):
self.football_data = 'football.csv'
self.weather_data = 'weather.csv'
self.football = ParseCSV(self.football_data)
self.weather = ParseCSV(self.weather_data)
```

Here, we are using both data sets and instantiating multiple instances of the same class, `ParseCSV()`. Yes, there is some redundancy here, but let’s just keep it simple for now. We’ll refactor at the end.

#### Test

```def test_csv_read_data_headers(self):
self.assertEqual(
['Team', 'Games', 'Wins', 'Losses', 'Draws', 'Goals', 'Goals Allowed', 'Points']
)
self.assertEqual(
['Day', 'MxT', 'MnT', 'AvT', 'AvDP', '1HrP TPcpn', 'PDir', 'AvSp', 'Dir', 'MxS', 'SkyC', 'MxR', 'Mn', 'R AvSLP']
)

```

This is straightforward. Comment out the other two tests and run just this one. It should pass.

#### Test

```def test_get_min_difference(self):
self.assertEqual(self.football.get_min_score_difference(football_parsed_data), 19)
self.assertEqual(self.weather.get_min_score_difference(weather_parsed_data), "no idea")
```

Uncomment this test. It will fail:

```\$ python parse_csv_test.py -v
test_get_min_difference (__main__.FootballParseCSVTest) ... FAIL
```

Why does it fail?

```AssertionError: 2 != 'no idea'
```

This essentially is saying that the smallest difference between the second and third column is row two. Is that correct? We addressed this very same issue earlier:

Since, we don’t know off hand what the smallest difference will be, we can just add the string “no idea”. You could update that when you know what the answer is after we add the `get_min_score_difference()` function to parse_csv.py - but how do we know if that answer is correct?

Keep in mind, that since this is a small amount of data, you can get away with calculating them all by hand and finding the smallest difference. Or you could also use test data.

Let’s go the other route this time: Just calculating it by hand. Do not do this if you have a lot of data!.

Do this now.

You should find that the row with the smallest difference is row 14 (or 13 minus the header). Update your tests, run them again, and they should still fail:

```AssertionError: 2 != 13
```

Why is this?

#### Code

Go back to parse.csv and look at the `get_min_score_difference()` function:

```def get_min_score_difference(self, parsed_data):
parsed_data.pop(0)
goals = [x[5] for x in parsed_data]
goals_allowed = [x[6]for x in parsed_data]
values = [float(x) - float(y) for x, y in zip(goals, goals_allowed)]
return values.index(min(values))
```

Did you figure it out?

We’re passing in the column values - `x[5]` and x[6] - which are applicable only to the football.csv. Thus, we need to pass in either the column index values or the header names so that this function uses the right data regardless of the data set used. Since we’re popping off the header, let’s use the column index values:

```def get_min_difference(self, parsed_data, column1, column2):
parsed_data.pop(0)
column1_list = [x[column1] for x in parsed_data]
column2_list = [x[column2]for x in parsed_data]
values = [float(x) - float(y) for x, y in zip(column1_list, column2_list)]
return values.index(min(values))
```

#### Test

Update the test:

```def test_get_min_difference(self):
self.assertEqual(self.football.get_min_difference(football_parsed_data, 5, 6), 19)
self.assertEqual(self.weather.get_min_difference(weather_parsed_data, 1, 2), 13)
```

And they pass:

```\$ python parse_csv_test.py -v
test_get_min_difference (__main__.FootballParseCSVTest) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.001s

OK
```

Uncomment the last test and update `get_min_score_difference()` to `get_min_difference()` then pass in the columns:

```def test_get_name(self):
index_value = self.football.get_min_difference(football_parsed_data, 5, 6)
self.assertEqual(self.football.get_team(index_value, football_parsed_data), 'Leicester')
index_value = self.weather.get_min_difference(weather_parsed_data, 1, 2)
self.assertEqual(self.weather.get_team(index_value, weather_parsed_data), '14')
```

Now they all pass:

```\$ python parse_csv_test.py -v
test_get_min_difference (__main__.FootballParseCSVTest) ... ok
test_get_name (__main__.FootballParseCSVTest) ... ok

----------------------------------------------------------------------
Ran 4 tests in 0.001s

OK
```

Finally, let’s add our own test data for use rather than using the actual CSV files to isolate test data from actual, real data.

```import unittest
from parse_csv import ParseCSV

class testParseCSVTest(unittest.TestCase):

def setUp(self):
self.data = 'test.csv'
self.test = ParseCSV(self.data)

self.assertEqual(
self.parsed_data[0],
['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']
)

def test_get_min_difference(self):
self.assertEqual(self.test.get_min_difference(self.parsed_data, 2, 3), 6)

def test_get_name(self):
index_value = self.test.get_min_difference(self.parsed_data, 2, 3)
self.assertEqual(self.test.get_team(index_value, self.parsed_data), '4/30/14')

if __name__ == '__main__':
unittest.main()
```

Run the tests. They should pass.

### One last thing …

One thing to keep in mind is the nature of the `pop()` method. It’s destructive, so it alters the original list, `parsed_data`, removing the headers permanently within the `get_min_difference()` function. What are the ramifications of this?

Let’s test it out.

Add a `print` statement to the `test_get_name()` function:

```def get_team(self, index_value, parsed_data):
print parsed_data[0]
teams = [x[0] for x in parsed_data]
return teams[index_value]
```

Run the tests again.

```\$ python parse_csv_test.py
...['5/8/14', '508.46', '517.23', '506.45', '511', '2015800', '511']
.
----------------------------------------------------------------------
Ran 4 tests in 0.001s

OK
```

Notice that row 0 contains data, not the headers. This can create problems. Sure, you could just assign a variable to the data when you pop it off - `headers = parsed_data.pop(0)` - then add the data back to the list. But, it’s generally better to work with non-destructive methods if you do not actually need to alter the original structure.

Thus, let’s use `slice` instead. Update the `get_min_difference()` function:

```def get_min_difference(self, parsed_data, column1, column2):
column1_list = [x[column1] for x in parsed_data[1:]]
column2_list = [x[column2]for x in parsed_data[1:]]
values = [float(x) - float(y) for x, y in zip(column1_list, column2_list)]
return values.index(min(values))
```

Run the test again. You should see the following failure:

```\$ python parse_csv_test.py
...F
======================================================================
FAIL: test_get_name (__main__.testParseCSVTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "parse_csv_test.py", line 27, in test_get_name
self.assertEqual(self.test.get_team(index_value, self.parsed_data), '4/30/14')
AssertionError: '5/1/14' != '4/30/14'

----------------------------------------------------------------------
Ran 4 tests in 0.001s

FAILED (failures=1)
```

This is because we are now dealing with data that has the header back in within the `get_team` function while the header was sliced off in `get_min_difference()`. Thus, we need to update the `get_team()` function:

```def get_team(self, index_value, parsed_data):
teams = [x[0] for x in parsed_data[1:]]
return teams[index_value]
```

Run the tests one last time… And you’re done!

🐍 Python Tricks 💌

Get a short & sweet Python Trick delivered to your inbox every couple of days. No spam ever. Unsubscribe any time. Curated by the Real Python team.

What do you think?

Real Python Comment Policy: The most useful comments are those written with the goal of learning from or helping out other readers—after reading the whole article and all the earlier comments. Complaints and insults generally won’t make the cut here.