Cleaning University Towns Data
For this lesson, you’ll be going documentation diving, so make sure you have your pandas documentation at the ready.
You’ll also be utilizing regex, so you might want to check out Real Python’s two-part series on regex here:
00:00 In this lesson, you’ll be getting the CSV data back, and you’re being asked to clean it. So you’re going to load it into a DataFrame again and see what needs to be done.
00:10 Some time has passed since you cleaned up that plain text file, and the CSV file you generated is being used, which is good. Then one day they give it back to you and ask for it to be cleaned finally.
00:22 You wonder why they didn’t let you clean it from the start, but, well, I guess that is just life. You search for your code, but you can’t find it. So you’re starting from scratch again.
00:32 But at least this time, you can load it straight into a pandas DataFrame because you know it’s a CSV file.
00:39 You have your boilerplate code here. You’re reading towns.csv and putting it into a variable. Run this and take a look at what’s going on here.
00:50 Run towns.head() with Control + Enter, and okay, it looks like the index is getting duplicated: the CSV file already contains an index column, and pandas is adding another one on top of it. So let’s fix that.
01:05 You can pass an argument to read_csv() called index_col, for index column. Pass it 0 because the index is the first column. Control + Enter to run it.
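As a minimal sketch of what index_col changes, the snippet below builds a small stand-in for towns.csv in memory. The sample rows and the use of io.StringIO are assumptions for illustration only; the real file is read from disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for towns.csv: a tiny frame written out with its
# index, which is what DataFrame.to_csv() does by default.
sample = pd.DataFrame(
    {
        "state": ["Alabama[edit]\n", "Alabama[edit]\n", "Alaska[edit]\n"],
        "town": ["Auburn\n", "Florence\n", "Fairbanks\n"],
    }
)
csv_text = sample.to_csv()

# Without index_col, the saved index comes back as an extra column
# and pandas adds a fresh index next to it.
duplicated = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the index instead.
towns = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(duplicated.columns.tolist())  # ['Unnamed: 0', 'state', 'town']
print(towns.columns.tolist())       # ['state', 'town']
```

The "Unnamed: 0" column is how the duplicated index shows up when the header of the first column is empty.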
01:16 Let’s go here, press up (↑) to get the last command. Control + Enter to run. There we go. Now, the thing is to find a way to get rid of these square brackets, as well as the newline here.
01:32 What’s the strategy here? This is a perfect time for the .assign() method. Then you’ll find relevant methods to help clean up the data. This is a common workflow in pandas: you have something you want to do, you can’t remember the exact method to use, so you do some documentation diving to find the right one.
01:50 You suspect this is likely some kind of string method. There are a bunch of string methods, so that will be a good first place to look.
01:58 Before getting into that, there is something that you should probably check. The assumption is that this string, "[edit]" in square brackets, is in every single value of the state column.
02:12 A way to start exploring the string methods is to first select the whole state column.
02:23 So that’s using .loc[] with the slice index (:) to select every row, and then the "state" label to select the state column. That’ll give you every row of that column.
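The selection described here can be sketched roughly as follows. The towns frame is a hypothetical sample rather than the real data, so only the .loc[] call itself mirrors the lesson.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real towns data.
towns = pd.DataFrame(
    {
        "state": ["Alabama[edit]\n", "Alaska[edit]\n"],
        "town": ["Auburn\n", "Fairbanks\n"],
    }
)

# ":" selects every row, and "state" selects a single column by label,
# so the result is a Series rather than a DataFrame.
state = towns.loc[:, "state"]

print(type(state).__name__)  # Series
print(len(state))            # 2
```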
02:32 So now, string methods. They are available on Series, like this one. You can access them with the .str accessor, and here you’ll get a bunch of suggestions. These are in the documentation too. If the suggestions are not coming up, try pressing Control + Space to get them.
02:52 And here is one that looks good: .contains(). This actually takes a regex. You need to escape the square brackets so that they’re matched as literal square brackets instead of being interpreted as regex syntax.
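To see why the escaping matters, here is a small sketch on a made-up Series. The value "Texas" is an assumption added purely to show the difference. Unescaped, [edit] is a regex character class that matches any single e, d, i, or t.

```python
import pandas as pd

# Made-up values: one with the "[edit]" marker, one without it.
state = pd.Series(["Alabama[edit]\n", "Texas"])

# Unescaped: a character class, so "Texas" matches because it contains "e".
unescaped = state.str.contains("[edit]")

# Escaped: the square brackets are matched literally.
escaped = state.str.contains(r"\[edit\]")

# Alternative: switch regex off and search for the plain substring.
literal = state.str.contains("[edit]", regex=False)

print(unescaped.tolist())  # [True, True]
print(escaped.tolist())    # [True, False]
print(literal.tolist())    # [True, False]
```

The raw string (the r prefix) keeps Python from treating the backslashes as its own escape sequences.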
03:11 And now look at what this produces: a Series of True and False values. Now to check if all of them are true, you can chain on .all(), which will evaluate whether all of them are truthy. And that’s true.
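A rough sketch of that check, again on hypothetical values: the first Series has the marker everywhere, and the second has one entry without it to show what a failed check would look like.

```python
import pandas as pd

# Hypothetical values where every entry carries the "[edit]" marker.
all_marked = pd.Series(["Alabama[edit]\n", "Alaska[edit]\n"])

# Hypothetical counter-example with one entry missing the marker.
one_missing = pd.Series(["Alabama[edit]\n", "Texas"])

# .all() collapses the Series of booleans into a single value.
print(all_marked.str.contains(r"\[edit\]").all())   # True
print(one_missing.str.contains(r"\[edit\]").all())  # False
```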
03:31 So now you can assume that all the values in the state column have this string, so it should be quite easy to just remove that string from all of them.
03:44 Now that you have a good idea of what you need to do to clean the university towns data, in the next lesson, you’ll be looking at the .assign() method to carry out your cleaning operations.