Cleaning Text Based on Rules
00:07 The column that you’re looking at in this lesson is the place of publication. It’s quite an inconsistent column, but you can set up rules to replace values according to one or more Boolean tests.
So you can say, if this condition is true, then you use this value. In this case, the place of publication has values like this: let’s say
Virtue & Yorston, and you just want this to be
You just want to grab that city name. If there’s London in the title, then it should just be
London. Likewise, Oxford—this has a lot of extra information that you don’t want, so you can just get that to
Oxford. As a side note here, you’ll notice that there’s a date here, so maybe some of the values that are still
0 in the date of publication column can be taken from here. That won’t be covered in this course, but this will just sort of lose this information, but that’s something to be aware of if you’re ever doing some data cleaning in real life.
So these represent different rules, so one of them is if the value contains London or Oxford, they should be
Oxford, respectively, or if the value contains dashes, then you should just replace those with spaces.
Run that, see what we’ve got there. So immediately you can see there’s one that has
London. Let’s also look at whether some of them contain—
str.contains()—and then we’ll look at the dashes. Okay, but that’s not going to be immediately useful because it’s just going to return a Boolean array. However, you can use that Boolean array to filter out a series.
You can press up (↑) to get the old command and then wrap the whole thing in another
books.loc. Open the square brackets, and you’re passing in the Boolean array produced by the last one, and you just need to select the
"place-of-publication" column again.
You’ve got the previous command there where you’re looking for the dash within the column of
"place-of-publication", which returns a Boolean array, which you’re passing into the first argument of another location indexer and getting the
So now you can see all the values that have dashes in the name. So there’s quite a few there. You think that most of these, you can just suppress these, like
New-York doesn’t need a dash in there at all.
Now you have a bit of an idea of what kind of stuff you’re going to clean. For London, you want to replace the whole string with just
London. For dashes, you just want to replace the dash with a space.
In the same way as date of publication, you are just going make another, since this is all very similar,
place_of_publication. This is the column as a keyword argument, and you’ll make a function called
clean_place_of_publication(), which doesn’t exist yet, so you’re going to define it down here.
def clean_place_of_publication(). It’s also going to accept
books DataFrame, and it’s going to return
books, except it’s going to return just one column, which will represent the place of publication.
And as usual for London, what you’ll want is a
".*London.*". So what this means is that it can have any number of characters or no number of characters here before London or any number of characters or no characters after London, which basically means if this string contains
"London" anywhere, you’ll want to replace that just with
Run this and see what happens. And there’s an error:
Perhaps you forgot a comma? That’s a very helpful error. Okay, yep, there’s definitely the missing comma here. That looks good. So now run again.
See what’s going on.
The default—this is a warning—
The default value of regex
will change from True to False in a future version. Okay, so what you’ll want to do here is just to futureproof that is add in a
regex=True, because you are using a regex.
05:39 So that’s operating on the DataFrame and returning it, and now you’re chaining another one onto the result of this. So just copy the whole thing here. And the regex, in this case, will be very simple.
It’ll just be a dash and a space. Now by default,
str.replace() will replace all the occurrences, but you can pass an
n value in here to say, just replace
1 instance of that. But for this case, you want all the values here. Try and run this.
So here instead of
str.contains("-"), you’re looking for
str.contains("["). Oh, but you’ll probably have to escape that square bracket. Yes, saying
unterminated, which usually means that you have to escape a square bracket. Okay.
Paris. There’s one hundred forty-six of these values that have square brackets that are there to clean, but you can sort of whittle away at them with different of these
This column accepts the DataFrame and it chaings on a bunch of
str.replace() operations here, replacing if it contains
"London". Now, clearly there’s a few ways to do this. You could chain a
str.contains(), get a Boolean function, select all of them, but this is a much more concise way of doing it. And it expresses a lot better what you want to do.
Remember there’s various ways to do something in pandas. You want to choose the most specific way, because that will be, at the very least, the most optimized. There is, for example, just a plain
replace() function without the
str, but that’s a sort of general Swiss Army knife, all-purpose
replace() that can work on strings or numbers or almost anything.
09:22 In this lesson, you’ve cleaned text based on rules. And this has been the last lesson of Data Cleaning With pandas. In the next lesson, you’re going to review everything that you’ve done so far.
Become a Member to join the conversation.