Cleaning Text Based on Rules

Data Cleaning With pandas and NumPy Ian Currie 09:33

Transcript
Discussion

00:00 In this last lesson of the books dataset, you’ll be cleaning text based on rules.

00:07 The column that you’re looking at in this lesson is the place of publication. It’s quite an inconsistent column, but you can set up rules to replace values according to one or more Boolean tests.

00:18 So you can say, if this condition is true, then you use this value. In this case, the place of publication has values like this: let’s say London; Virtue & Yorston, and you just want this to be London.

00:31 You just want to grab that city name. If there’s London in the title, then it should just be London. Likewise, Oxford—this has a lot of extra information that you don’t want, so you can just get that to Oxford. As a side note here, you’ll notice that there’s a date here, so maybe some of the values that are still 0 in the date of publication column can be taken from here. That won’t be covered in this course, but this will just sort of lose this information, but that’s something to be aware of if you’re ever doing some data cleaning in real life.

01:04 And here you also want any hyphens (-) or dashes to be replaced by spaces, so Newcastle-upon-Tyne should just be Newcastle upon Tyne.

01:13 So these represent different rules, so one of them is if the value contains London or Oxford, they should be London or Oxford, respectively, or if the value contains dashes, then you should just replace those with spaces.

01:29 Now to clean the place of publication, first off, you’re going to explore the data a little bit. So run the code as it is, get your books, and let’s select the "place-of-publication" column.

01:51 Run that, see what we’ve got there. So immediately you can see there’s one that has London. Let’s also look at whether some of them contain—str.contains()—and then we’ll look at the dashes. Okay, but that’s not going to be immediately useful because it’s just going to return a Boolean array. However, you can use that Boolean array to filter out a series.

02:19 You can press up (↑) to get the old command and then wrap the whole thing in another books.loc[]. Open the square brackets, and you’re passing in the Boolean array produced by the last one, and you just need to select the "place-of-publication" column again.

02:42 You’ve got the previous command there where you’re looking for the dash within the column of "place-of-publication", which returns a Boolean array, which you’re passing into the first argument of another location indexer and getting the "place-of-publication" column.

02:58 So now you can see all the values that have dashes in the name. So there’s quite a few there. You think that most of these, you can just suppress these, like New-York doesn’t need a dash in there at all.

03:12 Now you have a bit of an idea of what kind of stuff you’re going to clean. For London, you want to replace the whole string with just London. For dashes, you just want to replace the dash with a space.

03:23 In the same way as date of publication, you are just going make another, since this is all very similar, place_of_publication. This is the column as a keyword argument, and you’ll make a function called clean_place_of_publication(), which doesn’t exist yet, so you’re going to define it down here.

03:43 def clean_place_of_publication(). It’s also going to accept books DataFrame, and it’s going to return books, except it’s going to return just one column, which will represent the place of publication.

04:02 The first thing to do here is to select the whole column,

04:08 as usual … "place_of_publication". I’m going to wrap these in parentheses just so we can put this on newlines for readability sake. You can use the str.replace() functionality here.

04:26 And as usual for London, what you’ll want is a ".*London.*". So what this means is that it can have any number of characters or no number of characters here before London or any number of characters or no characters after London, which basically means if this string contains "London" anywhere, you’ll want to replace that just with "London".

04:55 Run this and see what happens. And there’s an error: Perhaps you forgot a comma? That’s a very helpful error. Okay, yep, there’s definitely the missing comma here. That looks good. So now run again.

05:10 See what’s going on. The default—this is a warning—The default value of regex will change from True to False in a future version. Okay, so what you’ll want to do here is just to futureproof that is add in a regex=True, because you are using a regex.

05:28 Back to here, books.head(), and let’s look at what’s happened here. Okay, so that first value that had a bunch of London has been replaced. Okay so now you can just chain another one onto here.

05:39 So that’s operating on the DataFrame and returning it, and now you’re chaining another one onto the result of this. So just copy the whole thing here. And the regex, in this case, will be very simple.

05:54 It’ll just be a dash and a space. Now by default, str.replace() will replace all the occurrences, but you can pass an n value in here to say, just replace 1 instance of that. But for this case, you want all the values here. Try and run this.

06:14 Let’s see.

06:17 Now you’re going to navigate using the up and down arrows to the old command here, which is looking for any value that contains a dash. And there are none. That’s good.

06:29 So you’ve cleaned up "London" and you’ve cleaned up all the dashes. So what’s left? Let’s see. Let’s get that same command here, but instead of a dash, look for a square bracket.

06:45 So here instead of str.contains("-"), you’re looking for str.contains("["). Oh, but you’ll probably have to escape that square bracket. Yes, saying unterminated, which usually means that you have to escape a square bracket. Okay.

07:00 Here’s a few of them. Plymouth, Cambridge, Oxford, Portland, Boston [Mass.], Boston [U.S.].

07:06 Choose Oxford because there’s Cambridge, Massachusetts, and there’s Cambridge, U.K. So choose Oxford and Plymouth.

07:16 What you can do here is just copy this

07:22 and replace these values with "Oxford"

07:29 and "Plymouth".

07:36 And now run this. That’s run. Now we’ll look for any more square brackets … and that has worked because now there’s no longer any Oxfords in here. There’s Cambridge [Mass.], Portland [Maine].

07:55 Madrid, Neuva York, Paris. There’s one hundred forty-six of these values that have square brackets that are there to clean, but you can sort of whittle away at them with different of these .replace() rules.

08:09 And again, just to recap, the .assign() statement here, you’ve done the date_of_publication. You’ve passed in a function that accepts the DataFrame, cleans up the dates.

08:19 And now you’re done the same here with place_of_publication. This keyword argument represents the column. The result of this function will be the new column.

08:28 This column accepts the DataFrame and it chaings on a bunch of str.replace() operations here, replacing if it contains "London". Now, clearly there’s a few ways to do this. You could chain a str.contains(), get a Boolean function, select all of them, but this is a much more concise way of doing it. And it expresses a lot better what you want to do.

08:52 Remember there’s various ways to do something in pandas. You want to choose the most specific way, because that will be, at the very least, the most optimized. There is, for example, just a plain replace() function without the str, but that’s a sort of general Swiss Army knife, all-purpose replace() that can work on strings or numbers or almost anything.

09:12 And you, in general, want to go for the most specific, again just for reliability and because it will be optimized for that operation.

09:22 In this lesson, you’ve cleaned text based on rules. And this has been the last lesson of Data Cleaning With pandas. In the next lesson, you’re going to review everything that you’ve done so far.

Become a Member to join the conversation.