Exploring the Books Dataset

Data Cleaning With pandas and NumPy Ian Currie 05:08

00:00 In this lesson, you’ll be moving on and doing some initial exploration of the books dataset, the third and final dataset of this course. This data is typical of what a library might have: title, author, place of publication, year of publication, that type of thing. However, in this dataset, there are many NaN values.

00:20 These NaN values stand for not a number, which is the default marker that pandas uses when a value is missing or can’t be read. You also have many columns that are pretty much full of these NaN values, so you don’t really need them.
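To see where the NaN values come from, here is a minimal sketch that reads a tiny in-memory CSV. The three rows and the choice of columns are made up for illustration and are not taken from the course file.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the real file has many more rows and columns.
raw_csv = io.StringIO(
    "Identifier,Title,Engraver\n"
    "206,Walter Forbes,\n"
    "216,All for Greed,\n"
    "218,,\n"
)

sample = pd.read_csv(raw_csv)

# Empty fields between commas are read as NaN.
print(sample)

# Count the missing values in each column.
print(sample.isna().sum())
```

In this sample, the Engraver column is empty in every row, so it comes back as all NaN.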

00:35 You’re also not sure if the identifier is unique, and some of the column formats are not consistent. So here’s a quick look at the raw data. Let’s turn off word wrap.

00:50 Okay, and as you can see, it’s got a header row. There are a lot of blank values, especially in places like this, where there are lots of commas together.

01:01 As you can see, there are just blank values. So these are likely columns that you won’t need.

01:09 Okay, so you have the basic boilerplate here. Control + Enter to run that … and it’s connecting to the kernel …

01:21 working away … and it’s done. So now, if you look at books.head(), Control + Enter to run it, and there you go: it’s done pretty well by itself, but as you can see, there are lots of NaN values.

01:32 There are also these extra bits to a lot of the columns. As you can see, most of the values are just one place, but then this has some extra information that really is just noise in this case, and you’ll want to clean that up. Likewise, with date of publication, there are lots of these square bracket ([]) things going on.

01:50 And also there are a lot of columns that seem to be entirely NaN values. So there’s nothing very interesting there, and you’re not really interested in keeping them.
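One way to confirm which columns are entirely NaN, rather than judging by eye from .head(), is to look at the share of missing values per column. This is only a sketch on a made-up DataFrame; the lesson itself doesn't run this check.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the books DataFrame.
books = pd.DataFrame(
    {
        "Identifier": [206, 216, 218],
        "Place of Publication": ["London", "London", "London"],
        "Corporate Author": [np.nan, np.nan, np.nan],
        "Engraver": [np.nan, np.nan, np.nan],
    }
)

# isna() gives True/False per cell; the mean of each column is its NaN share.
nan_share = books.isna().mean()
print(nan_share)

# Columns where every single value is missing.
fully_empty = nan_share[nan_share == 1.0].index.tolist()
print(fully_empty)
```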

02:03 The first thing you want to do is some renaming with the .rename() method. Now you can do it like you did before, where, say, you have Edition Statement: you write columns=, and then you pass it a dictionary that maps the existing title to the title that you want.
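The dictionary approach can be sketched like this. The DataFrame is a small made-up stand-in for the books data, and the name renamed is only used here for illustration.

```python
import pandas as pd

# Made-up sample with a few headers in their original form.
books = pd.DataFrame(
    {
        "Identifier": [206, 216],
        "Edition Statement": [None, None],
        "Place of Publication": ["London", "London"],
    }
)

# Map the existing header to the header you want; other columns are untouched.
renamed = books.rename(columns={"Edition Statement": "edition_statement"})

print(renamed.columns.tolist())
```

Note that .rename() returns a new DataFrame by default, so books itself keeps its original headers unless you assign the result back.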

02:36 And then take a look at this. And as you can see, it’s renamed the column Edition Statement into one with snake case, with no spaces and all lowercase. However, since all the columns are actually quite well named, and you don’t really want to change any of them, you just want them all to be in this sort of snake case format, you can actually pass in a lambda function here,

03:07 and this will pass each header into this function, and then you can transform it and return it however you want. Since it’s a string, you could call the .lower() method on it, which will send it to lowercase, and then you can call the .replace() method on the result of that, and you can replace any space with an underscore (_).
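As a rough sketch of that lambda, again on a made-up sample rather than the real file:

```python
import pandas as pd

# Made-up sample with a few headers in their original form.
books = pd.DataFrame(
    {
        "Identifier": [206, 216],
        "Edition Statement": [None, None],
        "Place of Publication": ["London", "London"],
        "Date of Publication": ["1879 [1878]", "1868"],
    }
)

# rename() calls the function once per column label and uses the return value.
books = books.rename(columns=lambda header: header.lower().replace(" ", "_"))

# The .columns attribute lists the resulting headers.
print(books.columns)
```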

03:30 So now let’s try running that. Up (↑) to get the last command. Control + Enter. And there you go. All the titles are now lowercase. To see that more clearly, you can call on the .columns attribute of the DataFrame to see all the column names, and as you can see, they’re all in a nice snake case format. However, there’s still one that you’d probably like to be shorter.

03:58 So how about you rename here? We’re just going to chain that onto the end and use a mini-dictionary

04:11 to change that to just "id".

04:17 Running here … and now let’s look at the columns here. Okay, so now we have all our columns renamed. You’ll notice that this is being reformatted to this sort of method-chaining style. It’s wrapped in parentheses, which allows this to be on separate lines, because usually it would have to be sort of chained on here.

04:42 And this is the typical way you’ll see a lot of pandas code written in this sort of long chain, which at the start is great because it just shows you every step that’s needed to clean the data before you actually do any science on it.
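Put together, the chained version might look like the sketch below. It assumes the column being shortened is identifier, which the transcript doesn't name explicitly, and it uses a made-up sample called raw_books in place of the real data.

```python
import pandas as pd

# Made-up sample; raw_books is a hypothetical name for the freshly loaded data.
raw_books = pd.DataFrame(
    {
        "Identifier": [206, 216],
        "Edition Statement": [None, None],
        "Place of Publication": ["London", "London"],
    }
)

# The outer parentheses let each step of the chain sit on its own line.
books = (
    raw_books
    .rename(columns=lambda header: header.lower().replace(" ", "_"))
    # Assumption: "identifier" is the header that gets shortened to "id".
    .rename(columns={"identifier": "id"})
)

print(books.columns.tolist())
```

The second .rename() has to use the lowercase header, because by that point in the chain the first step has already converted it.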

04:57 Now that you’re somewhat familiar with the books dataset, in the next lesson, you’ll learn how to drop columns. That is, how to get rid of columns that you’re not interested in.

Udit Anadkat on Dec. 9, 2022

Hi Everyone, I am trying to read the CSV file for the books dataset, but when I run the file I get an error stating file not found. Could anyone help, please?

Martin Breuss RP Team on Jan. 2, 2023

Hi @Udit Anadkat. Did you download the sample code from the Supporting Material dropdown?

You can find the CSV file in data-sets/BL-Flickr-Images-Book.csv.
