Cleaning the Date Column
In this lesson, you’re working with data types. To learn more, check out the pandas dtypes documentation.
00:00 In this lesson, you’re going to be cleaning the date column of the books dataset. This column is formatted as a string, but what you want this column to be is a column of numbers representing the year.
00:12 The date column has a few things that need to be done to it for cleaning. The date of publication column has both numbers and strings. However, they are represented as a string within pandas. This isn’t very efficient.
00:26 If things are numbers, you want pandas to recognize them as numbers. You’ll also want to hold just a single year. There are some entries which have two dates, one in square brackets, or they have some extra information that is just noise at this point.
00:40 You’ll need a way to find the right number to interpret as a year to extract it. That should give you a clue as to what method we should be using. Then after that, you’ll want to convert the data type to a numerical type.
00:53
About DataFrame
data types: pandas will try to detect types automatically, but it’s not magic. Sometimes it’s not going to be able to detect the type, or it will find the all-encompassing type of object
, which can handle all sorts of values.
01:10 You can, after the fact, set data types manually, as long as pandas can easily convert it, so if you have numbers stored as strings and you ask pandas to store it as numbers, then it should be able to convert it, no problem.
01:23
There are many data types referred to as dtypes
, and the object
data type, as mentioned, is a catch-all, but it’s not as performant.
01:33 To learn more about data types, you can look at the User Guide, Essential basic functionality, and the dtypes section, the link for which will be in the course notes below.
01:45 Back to the data cleaning script here, and the next thing on the menu is the date column. Sometimes when you’re working these things out, because you’re not quite sure exactly how this is going to work out, you can just create a temporary cell here with a comment and a double percent sign.
02:00
The first thing you’ll want to do to start exploring what you’re going to be doing is to just select the data. Select the right column, which is "date_of_publication"
.
02:14 I’m pressing Control + Enter to run that. And that looks like the right column, and you can see there’s the data that we want to clean up: square brackets, commas, stuff in there. You’re just interested in the first four numbers there.
02:28
Since it’s a series, you can use the str.extract()
function as you’ve done before. The regex in this case, raw string and a capture group, because what the .extract()
does is it extracts each capture group into its own row. Here you’re going to be using the escape character (`) and
d` for digit, so that means a numerical character.
02:49
And then these curly braces ({}
) with a number in the middle mean 4
of the previous items. So that’s four digits here. So if you run this and have a look at the result, as you could see, it looks much cleaner. There’s four numbers of each, although there’s a lot of abbreviated stuff here and you don’t really want to check that one by one, so what you can do is there’s a special method called .isna()
, which will check if there’s any null
values in there.
03:17
Maybe there wasn’t any value there to start off with, or if str.extract()
failed for some reason, and then you can sum up those values, which will sum up all the true values.
03:27
So most of these will be false, but there’ll be some true ones if it failed in any case. So if you can run this, as you can see, there’s 183
.
03:37 Now that’s not so bad considering there’s 8,000 strings. So you can just leave that for now, although what you might want to do is to fill those numbers so they are actually some kind of number that makes sense so you can change this column into a numerical data type.
03:53
So you can .fillna()
with 0
, which will mean that if there is a value that is null
, then that will just be replaced with 0
. Running this … You can see that it’s still got the strings, but if you then
04:13
chain on this on the end and run it, you’ll see that there’s now 0
values that have null
in them.
04:24
Great. Okay, so now that we have all of these strings that represent numbers, you can use a method from the main library called pd.to_numeric()
, and then you can pass in all of this good stuff.
04:41
Okay, so now running this, and oh dear, there is an error here. What you can do is look at the bottom here and it says TypeError
. It must be a list, tuple, 1-d array, or Series
. Hmm.
04:55
So what’s going on here? Now, this is an interesting error. Just comment out this .to_numeric()
casting first,
05:06
and then have a look at this. So how about wrapping this in a type
to make sure that this is returning what we think it is? So the location indexer returns a series. You think str.extract()
returns a series?
05:23
Maybe that’s where the problem is coming from. So running this, and indeed it says that this is a core DataFrame. Why is this? Well, because str.extract()
will extract the capture groups.
05:38
So if we have three capture groups here, it will actually produce three columns. So the way to override this is to pass another keyword argument into the extract()
function here and say False
, because this is True
by default. So now when you run this, you should get a series, which is what you need to pass to .to_numeric()
.
06:07
So now you can take out the type
, uncomment these
06:18 and congratulations, you have converted all of these into an integer type. Now this is quite a large integer, and it’s an integer that is signed. So it means it can be negative numbers, and you don’t really want negative years.
06:33
That’s not really a valid entry. You don’t think there are any negative ones in here. So there is another argument. The second argument to numeric()
that you can pass is the downcast
argument, and this accepts a variety of string arguments, but you’re looking for the "unsigned"
one for now. And now if you run this,
06:57
you’ll see that this is a unsigned integer of 16 bits, which is a much smaller data type, so that will mean your operations will run much quicker here. That’s looking good. Now what you can do is actually, since this is quite a lot to put in here, what you can do is write a function. Let’s call it clean_date_of_publication()
, and it will accept books
, the books
DataFrame, and then this return the whole thing. Okay, so now this is ready to accept a DataFrame and return it cleaned. Well, return a date_of_publication
column, which means that now this can be used with the .assign()
.
07:44
So you can chain an .assign()
method onto here. date_of_publication
is the column as a keyword argument, and you can pass in the function directly.
08:01
To recap this, you are passing in the whole DataFrame via .assign()
. This then selects the date_of_publication
column. It applies a str.extract()
operation with expand=False
, just to make sure that it returns a series and not a DataFrame, because a str.extract()
may return various columns.
08:21
It will fill any null
values with 0
. Then it will wrap the whole thing in a call to pd.to_numeric()
, and it will pass a keyword argument so that it will pick the lowest unsigned data type that it can to hold these numbers.
08:38 So now if you run the whole data cleaning script …
08:43
and oh dear, there’s a error. So let’s look at what here. 'clean_date_of_publication' is not defined
. Okay, yes, because the function needs to go before calling it or else it won’t know where it is. So save, run.
09:00
That seems to have worked just fine. books.head()
.
09:08
And there you go. Now you have a relatively clean date_of_publication
.
09:15
So you’ve gone through, extracted the first set of numbers for the date column, filled in null
values, and converted the date column to an integer column. In the next lesson, you’re going to be cleaning text based on rules.
Become a Member to join the conversation.
shoebptl on Sept. 1, 2023
Is there any tradition in pandas to fill the null values with the rolling function rather then fillna or interpolate?