Handling Missing Data
00:00 When you’re working with real-world data, it’s very common for your dataset to contain missing values. pandas provides several ways to handle missing data in calculations, and several methods for filling in missing values with values you specify.
00:17 Now, there are several ways for you to represent when data is missing in a dataset. Let’s go ahead and create a DataFrame. It’s just going to contain one column.
00:28 We’ll call the label 'x' for the column and we’ll pass in a list. Maybe this is a list that you’re reading in from a file, and if you left this entry in this list empty, here we’re going to get an error.
00:44 What happens is we need a way to specify or tell pandas that there’s missing data.
00:51 Now, this could represent maybe a list of answers by users, and one of the users didn’t submit an answer, but we still want to represent that. So one way to do that is to use the nan object in NumPy.
01:06 In general, the NaN value in computer science means Not a Number. Now, you can also do this with the built-in function float() in Python: calling float('nan') creates a NaN value, not a number.
01:20 And also the math module has an object called nan. Now, we haven’t been using the math module, and since we have been using the NumPy module, let’s use NumPy’s nan object.
01:34 So if we run this instead, and then take a look at the DataFrame that we get, we see this NaN value, for Not a Number, printed to the screen.
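The DataFrame being described probably looks something like the sketch below. The name df_ and the values 1, 2, NaN, 4 come from later in the lesson; the exact code typed on screen is an assumption.

```python
import numpy as np
import pandas as pd

# One column labeled "x"; the third entry (row label 2) is missing
df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

print(df_)
# .isna() marks which entries are missing
print(df_["x"].isna().tolist())
```

Note that the column is stored as floating point, since NaN is itself a float value.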
01:44 Now, the dataset that you’re actually working with may contain missing values, and so you want to be able to encode this information in your dataset. Now, by default, if you start performing some statistical computations with this DataFrame (for example, similar to what we did in the previous lesson, say we want to compute the mean of the values in this DataFrame), then what pandas is going to do is ignore the NaN value.
02:11 So in this case, pandas ignored the NaN value. Here the values are 1, 2, and 4, the sum of those is 7, and there are three of them, so the mean was computed as 7 divided by 3, simply ignoring the NaN value.
02:26 Now, that’s the default behavior. You can pass in a keyword argument to tell pandas to not ignore any NaN values. This keyword argument is called skipna. The default value is True, so if you wanted the computation to take the NaN value into account, then you would pass in a value of False.
02:48 Now, whenever pandas, many other modules, or computations in general reach a NaN value, the entire computation comes out as NaN as well. In this case, that’s what’s happening. For example, if you just want to see what’s 1 plus 2 plus, say, np.nan, then you’re going to get a nan value.
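To make the two behaviors concrete, here is a small sketch comparing the default mean with skipna=False. The DataFrame is rebuilt so the snippet runs on its own; the result names (mean_skip, mean_keep, plain_sum) are illustrative only.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Default (skipna=True): NaN is ignored, so this is (1 + 2 + 4) / 3
mean_skip = df_.mean()

# skipna=False: the NaN takes part in the computation and the result is NaN
mean_keep = df_.mean(skipna=False)

# Plain arithmetic behaves the same way: NaN propagates
plain_sum = 1 + 2 + np.nan

print(mean_skip["x"])
print(mean_keep["x"])
print(plain_sum)
```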
03:10 In certain situations, you may want to replace or fill in a NaN value with some default value. In pandas, the method that does this is called .fillna(). Now, .fillna() has several options.
03:23 The default way to do this is to pass into the .fillna() method a single value to use for all of the NaN values. So for example, if you wanted to set all of the NaN values to 0, you would pass the value 0 to the keyword argument value.
03:42 By default, this returns a new DataFrame where all of the NaN values are replaced by the value that you’re passing in to the value keyword.
03:52 Or you can also pass the keyword argument inplace=True, and then that would return a value of None and replace all of the NaN values in the df_ DataFrame with 0, and so it would modify the DataFrame.
04:09 So, let me just keep it as inplace=False so that we get a new DataFrame.
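Here is a minimal sketch of both variants of .fillna() described above. The copy named copy_ is a hypothetical stand-in so that the in-place call does not disturb df_; the lesson itself calls the method on df_ directly.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Default: a new DataFrame comes back and df_ is left unchanged
filled = df_.fillna(value=0)
still_missing = int(df_["x"].isna().sum())

# inplace=True: the DataFrame the method is called on is modified directly
copy_ = df_.copy()
copy_.fillna(value=0, inplace=True)

print(filled)
print(still_missing)
print(copy_)
```

Keeping the default and assigning the result to a new name is usually the easier version to reason about, since the original data stays available.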
04:16 Now, sometimes instead of just setting a value of 0, maybe what you want is for the previous non-NaN value (in other words, the previous actual numerical value) to be copied over onto any NaN values that follow it.
04:35 So what I mean by that is we may want this NaN value to be replaced by the previous value that was a non-NaN value. We can do this instead using the method keyword argument and passing in 'ffill' (forward fill) as its value.
04:54 So this is a string; it names a method that takes the previous non-NaN value (in other words, the numeric value) and replaces any NaN values with that one.
05:04 So if we run this instead, the value of 2.0, which was the previous numeric value, is what’s being used to replace any NaN values. Instead, we may want to use the values that come after to replace NaN values. In other words, we would like to do a backwards fill. So in this case, the method would be 'bfill', and so the NaN value would be replaced by 4.0.
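The forward and backward fills can be sketched as follows. Recent pandas releases emit a FutureWarning for fillna(method=...), so this sketch uses the dedicated .ffill() and .bfill() methods instead; that substitution is mine, not what the lesson types.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Forward fill: copy the previous non-NaN value (2.0) into the gap
forward = df_.ffill()

# Backward fill: copy the next non-NaN value (4.0) into the gap
backward = df_.bfill()

print(forward["x"].tolist())   # [1.0, 2.0, 2.0, 4.0]
print(backward["x"].tolist())  # [1.0, 2.0, 4.0, 4.0]
```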
05:32 Another common way to fill in missing values is to use mathematical interpolation. Interpolation is for when you only have, say, a sample of what you’re measuring and you’re also interested in values of some variable at points where you weren’t able to measure. For example, if we wanted to fill in the missing value (let’s see that DataFrame again) so as to continue this pattern, we would fill it in with the value in between 2.0 and 4.0.
06:05 This would amount to what would be called linear interpolation. So if we called the .interpolate() method, this would return a new DataFrame where the missing values are obtained by interpolating between the value before and the value after the NaN value, in this case 2.0 and 4.0.
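As a small sketch of the linear interpolation just described, assuming the same one-column DataFrame (the name interpolated is illustrative):

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Linear interpolation is the default: the gap between 2.0 and 4.0
# is filled with the value halfway between them
interpolated = df_.interpolate()

print(interpolated["x"].tolist())  # [1.0, 2.0, 3.0, 4.0]
```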
06:28 Now, in certain situations, you may want to simply remove any row or any column that contains a NaN value. pandas provides the .dropna() method for this.
06:40 The default behavior of .dropna() is to remove any row that contains a NaN value. This is accomplished as well by passing in a value of 0 to the axis keyword. So whether you pass in 0 or not, if we run this,
06:55 we’re going to get a new DataFrame where that row (row label number 2) that had a NaN value is removed. Now, again, the axis keyword controls whether it’s the rows or the columns with a NaN value that are deleted.
07:09 The default for axis is 0. So again, we get that DataFrame where row number two is removed. Now, if we pass in a value of 1 to the axis keyword argument,
07:20 in this case, we’re going to get a DataFrame that has no data. That’s because our DataFrame had only one column, and that column had a NaN value. Now, regardless of whether you’re passing in a value of 1 or 0 to remove a row or column, you can pass in a value of True for the inplace keyword.
07:42 If we go ahead and, say, remove the row that has a NaN value, this would modify the df_ DataFrame in place, and so now it’s a DataFrame that has no rows containing a NaN value, and so we’ve got that row label number 2 gone.
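The three .dropna() calls walked through above might look like this sketch. The result names rows_dropped and cols_dropped are made up for the example.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Default (axis=0): drop every row that contains a NaN
rows_dropped = df_.dropna()

# axis=1: drop every column that contains a NaN; the only column goes away
cols_dropped = df_.dropna(axis=1)

# inplace=True: modify df_ itself instead of returning a new DataFrame
df_.dropna(inplace=True)

print(rows_dropped.index.tolist())  # [0, 1, 3]
print(cols_dropped.shape)           # (4, 0)
print(df_.index.tolist())           # [0, 1, 3]
```

Notice that the remaining rows keep their original labels, so label 2 is simply absent afterward.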
08:01 So, that’s a quick overview of a couple of the methods that pandas provides to work with missing data. Coming up next, we’re going to take a look at how you iterate over DataFrames, over the rows or the columns.
robertportelli on Oct. 21, 2023
FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
df_.fillna(method='ffill')
df_.fillna(method='bfill') == df_.bfill()