Handling Missing Data
00:00 When you’re working with real-world data, it’s very common for your dataset to contain missing values. pandas provides several ways to perform calculations on data with missing values, and several methods for filling in the missing data with values that you specify.
00:17 Now, there are several ways to represent missing data in a dataset. Let’s go ahead and create a DataFrame. It’s just going to contain one column. We’ll call the label 'x' for the column and we’ll pass in a list. Maybe this is a list that you’re reading in from a file. If you left an entry in this list empty, you’d get an error.
00:44 So we need a way to tell pandas that there’s missing data. Now, this could represent maybe a list of answers by users, and one of the users didn’t submit an answer, but we still want to represent that. One way to do that is to use the nan object in NumPy.
In general, the NaN value in computer science means Not a Number. Now, you can also do this with the built-in function float() in Python: passing it the string 'nan' creates a NaN value.
And also the math module has an object called nan. Now, we haven’t been using the math module, and since we have been using the NumPy module, let’s use NumPy’s nan.
So if we run this instead and then take a look at the DataFrame that we get, the missing entry is printed to the screen as NaN, for Not a Number.
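As a minimal sketch of the three spellings of NaN mentioned above, the following builds the one-column DataFrame. The list values [1, 2, NaN, 4] are an assumption reconstructed from the numbers quoted later in the lesson, not something shown on this page.

```python
import math

import numpy as np
import pandas as pd

# Three ways to get a floating-point NaN in Python.
candidates = [np.nan, float("nan"), math.nan]
print([math.isnan(c) for c in candidates])

# Leaving the slot empty, as in [1, 2, , 4], is a SyntaxError,
# so the gap has to be marked explicitly.
# Assumed values: reconstructed from the figures quoted in the lesson.
df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})
print(df_)
```

Because NaN is a float, pandas stores the whole column as floating point, which is why the other entries display as 1.0, 2.0, and 4.0.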
Now, the dataset that you’re actually working with may contain missing values, and so you want to be able to encode this information in your dataset. By default, if you start performing statistical computations with this DataFrame (for example, similar to what we did in the previous lesson, say we want to compute the mean of the values in this DataFrame), then pandas is going to ignore the NaN value. In this case, the remaining values are 1, 2, and 4, their sum is 7, and there are three of them, so the mean is computed as 7 divided by 3, simply ignoring the NaN value.
Now, that’s the default behavior. You can pass in a keyword argument to tell pandas not to ignore NaN values. This keyword argument is called skipna. The default value is True, so if you wanted the computation to take the NaN value into account, you would pass in a value of False.
Now, whenever pandas, or many other modules and computations in general, reach a NaN value, the entire computation comes out as NaN as well. In this case, that’s what’s happening. For example, if you just want to see what’s 2 plus, say, np.nan, then you’re going to get a NaN.
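To make the skipna behavior concrete, here is a small sketch that computes the mean both ways. It assumes the same illustrative DataFrame, with values [1, 2, NaN, 4].

```python
import numpy as np
import pandas as pd

# Assumed values: reconstructed from the figures quoted in the lesson.
df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# skipna=True is the default, so the mean is (1 + 2 + 4) / 3.
mean_skip = df_.mean()
print(mean_skip["x"])

# With skipna=False the NaN takes part in the sum and the result is NaN.
mean_keep = df_.mean(skipna=False)
print(mean_keep["x"])

# Plain arithmetic propagates NaN in the same way.
print(2 + np.nan)
```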
In certain situations, you may want to replace or fill in a NaN value with some default value. In pandas, the method that does this is called .fillna(), and it has several options.
The default way to do this is to pass the .fillna() method a single value that replaces all of the NaN values. So for example, if you wanted to set all of the NaN values to 0, you would pass 0 to the keyword argument value. By default, this returns a new DataFrame where all of the NaN values are replaced by the value that you’re passing in to the .fillna() method.
Or you can also pass the keyword argument inplace=True, and then that would return a value of None and replace in the df_ DataFrame all of the NaN values with 0, and so it would modify the DataFrame.
So, let me just keep it as inplace=False so that we get a new DataFrame.
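The difference between the two modes of .fillna() can be sketched as follows, again assuming the illustrative values [1, 2, NaN, 4]. The name scratch is a hypothetical working copy, used so that df_ itself stays untouched.

```python
import numpy as np
import pandas as pd

# Assumed values: reconstructed from the figures quoted in the lesson.
df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Default (inplace=False): a new DataFrame comes back, df_ is unchanged.
filled = df_.fillna(value=0)
print(filled["x"].tolist())
print(df_["x"].isna().sum())  # the original still has its one NaN

# inplace=True modifies the object the method is called on.
scratch = df_.copy()  # hypothetical working copy
scratch.fillna(value=0, inplace=True)
print(scratch["x"].tolist())
```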
Now, sometimes instead of just setting a value of 0, maybe what you want is to carry the previous non-NaN value (in other words, the previous actual numerical value) forward so that it gets copied over onto any NaN value that follows. So what I mean by that is we may want this NaN value to take on the previous value that was a non-NaN value. We can do this using the method keyword argument and passing in the string 'ffill' (forward fill). This method takes the previous non-NaN value, or in other words the previous numeric value, and replaces any NaN values with that one.
So if we run this instead, the value of 2.0, which was the previous numeric value, is what’s being used to replace the NaN value. Instead, we may want to use the values that come after to replace NaN values. In other words, we would like to do a backward fill. So in this case, the method would be 'bfill', and so the NaN value would be replaced by 4.0.
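Here is a short sketch of forward and backward fill on the assumed values [1, 2, NaN, 4]. It uses the dedicated .ffill() and .bfill() methods rather than the method keyword described in the lesson, because newer pandas releases deprecate that keyword on .fillna(); the results are intended to be the same.

```python
import numpy as np
import pandas as pd

# Assumed values: reconstructed from the figures quoted in the lesson.
df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Forward fill: the NaN takes the previous non-NaN value (2.0).
forward = df_.ffill()
print(forward["x"].tolist())

# Backward fill: the NaN takes the next non-NaN value (4.0).
backward = df_.bfill()
print(backward["x"].tolist())
```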
Another common way to fill in missing values is to use mathematical interpolation. Interpolation is useful when you only have a sample of what you’re measuring and you’re also interested in values of some variable at points where you weren’t able to measure. For example, if we wanted to fill in the missing value (let’s see that DataFrame again) so that it continues the pattern, we would fill in the value in between 2 and 4. This amounts to what’s called linear interpolation. So if we call the .interpolate() method, it returns a new DataFrame where the missing values are obtained by interpolating between the value that precedes and the value that follows the NaN value: in this case, 3.0.
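As a sketch of linear interpolation on the assumed values [1, 2, NaN, 4], the missing entry sits halfway between its neighbors, so it becomes 2 + (4 - 2) / 2 = 3.

```python
import numpy as np
import pandas as pd

# Assumed values: reconstructed from the figures quoted in the lesson.
df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# The default interpolation method is linear, so the NaN between
# 2.0 and 4.0 becomes their midpoint.
interpolated = df_.interpolate()
print(interpolated["x"].tolist())
```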
Now, in certain situations, you may want to simply remove any row or any column that contains a NaN value. pandas provides the .dropna() method for this.
The default behavior of .dropna() is to remove any row that contains a NaN value. This is the same as passing in a value of 0 to the axis keyword. So whether you pass in 0 or not, if we run this, we’re going to get a new DataFrame where that row (row label number 2) that had a NaN value is removed. Now, again, the axis keyword controls whether it’s rows or columns that are deleted, the ones that have a NaN value.
The default for axis is 0. So again, we get that DataFrame where row number 2 is removed. Now, if we pass in a value of 1 to the axis keyword argument, in this case, we’re going to get a DataFrame that has no data. That’s because our DataFrame had only one column, and that column had a NaN value. Now, regardless of whether you’re passing in a value of 0 or 1 to remove a row or column, you can pass in a value of True for the inplace keyword argument. If we go ahead and, say, remove the row that has a NaN value, this would modify the df_ DataFrame in place, and so now it’s a DataFrame that has no rows containing a NaN value, and row label number 2 is gone.
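The three .dropna() variants discussed above can be sketched together, assuming the illustrative values [1, 2, NaN, 4]. The name scratch is a hypothetical working copy, so the in-place call does not disturb df_.

```python
import numpy as np
import pandas as pd

# Assumed values: reconstructed from the figures quoted in the lesson.
df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# axis=0 (the default) drops rows that contain NaN: row label 2 goes away.
rows_dropped = df_.dropna()
print(rows_dropped.index.tolist())

# axis=1 drops columns that contain NaN; the only column has one,
# so the result keeps the index but holds no data.
cols_dropped = df_.dropna(axis=1)
print(cols_dropped.shape)

# inplace=True removes the row from the object itself.
scratch = df_.copy()  # hypothetical working copy
scratch.dropna(inplace=True)
print(scratch.index.tolist())
```

Note that the surviving rows keep their original labels (0, 1, and 3); pandas does not renumber them automatically.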
08:01 So, that’s a quick overview of a couple of the methods that pandas provides to work with missing data. Coming up next, we’re going to take a look at how you iterate over DataFrames, over the rows or the columns.