# Handling Missing Data

**00:00**
When you’re working with real-world data, it’s very common for your dataset to contain missing values. pandas provides several ways to handle missing data in calculations, as well as several methods for filling in missing values.

**00:17**
Now, there are several ways for you to represent when data is missing in a dataset. Let’s go ahead and create a DataFrame. It’s just going to contain one column.

**00:28**
We’ll call the label for the column `'x'`, and we’ll pass in a list. Maybe this is a list that you’re reading in from a file. If you left one of the entries in this list empty, you’d get an error.

**00:44**
What happens is we need a way to specify or tell pandas that there’s missing data.

**00:51**
Now, this could represent maybe a list of answers by users, and one of the users didn’t submit an answer, but we still want to represent that. One way to do that is to use the `nan` object in NumPy.

**01:06**
In general, the `NaN` value in computer science means *Not a Number*. Now, you can also do this by passing the string `'nan'` to the built-in function `float()` in Python, and that would create a `NaN` value, not a number.

**01:20**
And also the `math` module has an object called `nan`. Now, we haven’t been using the `math` module, and since we have been using the NumPy module, let’s use NumPy’s `nan` object.
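
As a minimal sketch of the three spellings just mentioned, assuming NumPy is installed, the following puts them side by side. The variable name `candidates` is only for illustration.

```python
import math

import numpy as np

# Three spellings of the same floating-point NaN value
candidates = [np.nan, float("nan"), math.nan]

for value in candidates:
    # NaN never compares equal to anything, itself included,
    # so math.isnan() is the reliable way to detect it
    print(type(value).__name__, math.isnan(value), value == value)
```

Each line should report a plain Python `float` that `math.isnan()` recognizes and that is not equal to itself.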

**01:34**
So if we run this instead, and then take a look at the DataFrame that we get, we see this `NaN` value, for *Not a Number*, printed to the screen.
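
Here is a short sketch of that DataFrame. The list values `1`, `2`, `NaN`, `4` are inferred from the numbers mentioned later in the lesson, so treat them as an assumption about what was typed on screen.

```python
import numpy as np
import pandas as pd

# One column labeled 'x'; the third entry is marked as missing with np.nan
df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})
print(df_)
```

Because `NaN` is a floating-point value, pandas stores the whole column as floats, which is why the other entries show up as `1.0`, `2.0`, and `4.0`.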

**01:44**
Now, the dataset that you’re actually working with may contain missing values, and so you want to be able to encode this information in your dataset. By default, if you start performing statistical computations with this DataFrame (for example, computing the mean of its values, similar to what we did in the previous lesson), pandas is going to ignore the `NaN` value.

**02:11**
So in this case, pandas ignored the `NaN` value. Here the values are `1`, `2`, and `4`, and their sum is `7`. There were three of those values, so the mean was computed as `7` divided by `3`, simply ignoring the `NaN` value.
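
The arithmetic above can be checked with a small sketch, assuming the same one-column DataFrame with values `1`, `2`, `NaN`, `4`:

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# skipna=True is the default, so NaN is left out of both the sum and the count
mean_x = df_["x"].mean()
print(df_["x"].sum())    # 7.0
print(df_["x"].count())  # 3 non-missing values
print(mean_x)            # about 2.333, i.e. 7 / 3
```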

**02:26**
Now, that’s the default behavior. You can pass in a keyword argument to tell pandas not to ignore any `NaN` values. This keyword argument is called `skipna`. The default value is `True`, so if you wanted the computation to take the `NaN` value into account, then you would pass in a value of `False`.

**02:48**
Now, whenever pandas, many other modules, or computations in general reach a `NaN` value, the entire computation comes out as `NaN` as well. In this case, that’s what’s happening. For example, if you just want to see what’s `1` plus `2` plus, say, `np.nan`, then you’re going to get a `nan` value.
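
A brief sketch of both points, again assuming the DataFrame holds `1`, `2`, `NaN`, `4`. The names `strict_mean` and `total` are illustrative only.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# With skipna=False the NaN takes part in the computation and propagates
strict_mean = df_.mean(skipna=False)
print(strict_mean)

# The same propagation happens in plain arithmetic
total = 1 + 2 + np.nan
print(total)  # nan
```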

**03:10**
In certain situations, you may want to replace or fill in a `NaN` value with some default value. In pandas, the method that does this is called `.fillna()`. Now, `.fillna()` has several options.

**03:23**
The default way to do this is to pass into the `.fillna()` method the value that should replace all of the `NaN` values. So for example, if you wanted to set all of the `NaN` values to `0`, you would pass `0` to the keyword argument `value`.

**03:42**
By default, this returns a new DataFrame where all of the `NaN` values are replaced by the value that you’re passing in to the `value` keyword.
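
As a sketch of this default behavior, assuming the same example DataFrame (the name `filled` is hypothetical):

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Returns a new DataFrame; df_ itself still contains the NaN
filled = df_.fillna(value=0)
print(filled)
print(df_["x"].isna().sum())  # 1, the original is unchanged
```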

**03:52**
Or you can also pass the keyword argument `inplace=True`. That would return a value of `None` and replace all of the `NaN` values in the `df_` DataFrame with `0`, so it would modify the DataFrame itself.

**04:09**
So, let me just keep it as `inplace=False` so that we get a new DataFrame.
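
For comparison, here is a hedged sketch of the in-place variant. It works on a copy (the hypothetical name `work`) so that the lesson's `df_` stays intact, and it deliberately ignores the return value of the call, since only the modification matters here.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Work on a copy so the original DataFrame keeps its NaN
work = df_.copy()

# inplace=True modifies `work` directly instead of handing back a new DataFrame
work.fillna(value=0, inplace=True)

print(work)
print(df_)
```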

**04:16**
Now, sometimes instead of just setting a value of `0`, maybe what you want is for the previous non-NaN value (in other words, the previous actual numerical value) to get copied over onto any `NaN` values.

**04:35**
So what I mean by that is we may want this `NaN` value to take on the previous value that was non-NaN. We can do this using the `method` keyword argument instead, passing in `'ffill'` (forward fill) as its value.

**04:54**
So this is a string that names a method: it takes the previous non-NaN value, or in other words the previous numeric value, and replaces any `NaN` values with that one.

**05:04**
So if we run this instead, the value of `2.0`, which was the previous numeric value, is what’s being used to replace any `NaN` values. Alternatively, we may want to use the values that come after to replace `NaN` values. In other words, we would like to do a backward fill. So in this case, the method would be `'bfill'`, and so the `NaN` value would be replaced by `4.0`.
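
Here is a sketch of both fills on the example DataFrame. Note the swap: recent pandas releases emit a `FutureWarning` for `.fillna(method=...)`, so this uses the dedicated `.ffill()` and `.bfill()` methods instead of the `method` keyword shown in the lesson. The names `forward` and `backward` are illustrative.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Forward fill: the NaN takes the previous non-missing value (2.0)
forward = df_.ffill()

# Backward fill: the NaN takes the next non-missing value (4.0)
backward = df_.bfill()

print(forward)
print(backward)
```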

**05:32**
Another common way to fill in missing values is to use mathematical interpolation. Interpolation is what you use when you only have a sample of what you’re measuring and you’re also interested in values at points where you weren’t able to measure. For example, if we wanted to fill in the missing value (let’s see that DataFrame again), continuing the pattern would mean filling it in with the value in between `2.0` and `4.0`.

**06:05**
This amounts to what’s called linear interpolation. So if we call the `.interpolate()` method, it returns a new DataFrame where the missing values are obtained by interpolating between the value that precedes and the value that follows the `NaN` value, in this case `2.0` and `4.0`.
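
A minimal sketch of that call, assuming the same example values. With no arguments, `.interpolate()` uses linear interpolation, so the gap between `2.0` and `4.0` should be filled with their midpoint.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Linear interpolation between the neighbors 2.0 and 4.0
interpolated = df_.interpolate()
print(interpolated)
```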

**06:28**
Now, in certain situations, you may want to simply remove any row or any column that contains a `NaN` value. For this, pandas provides the `.dropna()` method.

**06:40**
The default behavior of `.dropna()` is to remove any row that contains a `NaN` value. You get the same result by passing in a value of `0` to the `axis` keyword. So whether you pass in `0` or not, if we run this,

**06:55**
we’re going to get a new DataFrame where that row (row label number `2`) that had a `NaN` value is removed. Now, again, the `axis` keyword controls whether it’s the rows or the columns containing a `NaN` value that are deleted.
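
To make the row case concrete, here is a sketch on the example DataFrame. The names `rows_default` and `rows_explicit` are hypothetical.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Both calls drop rows containing NaN; axis=0 is the default
rows_default = df_.dropna()
rows_explicit = df_.dropna(axis=0)

print(rows_default)
print(rows_default.equals(rows_explicit))  # True
```

The remaining row labels are `0`, `1`, and `3`; pandas keeps the original labels rather than renumbering them.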

**07:09**
The default for `axis` is `0`. So again, we get that DataFrame where the row with label `2` is removed. Now, if we pass in a value of `1` to the `axis` keyword argument,

**07:20**
in this case, we’re going to get a DataFrame that has no data. That’s because our DataFrame had only one column, and that column had a `NaN` value. Now, regardless of whether you’re passing in a value of `1` or `0` to remove a row or column, you can pass in a value of `True` for the `inplace` keyword.
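
The column case can be sketched the same way, assuming the one-column example DataFrame (the name `cols_dropped` is illustrative):

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# axis=1 drops every column that contains a NaN; 'x' is the only column
cols_dropped = df_.dropna(axis=1)

print(cols_dropped)
print(cols_dropped.shape)  # (4, 0): the index survives, but no columns remain
```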

**07:42**
If we go ahead and, say, remove the row that has a `NaN` value, this would modify the `df_` DataFrame in place. So now it’s a DataFrame that has no rows containing a `NaN` value, and the row with label `2` is gone.
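
As a final sketch, here is the in-place removal on the example DataFrame. As in the lesson, this changes `df_` itself, so the original rows can’t be recovered from it afterward.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Remove rows containing NaN directly in df_ rather than building a new DataFrame
df_.dropna(inplace=True)

print(df_)
print(df_.index.tolist())  # [0, 1, 3]
```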

**08:01**
So, that’s a quick overview of a couple of the methods that pandas provides to work with missing data. Coming up next, we’re going to take a look at how you iterate over DataFrames, over the rows or the columns.


robertportelli on Oct. 21, 2023

FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.

`df_.fillna(method='ffill')`

`df_.fillna(method='bfill') == df_.bfill()`