Handling Missing Data

Cesar Aguilar

The pandas DataFrame: Working With Data Efficiently Cesar Aguilar 08:16

Transcript
Discussion (1)

00:00 When you’re working with real-world data, it’s very common that your dataset will contain missing values. pandas provides several ways to handle missing data in calculations, and several methods to fill in the missing data with values that you specify.

00:17 Now, there are several ways for you to represent when data is missing in a dataset. Let’s go ahead and create a DataFrame. It’s just going to contain one column.

00:28 We’ll label the column 'x' and we’ll pass in a list. Maybe this is a list that you’re reading in from a file. If you leave one entry in this list empty, you’re going to get an error.

00:44 What happens is we need a way to specify or tell pandas that there’s missing data.

00:51 Now, this could represent maybe a list of answers by users, and one of the users didn’t submit an answer, but we still want to represent that. One way to do that is to use the nan object in NumPy.

01:06 In general, the NaN value in computer science means Not a Number. Now, you can also do this with the built-in function float() in Python: calling float('nan') creates a NaN value.

01:20 The math module also has an object called nan. Now, we haven’t been using the math module, and since we have been using the NumPy module, let’s use NumPy’s nan object.

01:34 So if we run this instead, and then take a look at the DataFrame that we get, the missing entry is printed to the screen as NaN, for Not a Number.
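As a minimal sketch of the steps described so far, the block below builds the one-column DataFrame with a missing entry. The values 1, 2, NaN, and 4 and the name df_ follow the lesson; the names candidates and all_nan are hypothetical helpers added only for illustration.

```python
import math

import numpy as np
import pandas as pd

# Three ways to spell a missing value; each one is a float NaN.
candidates = [np.nan, float("nan"), math.nan]
all_nan = all(math.isnan(value) for value in candidates)
print(all_nan)

# One column labeled 'x' with a missing entry at row label 2.
df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})
print(df_)

# NaN is a float, so the whole column is stored as floating point.
print(df_["x"].dtype)
```

Because NaN is a floating-point value, the integers in the column are displayed as 1.0, 2.0, and 4.0.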

01:44 Now, the dataset that you’re actually working with may contain missing values, and so you want to be able to encode this information in your dataset. By default, if you start performing statistical computations with this DataFrame, pandas is going to ignore the NaN value. For example, similar to what we did in the previous lesson, say we want to compute the mean of the values in this DataFrame.

02:11 So in this case, pandas ignored the NaN value. The remaining values are 1, 2, and 4. Their sum is 7 and there are three of them, so the mean is 7 divided by 3, or about 2.33. That’s how the mean was computed: by simply ignoring the NaN value.
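The default behavior can be checked with a short sketch like the following, assuming the same one-column DataFrame. The name default_mean is a hypothetical variable used only for illustration.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# .mean() skips NaN by default, so this computes (1 + 2 + 4) / 3.
default_mean = df_.mean()
print(default_mean)
```

The result is a Series with one entry per column, here just the entry for 'x'.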

02:26 Now, that’s the default behavior. You can pass in a keyword argument to tell pandas to not ignore any NaN values. And this keyword argument is called skipna. The default value is True, so if you wanted to include in the computation that there was a NaN value, then you would pass in a value of False.

02:48 Now, whenever pandas, or many other modules and computations in general, reach a NaN value, the entire computation comes out as NaN as well. In this case, that’s what’s happening. For example, if you just want to see what’s 1 plus 2 plus, say, np.nan, then you’re going to get a nan value.
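Here is a small sketch of the skipna keyword and of NaN propagating through plain arithmetic, again assuming the same DataFrame. The names strict_mean and total are hypothetical.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# skipna=False includes the NaN in the computation, so the mean is NaN.
strict_mean = df_.mean(skipna=False)
print(strict_mean)

# The same propagation happens in ordinary arithmetic.
total = 1 + 2 + np.nan
print(total)
```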

03:10 In certain situations, you may want to replace or fill in for a NaN value with some default value. And in pandas, the method that does this is called .fillna(). Now, .fillna() has several options.

03:23 The default way to do this is to pass into the .fillna() method a value that should replace all of the NaN values. So for example, if you wanted to set all of the NaN values to 0, you would pass 0 to the value keyword argument.

03:42 By default, this returns a new DataFrame where all of the NaN values are replaced by the value that you’re passing in to the value keyword.

03:52 Or you can also pass the keyword argument inplace=True. Then the call returns None and replaces all of the NaN values in the df_ DataFrame with 0, so it modifies the DataFrame itself.

04:09 So, let me just keep it as inplace=False so that we get a new DataFrame.
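The two modes of .fillna() can be sketched as follows, assuming the same DataFrame. The names filled and modified are hypothetical, and the in-place call is made on a copy only so that df_ stays available for comparison.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Default (inplace=False): a new DataFrame comes back, df_ keeps its NaN.
filled = df_.fillna(value=0)
print(filled)
print(df_)

# inplace=True: the DataFrame the method is called on is modified directly.
modified = df_.copy()
modified.fillna(value=0, inplace=True)
print(modified)
```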

04:16 Now, sometimes instead of just setting a value of 0, maybe what you want is to sort of continue with the previous non-NaN value—in other words, the previous actual numerical value—to sort of continue as the value that would get copied over onto any NaN values.

04:35 So what I mean by that is we may want this NaN value to be replaced by the previous value that was a non-NaN value. We can do this using the method keyword argument and passing in the string 'ffill' (forward fill).

04:54 So this string names a fill method that takes the previous non-NaN value, in other words the previous numeric value, and replaces any NaN values with that one.

05:04 So if we run this instead, the value of 2.0, which was the previous numeric value, is what’s being used to replace any NaN values. Instead, we may want to use the next non-NaN value to replace NaN values. In other words, we would like to do a backward fill. So in this case, the method would be 'bfill', and so the NaN value would be replaced by 4.0.
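Forward and backward filling can be sketched as below, assuming the same DataFrame. This sketch uses the dedicated .ffill() and .bfill() methods rather than the method keyword shown in the lesson, because recent pandas releases warn that .fillna(method=...) is deprecated. The names forward and backward are hypothetical.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Forward fill: the NaN takes the previous non-NaN value, 2.0.
forward = df_.ffill()
print(forward)

# Backward fill: the NaN takes the next non-NaN value, 4.0.
backward = df_.bfill()
print(backward)
```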

05:32 Another common way to fill in missing values is to use mathematical interpolation. Interpolation is useful when you only have, say, a sample of what you’re measuring and you’re also interested in values at points that you weren’t able to measure. For example, if we wanted to fill in the missing value (let’s see that DataFrame again) and continue this pattern, we would fill it in with the value in between 2.0 and 4.0.

06:05 This would amount to what would be called linear interpolation. So if we called the .interpolate() method, this would return a new DataFrame where the missing values are obtained by interpolating between the value before the NaN value and the value that follows it: in this case, 2.0 and 4.0.
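A minimal sketch of linear interpolation on the same DataFrame follows. The name interpolated is hypothetical, and the call relies on .interpolate() using linear interpolation by default.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# The NaN sits halfway between 2.0 and 4.0, so it is replaced by 3.0.
interpolated = df_.interpolate()
print(interpolated)
```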

06:28 Now, in certain situations, you may want to simply remove any row or any column that contains a NaN value. pandas provides the .dropna() method.

06:40 The default behavior of .dropna() is to remove any row that contains a NaN value. Passing a value of 0 to the axis keyword does the same thing. So whether you pass in 0 or not, if we run this,

06:55 we’re going to get a new DataFrame where the row with label 2, the one that had a NaN value, is removed. Now, again, the axis keyword controls whether the rows or the columns that contain a NaN value are deleted.

07:09 The default for axis is 0. So again, we get that DataFrame where row number two is removed. Now, if we pass in a value of 1 to the axis keyword argument,

07:20 in this case, we’re going to get a DataFrame that has no data. That’s because our DataFrame had only one column, and that column had a NaN value. Now, regardless of whether you’re passing in a value of 1 or 0 to remove a row or column, you can pass in a value of True for the inplace keyword.

07:42 If we go ahead and, say, remove the row that has a NaN value, this would modify the df_ DataFrame in place, and so now it’s a DataFrame that has no rows containing a NaN value, and the row with label 2 is gone.
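The .dropna() variants above can be sketched as follows, assuming the same DataFrame. The names rows_dropped, cols_dropped, and modified are hypothetical, and the in-place call is made on a copy so that df_ can be reused.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({"x": [1, 2, np.nan, 4]})

# Default axis=0: drop every row that contains a NaN (row label 2 here).
rows_dropped = df_.dropna()
print(rows_dropped)

# axis=1: drop every column that contains a NaN. The only column goes,
# leaving a DataFrame with the original row labels but no data.
cols_dropped = df_.dropna(axis=1)
print(cols_dropped)

# inplace=True modifies the DataFrame the method is called on.
modified = df_.copy()
modified.dropna(inplace=True)
print(modified)
```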

08:01 So, that’s a quick overview of a couple of the methods that pandas provides to work with missing data. Coming up next, we’re going to take a look at how you iterate over DataFrames, over the rows or the columns.

robertportelli on Oct. 21, 2023

FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.

df_.fillna(method='ffill')

df_.fillna(method='bfill') == df_.bfill()
