Filling Missing Values

Working With Missing Data in Polars Rodrigo Girão Serrão 08:45

00:00 Now you are able to identify missing data or missing values within your data. And the best case scenario is there’s no missing values. The second best case is you find those missing values, and then you are able to go back to the source of the data and fix the issue that caused those to appear in the first place.

00:20 This is the best possible outcome, but sometimes that’s not possible. And in that case, you will want to fill your missing data or replace those missing values with other values, and you will learn how to do so in this lesson. Go ahead and open a Jupyter Notebook.

00:40 You can keep on working on the notebook you were working in the previous lesson, or you can create a fresh notebook. And if you do, make sure you import Polars, SPL and read in the data that you are working with: tips.parquet file that you downloaded previously.

00:55 If you remember correctly, there’s three different columns with null values and those are the columns, “total”, “tip”, and “time”. And in this lesson, you will learn three different ways of replacing those null values with other reasonable, sensible values.

01:11 The first thing you’ll do is you’ll take a look at the missing values in the column total, just as a refresher, you want to see what those things look like.

01:21 So what are the rows where these values are missing? And in this case, you can see that the total value of the order is missing for two different rows. And depending on the context of what you’re doing with your data, you might think, well, maybe I can replace this with a zero, because it’s the worst case scenario is an order of $0.

01:45 So you might want to replace those values with a zero, or you might want to replace those order values with maybe the average order value. So depending on what you’re doing with your data, one of these might be reasonable.

01:58 So let’s assume you want to replace the null values in the column total with the number zero with a constant value. So you can do this in Polars. You can replace null values

02:11 with another constant value, in this case, the integer zero. So what you’ll do is you’ll be using the context with columns because you want to keep every single column the same.

02:22 You just want to modify the column total. And then you will refer to the column total and use the expression fill_null(), which has a pretty self-explanatory name.

02:34 And then you just pass in the value you want to put in those columns. You run this, you can see here that you are reassigning tips to tips so that we update the data.

02:45 You can run this. And then when you inspect tips, again, you can see that, for example, the row four no longer has the null value in the column “total”.

02:55 If you scroll up, you will see that it was there before,

02:59 but it’s no longer there. So now the column “total” has no null values.

03:06 The next thing you’ll do is fix the column “tip”.

03:10 You can easily see that there are four rows for which the column “tip” has no data.

03:22 These are the four rows. Now, you might think, well, for the purposes of statistics, it makes sense to replace these null values with an estimate of what the tip would’ve been.

03:35 You can, for example, compute the average tip percentage relative to the total order. You can compute that average percentage, and then you can fill these values based on the total of that specific order and the average tip value, this is something you can do in Polars.

03:52 So in Polars, you can replace

03:57 null values with values computed by an arbitrary expression. Obviously, you will want an expression that makes sense in your context, but you can pick the expression you need or want.

04:10 In this case, you will start by writing a short expression to compute the average tip percentage, which is essentially the ratio between the column of the tip and the column of the other total.

04:25 Then you take the average of that. So this is the average tip percentage. It’s an expression that computes that. And then you will want to update your tips DataFrame

04:35 by going into the column “tip”

04:38 and again, using the expression fill_null().

04:42 But this time, instead of typing here a constant value, you type an expression and Polars will use that expression to fill in the null values. So in this case, you want to take the column “total”

04:57 and you want to multiply the total by the average tip percentage, which should give you an estimate of the tip received for those orders. So you can run this, and if you inspect the DataFrame tips, it’s difficult to see what you did because the rows where the tip was missing were rows 4, 12, 18, 32, and for row 4, which is the only one that’s visible, you replace the null with a zero.

05:24 So it’s not very, it’s not a great example, but what you can do is to take a look at rows 12 and 18. You can use head() to grab the first 18 rows, and then tail() to grab the last seven of those first 18.

05:40 And now you can see on row 12, the tip was null, it’s now 2.88, and for row 18, you got the zero there. So this shows you how to use a dynamic arbitrary expression to compute values that you then replace with null values.

05:58 So these are two different ways, and there’s a third way that can be quite useful. Let’s take a look at the missing values in the column “time”.

06:08 So you take a look at the column “time”,

06:14 and you see that there are two rows for which the time is missing. Now, what you want to see here is that the context, the surrounding rows, might actually have a clue as to what you might want to do.

06:26 What you’ll do now is take a look at these record_ids and you will ask Polars to give you some rows around these values. So what you’ll do now is you’ll still filter to take a look at the columns, apologies at the rows, but this time you will filter by the record_id.

06:46 I was typing time, but you will filter by the record_id, and you want to check the record_id around 3 or around 15, so 2, 3, 4, and then 14, 15, 16, so that you can take a look at the surrounding context.

07:01 And looking at the surrounding context, it looks like rows are grouped by the time of the meal. So it might be reasonable to assume that this value should have been a dinner, and the second value should have been a lunch because that’s what they’re surrounded by.

07:22 And in that case, if that’s what makes sense in your context, you can propagate those values across a column to fill the null values. So you can fill null values by propagating

07:38 the values of the same column either forward or backward.

07:45 And the way you do this is by once more using the expression fill_null(). So you’ll refer to the column in question, and you’ll use the expression fill_null(), but this time you use the keyword argument strategy.

08:01 You type in strategy, and then you pick either forward or backward, depending on whether you want to use the values that are before or after, respectively.

08:10 The context is called with columns, not with column. You run this, and then if you take the filter from before, again, you can see that the dinner value was propagated forward, the value lunch was propagated forward, and you no longer have null values in your data.

08:28 If in your context, these are reasonable fixes, you might want to save your data at this point because you’ve fixed these issues. What you will learn in the next lesson is that sometimes you will want to ignore or drop the missing data.

Become a Member to join the conversation.