Ignoring or Dropping Missing Values

Working With Missing Data in Polars Rodrigo Girão Serrão 06:00

00:00 So now that you can identify and fill missing values, you will also learn how to ignore those missing values. Because sometimes the only thing you can do, or at least the easiest and most sensible thing to do, is to just ignore the missing data.

00:15 And when that’s okay to do, Polars has some tools you will want to use.

00:21 Go ahead and open a Jupyter Notebook. You can keep working on the notebook you worked on in the previous lesson, or you can create a fresh one. You will start by noticing that in Polars, some operations ignore null values.

00:38 For example, the column total contains a couple of null values. You can see row number four has a null value, and that doesn’t stop you from computing the mean of the column total.

00:52 As you can see, the mean of the column total is roughly 20.5, and Polars had no issue computing that even though there are missing values there. So some operations in Polars, and this is very important to keep in mind, will ignore or skip missing values.

01:10 And aggregations are a typical example of this. And on the other side of the same coin, some operations in Polars will result in a null if one or more of the, let’s call them operands or arguments, are nulls.

01:30 And one good example of that is arithmetic operations. Those typically result in a null. For example, suppose you want to compute the percentage of tip for each order.

01:42 You will compute this by computing the ratio of the column tip with the column total,

01:49 and then aliasing this to give it a better name. Let’s call it tip percentage.

01:57 And as you compute this, you can see that in row four, for example, the tip percentage is also null because the total and the tip are null. And to make it even clearer, you can go ahead and filter the results to take a look at the rows for which the column tip is null.

02:15 And you will see that for all of those, the tip percentage is null. So arithmetic operations will not ignore null values. You will get a null as a result.

02:25 So for some operations, Polars will already do what’s required or what’s sensible, but you might also want to get rid of the null values altogether. And for that, you can use the method drop_nulls()

02:41 to drop rows that contain at least one null value.

02:48 This is not an expression on a column; this is a method on your DataFrame. So if you do tips.drop_nulls() with no other arguments, you can see here that the shape of the DataFrame went from 180 rows to 174 because it dropped the six rows that contained null values.

03:09 For example, on the screen you can see you’re missing rows three and four.

03:15 But the method drop_nulls() can also be used to drop rows that contain missing values in any one of a specific set of columns.

03:31 For example, you know from the previous lesson that you can easily fill in the missing values in the column “time”. And in that case, you might want to drop only the rows for which either total or tip is null.

03:44 And you can do this by specifying a list of column names as an argument. So you can say "tip", "total",

03:53 and now you can see that the result has 176 rows. It kept the rows for which time is null. For example, row three was kept, but the other four rows where either the tip or the total were missing, those rows were completely dropped.

04:11 And finally, if you want to drop rows based on other criteria, you will typically have to use the context filter. For example, if you want to drop rows for which both the column total and the column tip are null, you will want to use filter() together with the function all_horizontal() that you’ve seen in a previous lesson.

04:40 And then you’ll want to refer to the columns "tip" and "total" to check that they’re null.

04:47 And then the function all_horizontal() is going to combine those predicates into a single one that is true only when both columns are null. And you’ll want to make sure that your call to pl.all_horizontal() is prefixed with a tilde, which is the Boolean operator not, because the context filter will give you all of the rows for which the predicate is true.

05:13 And your predicate with all_horizontal() is finding all rows for which both columns are null. And those are the ones you want to drop. So you need the tilde to negate that condition.

05:24 Since you’re inverting the result, it’ll give you all rows other than the ones for which both the total and the tip are null. So using filter(), you have the most generic way of handling dropping rows.

05:39 And the method drop_nulls() will be very relevant and very useful in some common scenarios. So this is how you can drop some rows with missing values in Polars.

05:50 And up next, you will learn how to deal with NaNs, which are very annoying values, but are not missing values.
