
Identifying Missing Data

00:00 Now that you’re all set up, you can start with the very first step, which is to dig into your data and try to identify whether any values are missing.

00:11 That’s the very first thing you need to do before you proceed and work with your missing data. So that’s what I’ll show you how to do in this lesson.

00:20 You will want to go ahead and open a Jupyter Notebook. To make sure you can write your code exactly the same as mine, open your notebook session next to the materials folder.

00:34 You can see that the materials folder is inside the polars-missing-data folder, which contains the virtual environment where Jupyter is installed. So you’ll want to go ahead and ensure that your virtual environment is activated.

00:48 You can activate it by typing the command you see on screen: source .venv/bin/activate. Once the environment is activated, run the command jupyter notebook.

01:02 This will output some warnings and some information, and then it’ll probably open a browser window. If it doesn’t, you’ll find some URLs at the end of the output that contain a long token.

01:19 You will want to copy one of those URLs and paste it in your preferred browser.

01:28 So running the command jupyter notebook should open a window like this, or if it didn’t, pasting the link should. Then, at the top left, click File > New > Notebook.

01:46 This creates a new empty notebook that you will be able to work on.

01:53 Once you have your notebook open, you’ll want to import polars as pl. If you followed the steps in the previous lesson, this should work just fine.

02:02 You import polars as pl because that’s standard practice. After that, you will want to read in the tips dataset that you also downloaded in the previous lesson.

02:14 Go ahead and use the Polars function read_parquet(). The only argument you need to pass in is the path to the dataset. In this case, that’s going to be materials/tips.parquet.

02:30 And once you read it, to make sure everything worked just fine, go ahead and take a look at the head of your DataFrame, which should show you the first five rows and all seven columns.

02:42 This dataset contains some data about tips from orders at a restaurant. You will see a record_id, the total value of the order, the value of the tip, the gender of the customer, a Boolean value indicating whether the customer is a smoker or not, the day of the week on which the order took place, and the time of the day, which can be either dinner or lunch.

03:04 And once you have that, you will use the method null_count() to count how many null values there are in each column.

03:17 The special value null is how Polars represents missing data.

03:25 So this is what you’re looking for, and null_count() will tell you that the columns total, tip, and time have some null values. What you want to do next is figure out where they are.

03:34 And for that, you’re going to use is_null(). So go ahead and use the select() context to refer to the column “time”, then use the expression is_null() and run it.

03:52 Once you run this, you should get a DataFrame with only one column that is as tall as the original DataFrame. And each Boolean value, each true or each false, will indicate whether the time of day was null in the original DataFrame.

04:09 And from a computational point of view, the expression is_null() is essentially free because Polars already stores this information. Keeping it always available costs you a little bit more memory, but Polars doesn’t need to scan the values to find out which ones are null.

04:25 And what you can do is take the exact same code, but instead of the context select(), you can use the context filter(). If you again refer to the column “time” and use the expression is_null(), you will filter for the rows for which the time is null. And therefore you get a result with two rows, which are the only two rows where the time of day was missing.

04:54 But you’ll want to do this for all columns, not just the column “time”. So you need to do something else for that to work.

05:04 Go back to using the context select(), and now use pl.col() to refer to the three columns that had null values, which were “total”, “tip”, and “time”.

05:17 If you refer to these three columns and use the expression is_null(), you will get a DataFrame that contains three columns.

05:27 And each column has Boolean values.

05:31 For example, the third row has the values false, false, true. And this shows that the total was not null and the tip was not null, but the time of the day was null for this third row.

05:43 And what you want to do now is combine these three columns into a single column with Boolean values, indicating whether or not any column for that row is null.

05:55 And because you want to combine these Boolean values with the any() operation, you have to use the Polars function any_horizontal(). It’s “horizontal” because it works along the rows.

06:07 So go ahead and use the context select() again. Keep the expression inside select() the same, but this time also put it inside the function any_horizontal().

06:19 So you will refer to the three columns, “total”, “tip”, and “time”, and use the expression is_null().

06:30 And if you run this, you get a single column with Boolean values that now you can use to filter. So you can go ahead, you can actually copy the code, you can paste it, and instead of using the context select(), you can use the context filter().

06:47 And if you run it, your result will be a DataFrame with six rows and seven columns, and it’ll show you all the rows that have some missing values.

06:58 So now that you’re able to identify these missing values, in the next lesson, I’ll show you how to work with them.
