Efficiently handling missing data in Polars is essential for keeping your datasets clean during analysis. Polars provides powerful tools to identify, replace, and remove null values, ensuring seamless data processing.
This tutorial covers practical techniques for managing missing data and highlights Polars’ capabilities to enhance your data analysis workflow. By following along, you’ll gain hands-on experience with these techniques and learn how to ensure your datasets are accurate and reliable.
By the end of this tutorial, you’ll understand that:
- Polars allows you to handle missing data using LazyFrames and DataFrames.
- You can check for null values in Polars using the .null_count() method.
- NaN represents non-numeric values, while null indicates missing data.
- You can replace NaN in Polars by converting them to nulls and using .fill_null().
- You can fix missing data by identifying, replacing, or removing null values.
Before you go any further, you’ll need some data. To begin with, you’ll use the tips.parquet file included in the downloadable materials, which you can access by clicking the link below:
Get Your Code: Click here to download the free sample code that shows you how to deal with missing data in Polars.
The tips.parquet file is a doctored version of data publicly available from Kaggle. The dataset contains information about the tips collected at a fictitious restaurant over several days. Be sure to download it and place it in your project folder before getting started.
Note: Parquet is a file format designed for storing large volumes of data. It keeps disk usage low thanks to the compression algorithm it uses.
In addition, Parquet uses a columnar format and maintains metadata about each column’s content. This means columns can be searched very efficiently, often in parallel, without the need to search through the entire file.
The table below shows details about the columns in the tips.parquet file, along with their Polars data types. The text in parentheses beside each data type shows how these types are annotated in a DataFrame heading when Polars displays its results:
Column Name | Polars Data Type | Description |
---|---|---|
record_id | Int64 (i64) | Unique row identifier |
total | Float64 (f64) | Bill total |
tip | Float64 (f64) | Tip given |
gender | String (str) | Diner’s gender |
smoker | Boolean (bool) | Diner’s smoker status |
day | String (str) | Day of meal |
time | String (str) | Time of meal |
As a starting point, you’ll investigate each of the columns in your data to find out whether or not they contain any null values. To use Polars, you first need to install the Polars library into your Python environment. To do this from a command prompt, you use:
python -m pip install polars
In a Jupyter Notebook, the command becomes:
!python -m pip install polars
Either way, you can then begin to use the Polars library and all of its cool features. Here’s what the data looks like:
>>> import polars as pl
>>> tips = pl.scan_parquet("tips.parquet")
>>> tips.collect()
shape: (180, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬────────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪════════╡
│ 1 ┆ 28.97 ┆ 3.0 ┆ Male ┆ true ┆ Fri ┆ Dinner │
│ 2 ┆ 22.49 ┆ 3.5 ┆ Male ┆ false ┆ Fri ┆ Dinner │
│ 3 ┆ 5.75 ┆ 1.0 ┆ Female ┆ true ┆ Fri ┆ null │
│ 4 ┆ null ┆ null ┆ Male ┆ true ┆ Fri ┆ Dinner │
│ 5 ┆ 22.75 ┆ 3.25 ┆ Female ┆ false ┆ Fri ┆ Dinner │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 176 ┆ 40.55 ┆ 3.0 ┆ Male ┆ true ┆ Sun ┆ Dinner │
│ 177 ┆ 20.69 ┆ 5.0 ┆ Male ┆ false ┆ Sun ┆ Dinner │
│ 178 ┆ 20.9 ┆ 3.5 ┆ Female ┆ true ┆ Sun ┆ Dinner │
│ 179 ┆ 30.46 ┆ 2.0 ┆ Male ┆ true ┆ Sun ┆ Dinner │
│ 180 ┆ 18.15 ┆ 3.5 ┆ Female ┆ true ┆ Sun ┆ Dinner │
└───────────┴───────┴──────┴────────┴────────┴─────┴────────┘
First of all, you import the Polars library into your program. It’s considered good practice to import it using the alias pl. You then read the content of the tips.parquet file into Polars. To do this, you use the scan_parquet() function. This reads the file’s data into a Polars LazyFrame.
Unlike traditional DataFrames that store data, LazyFrames contain only a set of instructions, called a query plan, that defines how the data should be processed. To see the actual data, you still need to read it into a Polars DataFrame. This is called materializing the LazyFrame, and it’s achieved using the .collect() method.
Before a LazyFrame materializes its data, its query plan is optimized. For example, Polars can choose to read only some data from the data source if that’s enough to fulfill the query. Also, when you define a LazyFrame containing multiple instructions, there are no delays while you create it because you don’t need to wait for earlier data reads to complete before adding new instructions. This makes LazyFrames the preferred approach in Polars.
The result has a shape of 180 rows and 7 columns. This is shown at the top of the output as the shape of the materialized DataFrame.
Next, you need to figure out if there’s any missing data you need to deal with:
>>> (
... tips
... .null_count()
... ).collect()
shape: (1, 7)
┌───────────┬───────┬─────┬────────┬────────┬─────┬──────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═══════════╪═══════╪═════╪════════╪════════╪═════╪══════╡
│ 0 ┆ 2 ┆ 4 ┆ 0 ┆ 0 ┆ 0 ┆ 2 │
└───────────┴───────┴─────┴────────┴────────┴─────┴──────┘
To check for the presence of nulls, you use .null_count() on your LazyFrame, which adds an instruction to count the nulls in each column of your data. Normally, this would require a read of the entire file. However, because a Parquet file stores a count of nulls for each column in its metadata, obtaining the counts is almost instantaneous.
To actually trigger the data read, you again use the LazyFrame’s .collect() method. This implements the optimized version of the plan contained within your LazyFrame to obtain the required data.
As you can see, the resulting columns contain the null counts, this time represented as unsigned 32-bit integers (u32). You’ll deal with these in the next section.
How to Work With Missing Data in Polars
When you’re dealing with missing data in Polars, there are several things you can do:
- Recover it
- Remove it
- Replace it
- Ignore it
By far the best plan of action is to review the techniques used to collect the source data, and try to find out why missing data exists in the first place. Perhaps it’s been caused by a manual data entry problem, or a failure of another system to provide the values. Once you know why it’s missing, you can take measures to ensure future data collections are complete, or even repeat the collection process and recover the current missing values.
If you can’t recover the missing data, the next best approach is to remove or replace it. Be aware that removing data may cause bias in calculations, so you should only do this if this isn’t a concern. A better option might be to replace it with an alternative, more usable value.
It’s also possible to ignore missing values. This approach might be fine in cases where you’re collecting text-based data that contains customer reviews or opinions. However, missing numerical data can skew calculations, and your code may even crash if it can’t cope with null values properly.
From your initial findings, you notice that there are two rows with nulls in the total column, and four in the tip column. You’ll deal with these first. You decide to find out whether there are any rows with null values in both of these columns:
>>> import polars as pl
>>> tips = pl.scan_parquet("tips.parquet")
>>> (
... tips
... .filter(
... pl.col("total").is_null() & pl.col("tip").is_null()
... )
... ).collect()
shape: (2, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬────────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪════════╡
│ 4 ┆ null ┆ null ┆ Male ┆ true ┆ Fri ┆ Dinner │
│ 18 ┆ null ┆ null ┆ Female ┆ true ┆ Fri ┆ Lunch │
└───────────┴───────┴──────┴────────┴────────┴─────┴────────┘
After you import the Polars library and create a tips LazyFrame, you add further instructions to filter out everything apart from rows that contain a null in both their total and tip columns. You still need to use .collect() to materialize your LazyFrame into a DataFrame to see the results.
To create the filter, you use .filter() to specify a filter context and pass in an expression to define the criteria. In this case, the expression pl.col("total").is_null() & pl.col("tip").is_null() tells the LazyFrame which columns to analyze. The Boolean AND (&) ensures that only those rows whose tip and total values are both null are included in the result.
Note: Polars uses Python’s bitwise logical operators, &, |, and ~, to perform elementwise Boolean operations. This is possible thanks to Python’s support for operator overloading.
One point to note is that applying the .filter() method to tips doesn’t actually change the original tips LazyFrame in any way. Instead, it creates a fresh LazyFrame containing an updated set of instructions. This means several LazyFrames can be developed from the original tips LazyFrame by applying different methods to it.
Later in the tutorial, you’ll develop this idea further by appending successive methods to a base LazyFrame, with each new method refining the query a little more. This is commonly known as method chaining and is the recommended way of manipulating LazyFrames.
As you can see from the output, the records with a record_id of 4 and 18 meet the filter’s criteria.
After further investigation, you decide to drop these rows since they’re the only two where the total column contains a null value, as revealed by your earlier use of .null_count():
>>> (
... tips
... .drop_nulls(pl.col("total"))
... .filter(
... pl.col("total").is_null() & pl.col("tip").is_null()
... )
... ).collect()
shape: (0, 7)
┌───────────┬───────┬─────┬────────┬────────┬─────┬──────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪═════╪════════╪════════╪═════╪══════╡
└───────────┴───────┴─────┴────────┴────────┴─────┴──────┘
To drop these rows, you use the LazyFrame’s .drop_nulls() method. This adds an instruction to your LazyFrame to drop all rows whose total column contains null values. Basing the drop on this one column works fine here because there are no other rows containing null in their total column. You’ll see later how to base the drop on the presence of nulls in both columns.
The .filter() method won’t be needed in the final code. You’re including it here to display the filtered data and confirm the rows have indeed been deleted. As you can see, this time the filter finds nothing.
Recall that your initial analysis revealed some nulls in the tip column for rows that also have amounts in their total column. You decide these are genuine records, so you update tip to contain zero for them. This ensures there won’t be any unnecessary bias should you need to analyze the tips later. To do this, you use .fill_null():
>>> (
... tips
... .drop_nulls(pl.col("total"))
... .with_columns(pl.col("tip").fill_null(0))
... .filter(pl.col("tip").is_null())
... ).collect()
shape: (0, 7)
┌───────────┬───────┬─────┬────────┬────────┬─────┬──────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪═════╪════════╪════════╪═════╪══════╡
└───────────┴───────┴─────┴────────┴────────┴─────┴──────┘
You add another instruction to your LazyFrame, this time telling it to fill any existing null values in the tip column with the number 0. You apply the instruction to the tip column within the .with_columns() context, specifying your exact requirements as its parameter. You use .with_columns() because you’re changing the value of the column’s data.
To verify the changes, a filter is applied once again to confirm that the tip column no longer contains any null values. As with your earlier filter, this one should also be removed in the final version of the code.
Now that you’ve learned the basics of removing and replacing null values, it’s time to explore other Polars techniques for handling missing data.
Using a More Strategic Approach
In this section, you’ll deal with the null values that exist within the time column. As you saw earlier, there are two of them. First of all, you’ll take a look at the rows you need to deal with:
>>> import polars as pl
>>> tips = pl.scan_parquet("tips.parquet")
>>> (
... tips
... .filter(
... pl.col("time").is_null()
... )
... ).collect()
shape: (2, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬──────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪══════╡
│ 3 ┆ 5.75 ┆ 1.0 ┆ Female ┆ true ┆ Fri ┆ null │
│ 15 ┆ 8.58 ┆ 1.92 ┆ Male ┆ true ┆ Fri ┆ null │
└───────────┴───────┴──────┴────────┴────────┴─────┴──────┘
Using the .filter() method, as before, reveals that the records with record_id values of 3 and 15 contain null values in their time column. To deal with these, you decide to look for clues in the surrounding data:
>>> (
... tips
... .filter(
... pl.col("record_id").is_in([2, 3, 4, 14, 15, 16])
... )
... ).collect()
shape: (6, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬────────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪════════╡
│ 2 ┆ 22.49 ┆ 3.5 ┆ Male ┆ false ┆ Fri ┆ Dinner │
│ 3 ┆ 5.75 ┆ 1.0 ┆ Female ┆ true ┆ Fri ┆ null │
│ 4 ┆ null ┆ null ┆ Male ┆ true ┆ Fri ┆ Dinner │
│ 14 ┆ 13.42 ┆ 3.48 ┆ Female ┆ true ┆ Fri ┆ Lunch │
│ 15 ┆ 8.58 ┆ 1.92 ┆ Male ┆ true ┆ Fri ┆ null │
│ 16 ┆ 15.98 ┆ 3.0 ┆ Female ┆ false ┆ Fri ┆ Lunch │
└───────────┴───────┴──────┴────────┴────────┴─────┴────────┘
You may have noticed that a pair of null values has reappeared in the total and tip columns of record 4. This is because the filter is applied to the original tips LazyFrame. This problem will sort itself out when you add the new instructions to your modified LazyFrame.
What stands out this time is that the same value surrounds each null in the time column. You decide to fill each of these null values with their surrounding value. To do this, you can make use of the strategy parameter of .fill_null():
>>> (
... tips
... .drop_nulls("total")
... .with_columns(pl.col("tip").fill_null(0))
... .with_columns(pl.col("time").fill_null(strategy="forward"))
... .filter(pl.col("record_id").is_in([3, 15]))
... ).collect()
shape: (2, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬────────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪════════╡
│ 3 ┆ 5.75 ┆ 1.0 ┆ Female ┆ true ┆ Fri ┆ Dinner │
│ 15 ┆ 8.58 ┆ 1.92 ┆ Male ┆ true ┆ Fri ┆ Lunch │
└───────────┴───────┴──────┴────────┴────────┴─────┴────────┘
Here, .with_columns() analyzes each row in the time column. When it reaches a cell whose time value is null, .fill_null() replaces the null using a forward strategy. This means it forward-fills the null with the previous non-null value, which will be the Dinner value from record 2. Similarly, record 15 has its null replaced with Lunch from record 14.
As before, you use a filter on these rows so you can see the results of your work. The forward strategy is just one of several strategies available to you.
Next, you’ll look at how to deal with null data that exists across multiple columns.
Dealing With Nulls Across Multiple Columns
Earlier, you saw that records 4 and 18 in the original tips.parquet file have null values in both the total and tip columns. You removed them by using only the null values in the total column. This worked because these were the only two records with nulls in that column.
Suppose there were other rows where total contained a null value but tip had a valid entry. These would have been removed as well, even though they may not have needed to be. It would be safer to remove only those records where the null values exist in both columns. To do this, you can make use of the all_horizontal() function:
>>> import polars as pl
>>> tips = pl.scan_parquet("tips.parquet")
>>> (
... tips
... .filter(
... pl.all_horizontal(pl.col("total", "tip").is_null())
... )
... ).collect()
shape: (2, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬────────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪════════╡
│ 4 ┆ null ┆ null ┆ Male ┆ true ┆ Fri ┆ Dinner │
│ 18 ┆ null ┆ null ┆ Female ┆ true ┆ Fri ┆ Lunch │
└───────────┴───────┴──────┴────────┴────────┴─────┴────────┘
The all_horizontal() function allows you to compute a Boolean AND horizontally across columns. You use it here to select rows that have nulls in both their total and tip columns.
By passing both the total and tip columns to pl.col() and using .is_null(), you’re effectively passing both pl.col("total").is_null() and pl.col("tip").is_null() as two separate parameters to all_horizontal(). Each parameter yields a True or False value for each row, depending on whether or not its total and tip columns contain null values.
If both parameters are True, all_horizontal() returns True and that row is included in the output. However, if either of them returns False, all_horizontal() also returns False for that row, meaning it’s ignored in the output. That’s why no other rows are present: there are no other rows with both a total of null and a tip of null.
You used all_horizontal() to produce the same two rows as before. If you want to go one step further and instruct your LazyFrame to ignore these rows, you just need to make one small change to your code:
>>> (
... tips
... .filter(
... ~pl.all_horizontal(pl.col("total", "tip").is_null())
... )
... ).collect()
shape: (178, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬────────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪════════╡
│ 1 ┆ 28.97 ┆ 3.0 ┆ Male ┆ true ┆ Fri ┆ Dinner │
│ 2 ┆ 22.49 ┆ 3.5 ┆ Male ┆ false ┆ Fri ┆ Dinner │
│ 3 ┆ 5.75 ┆ 1.0 ┆ Female ┆ true ┆ Fri ┆ null │
│ 5 ┆ 22.75 ┆ 3.25 ┆ Female ┆ false ┆ Fri ┆ Dinner │
│ 6 ┆ 40.17 ┆ 4.73 ┆ Male ┆ true ┆ Fri ┆ Dinner │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 176 ┆ 40.55 ┆ 3.0 ┆ Male ┆ true ┆ Sun ┆ Dinner │
│ 177 ┆ 20.69 ┆ 5.0 ┆ Male ┆ false ┆ Sun ┆ Dinner │
│ 178 ┆ 20.9 ┆ 3.5 ┆ Female ┆ true ┆ Sun ┆ Dinner │
│ 179 ┆ 30.46 ┆ 2.0 ┆ Male ┆ true ┆ Sun ┆ Dinner │
│ 180 ┆ 18.15 ┆ 3.5 ┆ Female ┆ true ┆ Sun ┆ Dinner │
└───────────┴───────┴──────┴────────┴────────┴─────┴────────┘
Can you spot the difference? Take a close look at the filter expression. You added the NOT operator (~) to the all_horizontal() call. This has the effect of negating it.
In the earlier version of this code, you added instructions to your LazyFrame causing it to display only two rows. This time, the NOT operator ensures that the LazyFrame displays everything except those two rows. As a result, rows where both the total and tip columns contain null values are no longer included.
You could have used this code in place of the earlier version to remove these values immediately. The full version of your null-cleansing code now looks like this:
>>> import polars as pl
>>> tips = pl.scan_parquet("tips.parquet")
>>> (
... tips
... .filter(
... ~pl.all_horizontal(pl.col("total", "tip").is_null())
... )
... .with_columns(pl.col("tip").fill_null(0))
... .with_columns(pl.col("time").fill_null(strategy="forward"))
... ).null_count().collect()
shape: (1, 7)
┌───────────┬───────┬─────┬────────┬────────┬─────┬──────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═══════════╪═══════╪═════╪════════╪════════╪═════╪══════╡
│ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
└───────────┴───────┴─────┴────────┴────────┴─────┴──────┘
In a production environment, you’d probably use .collect() to view the cleansed version of your data, or .collect().write_parquet("tips_clean.parquet") to update your original data file. Here, you use .null_count().collect() so you can confirm that the original null values no longer exist.
Note: In addition to the all_horizontal() function, which gives you AND functionality, there’s also an any_horizontal() function that performs an OR if you need it. Similarly, in addition to .is_null(), there’s also .is_not_null(). These offer you many permutations for dealing with null values across multiple columns.
Congratulations! You’ve successfully dealt with all of those annoying null values, but there are other techniques that you may find handy to know. The final method you’ll learn about for removing null values involves using the column’s data type.
Dealing With Nulls by Column Data Type
In this section, you’ll learn about Polars selectors. Selectors allow you to select columns based not only on their names but also on their data types and other properties.
To begin with, you’ll need a new LazyFrame. The code below creates one named scientists:
>>> import polars as pl
>>> import polars.selectors as cs
>>> scientists = pl.LazyFrame(
... {
... "scientist_id": [1, 2, 3, 4, 5],
... "first_name": ["Isaac", "Louis", None, "Charles", "Marie"],
... "last_name": [None, "Pasteur", "Einstein", "Darwin", "Curie"],
... "birth_year": [1642, 1822, None, 1809, 1867],
... "death_year": [1726, 1895, 1955, None, 1934],
... }
... )
>>> scientists.collect()
shape: (5, 5)
┌──────────────┬────────────┬───────────┬────────────┬────────────┐
│ scientist_id ┆ first_name ┆ last_name ┆ birth_year ┆ death_year │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 ┆ i64 │
╞══════════════╪════════════╪═══════════╪════════════╪════════════╡
│ 1 ┆ Isaac ┆ null ┆ 1642 ┆ 1726 │
│ 2 ┆ Louis ┆ Pasteur ┆ 1822 ┆ 1895 │
│ 3 ┆ null ┆ Einstein ┆ null ┆ 1955 │
│ 4 ┆ Charles ┆ Darwin ┆ 1809 ┆ null │
│ 5 ┆ Marie ┆ Curie ┆ 1867 ┆ 1934 │
└──────────────┴────────────┴───────────┴────────────┴────────────┘
The LazyFrame contains details of five renowned scientists from various disciplines. While creating the LazyFrame, several null values were included in it. To create these null values, you used None instead of normal data values.
Suppose you wanted to update the null values to “Unknown” in the columns with a data type of str, and replace null values in the i64 columns with zero. To do this, you can use Polars selectors:
>>> import polars.selectors as cs
>>> (
... scientists
... .with_columns(cs.string().fill_null("Unknown"))
... .with_columns(cs.integer().fill_null(0))
... ).collect()
shape: (5, 5)
┌──────────────┬────────────┬───────────┬────────────┬────────────┐
│ scientist_id ┆ first_name ┆ last_name ┆ birth_year ┆ death_year │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 ┆ i64 │
╞══════════════╪════════════╪═══════════╪════════════╪════════════╡
│ 1 ┆ Isaac ┆ Unknown ┆ 1642 ┆ 1726 │
│ 2 ┆ Louis ┆ Pasteur ┆ 1822 ┆ 1895 │
│ 3 ┆ Unknown ┆ Einstein ┆ 0 ┆ 1955 │
│ 4 ┆ Charles ┆ Darwin ┆ 1809 ┆ 0 │
│ 5 ┆ Marie ┆ Curie ┆ 1867 ┆ 1934 │
└──────────────┴────────────┴───────────┴────────────┴────────────┘
To use selectors, you must first import polars.selectors. It’s standard practice to do so using the alias cs. Think of this as short for column selectors.
As before, because you’re going to change columns, you once again use .with_columns(). This time, you choose each of the string columns using the string data type selector cs.string(), and each of the integer columns using cs.integer(). To update the null values within these columns, you again use .fill_null().
In this case, you replace the nulls in each of the str columns with the string Unknown, and each null in the i64 columns (or i32 columns, if there were any) with a 0. If you look closely at the output, you’ll see that the first_name and last_name columns no longer contain any null values. Instead, you’ve replaced them with Unknown. Also, the null values that were in birth_year and death_year have all been replaced with 0.
Although nulls can be a nuisance, they’re not the only values that can cause you problems. Next, you’ll learn about another annoyance you may encounter.
Dealing With Those Pesky NaNs and Infs
Sometimes, the data you’re analyzing contains the results of calculations. If incorrect data enters a calculation, erroneous results may appear. The most common examples are not-a-number values (NaN), usually caused by trying to perform a numeric calculation on non-numeric data, and infinity values (inf, -inf), which are usually caused by division by zero.
As with null data, you should try to find out why the invalid data has occurred and see if you can resolve the situation. If not, you’ll probably want to convert the offending values to null values, then use the techniques you learned earlier to remove them or replace them with something better suited to your analysis.
The code below scans the sales_trends.csv file included in your downloads into a LazyFrame named sales_trends:
>>> import polars as pl
>>> sales_trends = pl.scan_csv("sales_trends.csv")
>>> sales_trends.collect()
shape: (5, 4)
┌─────────┬───────────┬──────────────┬───────────┐
│ product ┆ last_year ┆ current_year ┆ next_year │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 │
╞═════════╪═══════════╪══════════════╪═══════════╡
│ A ┆ 17 ┆ 19 ┆ 29.0 │
│ B ┆ 35 ┆ 35 ┆ NaN │
│ C ┆ 21 ┆ 19 ┆ null │
│ D ┆ 42 ┆ 50 ┆ -inf │
│ E ┆ 23 ┆ 25 ┆ inf │
└─────────┴───────────┴──────────────┴───────────┘
This LazyFrame contains sales volumes for five products labeled A through E. The sales volumes for last_year and the current_year are complete. However, the next_year figures show some erroneous data. Before going any further with this data, you decide to replace it with null values:
>>> (
... sales_trends
... .with_columns(
... pl.col("next_year").replace(
... [float("inf"), -float("inf"), float("NaN")], None
... )
... )
... ).collect()
shape: (5, 4)
┌─────────┬───────────┬──────────────┬───────────┐
│ product ┆ last_year ┆ current_year ┆ next_year │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 │
╞═════════╪═══════════╪══════════════╪═══════════╡
│ A ┆ 17 ┆ 19 ┆ 29.0 │
│ B ┆ 35 ┆ 35 ┆ null │
│ C ┆ 21 ┆ 19 ┆ null │
│ D ┆ 42 ┆ 50 ┆ null │
│ E ┆ 23 ┆ 25 ┆ null │
└─────────┴───────────┴──────────────┴───────────┘
To replace the various offending values in the next_year column, you passed a list of them as a parameter to the .replace() method. As you can see from the output, the erroneous data is now set to null.
Now that all of the erroneous data has been replaced with null values, you can use .fill_null() to update them all using the same logic. For example, in the absence of anything better, you decide to predict next year’s data by extrapolating it from the known data:
>>> (
... sales_trends
... .with_columns(
... pl.col("next_year").replace(
... [float("inf"), -float("inf"), float("NaN")], None
... )
... )
... .with_columns(
... pl.col("next_year").fill_null(
... pl.col("current_year")
... + (pl.col("current_year") - pl.col("last_year"))
... )
... )
... ).collect()
shape: (5, 4)
┌─────────┬───────────┬──────────────┬───────────┐
│ product ┆ last_year ┆ current_year ┆ next_year │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 │
╞═════════╪═══════════╪══════════════╪═══════════╡
│ A ┆ 17 ┆ 19 ┆ 29.0 │
│ B ┆ 35 ┆ 35 ┆ 35.0 │
│ C ┆ 21 ┆ 19 ┆ 17.0 │
│ D ┆ 42 ┆ 50 ┆ 58.0 │
│ E ┆ 23 ┆ 25 ┆ 27.0 │
└─────────┴───────────┴──────────────┴───────────┘
Once the erroneous data has been replaced with null values, you add some additional instructions telling your LazyFrame how to deal with them. You replace each null with an adjusted version of the current_year figure: you add the difference between the current_year and last_year sales figures to current_year to produce a next_year estimate. Your data has now been cleansed of those pesky NaNs and infs.
It’s time to show off your prowess with a short exercise.
Practicing Your Skills
Now it’s your turn to demonstrate your skills and have some fun handling null data in Polars. After leaving the Monty Python comedy troupe, from which the Python language takes its name, John Cleese wrote the successful television show Fawlty Towers. In this section, you’ll analyze some data relating to it.
The data you’ll need for this exercise is stored within the ft_exercise.parquet file that comes with your downloadables. It contains four columns of data relating to the episodes of Fawlty Towers:
Column Name | Polars Data Type | Description |
---|---|---|
episode | Int64 | Episode number (1 to 12) |
series | Int64 | Series number (1 or 2) |
title | String | Episode title |
original_date | Date | First transmission date |
You’ve received this data, but quickly notice that it contains some null values. You use the code shown below to reveal their volume:
>>> import polars as pl
>>> episodes = pl.scan_parquet("ft_exercise.parquet")
>>> episodes.null_count().collect()
shape: (1, 4)
┌─────────┬────────┬───────┬───────────────┐
│ episode ┆ series ┆ title ┆ original_date │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═════════╪════════╪═══════╪═══════════════╡
│ 0 ┆ 2 ┆ 2 ┆ 1 │
└─────────┴────────┴───────┴───────────────┘
After performing some further research, you decide to replace these null values using the following logic:
- There are two nulls in the series column, for episodes 6 and 7. Knowing that there were twelve episodes split evenly over two series, and that the data is in series order, see if you can replace the two null values with their correct data.
- Two episode title values are missing. Episode 4 is named The Hotel Inspectors, and episode 9 is named Waldorf Salad. Replace these nulls with the correct data.
- Episode 2 is missing its original_date value. See if you can work out how to insert the correct date using the existing information in the table. You may need to research this one.
You may also need to use your investigative skills to search the documentation for clues. Remember, that’s what it’s there for. Once you think you’ve solved each part, you might like to write it out to a CSV file to check that you’re on the right track. You can do this by calling .collect().write_csv("solution.csv") on your LazyFrame.
You’ll find one possible solution in the dropdown below. There’s also an ft_exercise_solution.csv file that contains the corrected records should you need some hints.
One possible solution is shown below:
>>> import polars as pl
>>> episodes = pl.scan_parquet("ft_exercise.parquet")
>>> (
... episodes
... .with_columns(
... pl.when(pl.col("episode") == 6)
... .then(pl.col("series").fill_null(strategy="forward"))
... .otherwise(pl.col("series").fill_null(strategy="backward"))
... )
... .with_columns(
... pl.when(pl.col("episode") == 4)
... .then(pl.col("title").fill_null("The Hotel Inspectors"))
... .otherwise(pl.col("title").fill_null("Waldorf Salad"))
... )
... .with_columns(
... pl.col("original_date").interpolate()
... )
... ).null_count().collect()
shape: (1, 4)
┌─────────┬────────┬───────┬───────────────┐
│ episode ┆ series ┆ title ┆ original_date │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═════════╪════════╪═══════╪═══════════════╡
│ 0 ┆ 0 ┆ 0 ┆ 0 │
└─────────┴────────┴───────┴───────────────┘
The `when-then-otherwise` expression is used to apply different fills using conditional logic, similar to that provided by an `if` statement used with an `else` clause.
For the `series` column, the `forward` strategy is used to fill the `null` for episode `6`. This means the previous non-null value of `1` is used as a replacement. A `backward` strategy is used for the other `series` null, relating to episode `7`, meaning the next non-null value of `2` is used as a replacement. Given that there were two series of six episodes each, these are sensible options.
For the `title` column, the `null` relating to episode `4` is replaced with `The Hotel Inspectors`, while `Waldorf Salad` replaces the `null` of episode `9`. A little research will show these to be the correct episode titles.
Did you manage to get the missing date correct? Well done if you did. The update was made by selecting the `original_date` column and calling its `.interpolate()` method. This uses linear interpolation to estimate the missing value based on the surrounding values. Incidentally, if you thought it should be one of the `strategy` parameters, you’re not the only one.
Well done if you managed to complete that exercise—and indeed, well done for making it this far!
Conclusion
In this tutorial, you’ve learned various techniques for removing or replacing `null` values in Polars.

You also learned that while there are various ways to deal with `null` values, the best approach is to avoid them in the first place. In other words, see if you can find out what caused them and take steps to prevent it from happening again. You should also see if you can figure out what’s missing and try to recover it.
Congratulations on completing this tutorial. Enjoy applying these newfound skills to better prepare your LazyFrames for analysis.
Take the Quiz: Test your knowledge with our interactive “How to Deal With Missing Data in Polars” quiz. You’ll receive a score upon completion to help you track your learning progress:
Frequently Asked Questions
Now that you have some experience with handling missing data in Polars, you can use the questions and answers below to check your understanding and recap what you’ve learned.
These FAQs are related to the most important concepts you’ve covered in this tutorial. Click the Show/Hide toggle beside each question to reveal the answer.
**How do you check for null values in Polars?**

You can check for `null` values in Polars using the `.null_count()` method, which provides a count of the nulls in each column of your dataset.
**What's the difference between NaN and null in Polars?**

`NaN` represents non-numeric values resulting from invalid calculations, while `null` indicates missing data within your dataset.
**How do you replace NaN values in Polars?**

You replace `NaN` values in Polars by converting them to nulls and then using the `.fill_null()` method to replace them with more suitable values.
**How do you fix missing data in Polars?**

You can fix missing data by identifying `null` values with `.null_count()`, then either replacing them with `.fill_null()`, removing them with `.drop_nulls()`, or handling them with another strategy.
**What's the best way to handle missing data?**

To handle missing data, recover it if possible, replace it with a suitable value, or remove it if doing so doesn’t introduce bias or other issues into your analysis.