Efficiently handling missing data in Polars is essential for keeping your datasets clean during analysis. Polars provides powerful tools to identify, replace, and remove null values, ensuring seamless data processing.
This tutorial covers practical techniques for managing missing data and highlights Polars’ capabilities to enhance your data analysis workflow. By following along, you’ll gain hands-on experience with these techniques and learn how to ensure your datasets are accurate and reliable.
By the end of this tutorial, you’ll understand that:
- Polars allows you to handle missing data using LazyFrames and DataFrames.
- You can check for null values in Polars using the .null_count() method.
- NaN represents non-numeric values, while null indicates missing data.
- You can replace NaN in Polars by converting them to nulls and using .fill_null().
- You can fix missing data by identifying, replacing, or removing null values.
Before you go any further, you’ll need some data. To begin with, you’ll use the tips.parquet file included in the downloadable materials, which you can access by clicking the link below:
Get Your Code: Click here to download the free sample code that shows you how to deal with missing data in Polars.
The tips.parquet file is a doctored version of data publicly available from Kaggle. The dataset contains information about the tips collected at a fictitious restaurant over several days. Be sure to download it and place it in your project folder before getting started.
Note: Parquet is a file format designed for storing large volumes of data. It keeps disk usage low thanks to the compression algorithm it uses.
In addition, Parquet uses a columnar format and maintains metadata about each column’s content. This means columns can be searched very efficiently, often in parallel, without the need to search through the entire file.
The table below shows details about the columns in the tips.parquet file, along with their Polars data types. The text in parentheses beside each data type shows how these types are annotated in a DataFrame heading when Polars displays its results:
Column Name | Polars Data Type | Description |
---|---|---|
record_id | Int64 (i64) | Unique row identifier |
total | Float64 (f64) | Bill total |
tip | Float64 (f64) | Tip given |
gender | String (str) | Diner’s gender |
smoker | Boolean (bool) | Diner’s smoker status |
day | String (str) | Day of meal |
time | String (str) | Time of meal |
As a starting point, you’ll investigate each of the columns in your data to find out whether or not they contain any null values. To use Polars, you first need to install the Polars library into your Python environment. To do this from a command prompt, you use:
python -m pip install polars
In a Jupyter Notebook, the command becomes:
!python -m pip install polars
Either way, you can then begin to use the Polars library and all of its cool features. Here’s what the data looks like:
>>> import polars as pl
>>> tips = pl.scan_parquet("tips.parquet")
>>> tips.collect()
shape: (180, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬────────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪════════╡
│ 1 ┆ 28.97 ┆ 3.0 ┆ Male ┆ true ┆ Fri ┆ Dinner │
│ 2 ┆ 22.49 ┆ 3.5 ┆ Male ┆ false ┆ Fri ┆ Dinner │
│ 3 ┆ 5.75 ┆ 1.0 ┆ Female ┆ true ┆ Fri ┆ null │
│ 4 ┆ null ┆ null ┆ Male ┆ true ┆ Fri ┆ Dinner │
│ 5 ┆ 22.75 ┆ 3.25 ┆ Female ┆ false ┆ Fri ┆ Dinner │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 176 ┆ 40.55 ┆ 3.0 ┆ Male ┆ true ┆ Sun ┆ Dinner │
│ 177 ┆ 20.69 ┆ 5.0 ┆ Male ┆ false ┆ Sun ┆ Dinner │
│ 178 ┆ 20.9 ┆ 3.5 ┆ Female ┆ true ┆ Sun ┆ Dinner │
│ 179 ┆ 30.46 ┆ 2.0 ┆ Male ┆ true ┆ Sun ┆ Dinner │
│ 180 ┆ 18.15 ┆ 3.5 ┆ Female ┆ true ┆ Sun ┆ Dinner │
└───────────┴───────┴──────┴────────┴────────┴─────┴────────┘
First of all, you import the Polars library into your program. It’s considered good practice to import it using the alias pl. You then read the content of the tips.parquet file into Polars. To do this, you use the scan_parquet() function. This reads the file’s data into a Polars LazyFrame.
Unlike traditional DataFrames that store data, LazyFrames contain only a set of instructions, called a query plan, that defines how the data should be processed. To see the actual data, you still need to read it into a Polars DataFrame. This is called materializing the LazyFrame, and it’s achieved using the .collect() method.
Before a LazyFrame materializes its data, its query plan is optimized. For example, Polars can choose to read only some data from the data source if that’s enough to fulfill the query. Also, when you define a LazyFrame containing multiple instructions, there are no delays while you create it because you don’t need to wait for earlier data reads to complete before adding new instructions. This makes LazyFrames the preferred approach in Polars.
The result has a shape of 180 rows and 7 columns. This is shown at the top of the output as the shape of the materialized DataFrame.
Next, you need to figure out if there’s any missing data you need to deal with:
>>> (
... tips
... .null_count()
... ).collect()
shape: (1, 7)
┌───────────┬───────┬─────┬────────┬────────┬─────┬──────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═══════════╪═══════╪═════╪════════╪════════╪═════╪══════╡
│ 0 ┆ 2 ┆ 4 ┆ 0 ┆ 0 ┆ 0 ┆ 2 │
└───────────┴───────┴─────┴────────┴────────┴─────┴──────┘
To check for the presence of nulls, you use .null_count() on your LazyFrame, which adds an instruction to count the nulls in each column of your data. Normally, this would require a read of the entire file. However, because a Parquet file stores a count of nulls for each column in its metadata, obtaining the counts is almost instantaneous.
To actually trigger the data read, you again use the LazyFrame’s .collect() method. This implements the optimized version of the plan contained within your LazyFrame to obtain the required data.
As you can see, the resulting columns contain the null counts, this time represented as unsigned 32-bit integers (u32). You’ll deal with these in the next section.
How to Work With Missing Data in Polars
When you’re dealing with missing data in Polars, there are several things you can do:
- Recover it
- Remove it
- Replace it
- Ignore it
By far the best plan of action is to review the techniques used to collect the source data, and try to find out why missing data exists in the first place. Perhaps it’s been caused by a manual data entry problem, or a failure of another system to provide the values. Once you know why it’s missing, you can take measures to ensure future data collections are complete, or even repeat the collection process and recover the current missing values.
If you can’t recover the missing data, the next best approach is to remove or replace it. Be aware that removing data may cause bias in calculations, so you should only do this if this isn’t a concern. A better option might be to replace it with an alternative, more usable value.
It’s also possible to ignore missing values. This approach might be fine in cases where you’re collecting text-based data that contains customer reviews or opinions. However, missing numerical data can skew calculations, and your code may even crash if it can’t cope with null values properly.
From your initial findings, you notice that there are two rows with nulls in the total column, and four in the tip column. You’ll deal with these first. You decide to find out whether there are any rows with null values in both of these columns:
>>> import polars as pl
>>> tips = pl.scan_parquet("tips.parquet")
>>> (
... tips
... .filter(
... pl.col("total").is_null() & pl.col("tip").is_null()
... )
... ).collect()
shape: (2, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬────────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪════════╡
│ 4 ┆ null ┆ null ┆ Male ┆ true ┆ Fri ┆ Dinner │
│ 18 ┆ null ┆ null ┆ Female ┆ true ┆ Fri ┆ Lunch │
└───────────┴───────┴──────┴────────┴────────┴─────┴────────┘
After you import the Polars library and create a tips LazyFrame, you add further instructions to filter out everything apart from rows that contain a null in both their total and tip columns. You still need to use .collect() to materialize your LazyFrame into a DataFrame to see the results.
To create the filter, you use .filter() to specify a filter context and pass in an expression to define the criteria. In this case, the expression pl.col("total").is_null() & pl.col("tip").is_null() tells the LazyFrame which columns to analyze. The Boolean AND (&) ensures that only those rows whose tip and total values are both null are included in the result.
Note: Polars uses Python’s bitwise logical operators, &, |, and ~, to perform elementwise Boolean operations. This is possible thanks to Python’s support for operator overloading.
One point to note is that applying the .filter() method to tips doesn’t actually change the original tips LazyFrame in any way. Instead, it creates a fresh LazyFrame containing an updated set of instructions. This means several LazyFrames can be developed from the original tips LazyFrame by applying different methods to it.
Later in the tutorial, you’ll develop this idea further by appending successive methods to a base LazyFrame, with each new method refining the query a little more. This is commonly known as method chaining and is the recommended way of manipulating LazyFrames.
As you can see from the output, the records with a record_id of 4 and 18 meet the filter’s criteria.
After further investigation, you decide to drop these rows since they’re the only two where the total column contains a null value, as revealed by your earlier use of .null_count():
>>> (
... tips
... .drop_nulls(pl.col("total"))
... .filter(
... pl.col("total").is_null() & pl.col("tip").is_null()
... )
... ).collect()
shape: (0, 7)
┌───────────┬───────┬─────┬────────┬────────┬─────┬──────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪═════╪════════╪════════╪═════╪══════╡
└───────────┴───────┴─────┴────────┴────────┴─────┴──────┘
To drop these rows, you use the LazyFrame’s .drop_nulls() method. This adds an instruction to your LazyFrame to drop all rows whose total column contains null values. Basing the drop on this one column works fine here because there are no other rows containing null in their total column. You’ll see later how to base the drop on the presence of nulls in both columns.
The .filter() method won’t be needed in the final code. You’re including it here to display the filtered data and confirm the rows have indeed been deleted. As you can see, this time the filter finds nothing.
Recall that your initial analysis revealed some nulls in the tip column for rows that also have amounts in their total column. You decide these are genuine records, so you update tip to contain zero for them. This ensures there won’t be any unnecessary bias should you need to analyze the tips later. To do this, you use .fill_null():
>>> (
... tips
... .drop_nulls(pl.col("total"))
... .with_columns(pl.col("tip").fill_null(0))
... .filter(pl.col("tip").is_null())
... ).collect()
shape: (0, 7)
┌───────────┬───────┬─────┬────────┬────────┬─────┬──────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪═════╪════════╪════════╪═════╪══════╡
└───────────┴───────┴─────┴────────┴────────┴─────┴──────┘
You add another instruction to your LazyFrame, this time telling it to fill any existing null values in the tip column with the number 0. You apply the instruction to the tip column within the .with_columns() context, specifying your exact requirements as its parameter. You use .with_columns() because you’re changing the value of the column’s data.
To verify the changes, a filter is applied once again to confirm that the tip column no longer contains any null values. As with your earlier filter, this one should also be removed in the final version of the code.
Now that you’ve learned the basics of removing and replacing null values, it’s time to explore other Polars techniques for handling missing data.
Using a More Strategic Approach
In this section, you’ll deal with the null values that exist within the time column. As you saw earlier, there are two of them. First of all, you’ll take a look at the rows you need to deal with:
>>> import polars as pl
>>> tips = pl.scan_parquet("tips.parquet")
>>> (
... tips
... .filter(
... pl.col("time").is_null()
... )
... ).collect()
shape: (2, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬──────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪══════╡
│ 3 ┆ 5.75 ┆ 1.0 ┆ Female ┆ true ┆ Fri ┆ null │
│ 15 ┆ 8.58 ┆ 1.92 ┆ Male ┆ true ┆ Fri ┆ null │
└───────────┴───────┴──────┴────────┴────────┴─────┴──────┘
Using the .filter() method, as before, reveals that the records with record_id values of 3 and 15 contain null values in their time column. To deal with these, you decide to look for clues in the surrounding data:
>>> (
... tips
... .filter(
... pl.col("record_id").is_in([2, 3, 4, 14, 15, 16])
... )
... ).collect()
shape: (6, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬────────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪════════╡
│ 2 ┆ 22.49 ┆ 3.5 ┆ Male ┆ false ┆ Fri ┆ Dinner │
│ 3 ┆ 5.75 ┆ 1.0 ┆ Female ┆ true ┆ Fri ┆ null │
│ 4 ┆ null ┆ null ┆ Male ┆ true ┆ Fri ┆ Dinner │
│ 14 ┆ 13.42 ┆ 3.48 ┆ Female ┆ true ┆ Fri ┆ Lunch │
│ 15 ┆ 8.58 ┆ 1.92 ┆ Male ┆ true ┆ Fri ┆ null │
│ 16 ┆ 15.98 ┆ 3.0 ┆ Female ┆ false ┆ Fri ┆ Lunch │
└───────────┴───────┴──────┴────────┴────────┴─────┴────────┘
You may have noticed that a pair of null values has reappeared in the total and tip columns of record 4. This is because the filter is applied to the original tips LazyFrame. This problem will sort itself out when you add the new instructions to your modified LazyFrame.
What stands out this time is that the same value surrounds each null in the time column. You decide to fill each of these null values with their surrounding value. To do this, you can make use of the strategy parameter of .fill_null():
>>> (
... tips
... .drop_nulls("total")
... .with_columns(pl.col("tip").fill_null(0))
... .with_columns(pl.col("time").fill_null(strategy="forward"))
... .filter(pl.col("record_id").is_in([3, 15]))
... ).collect()
shape: (2, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬────────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪════════╡
│ 3 ┆ 5.75 ┆ 1.0 ┆ Female ┆ true ┆ Fri ┆ Dinner │
│ 15 ┆ 8.58 ┆ 1.92 ┆ Male ┆ true ┆ Fri ┆ Lunch │
└───────────┴───────┴──────┴────────┴────────┴─────┴────────┘
Here, .with_columns() analyzes each row in the time column. When it reaches a cell whose time value is null, .fill_null() replaces the null using a forward strategy. This means it forward-fills the null with the previous non-null value, which will be the Dinner value from record 2. Similarly, record 15 has its null replaced with Lunch from record 14.
As before, you use a filter on these rows so you can see the results of your work. The forward strategy is just one of several strategies available to you.
Next, you’ll look at how to deal with null data that exists across multiple columns.
Dealing With Nulls Across Multiple Columns
Earlier, you saw that records 4 and 18 in the original tips.parquet file have null values in both the total and tip columns. You removed them by using only the null values in the total column. This worked because these were the only two records with nulls in that column.
Suppose there were other rows where total contained a null value but tip had a valid entry. These would have been removed as well, even though they may not have needed to be. It would be safer to remove only those records where the null values exist in both columns. To do this, you can make use of the all_horizontal() function:
>>> import polars as pl
>>> tips = pl.scan_parquet("tips.parquet")
>>> (
... tips
... .filter(
... pl.all_horizontal(pl.col("total", "tip").is_null())
... )
... ).collect()
shape: (2, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬────────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪════════╡
│ 4 ┆ null ┆ null ┆ Male ┆ true ┆ Fri ┆ Dinner │
│ 18 ┆ null ┆ null ┆ Female ┆ true ┆ Fri ┆ Lunch │
└───────────┴───────┴──────┴────────┴────────┴─────┴────────┘
The all_horizontal() function allows you to compute a Boolean AND horizontally across columns. You use it here to select rows that have nulls in both their total and tip columns.
By passing both the total and tip columns to pl.col() and using .is_null(), you’re effectively passing both pl.col("total").is_null() and pl.col("tip").is_null() as two separate parameters to all_horizontal(). Each parameter yields a True or False value for each row, depending on whether or not its total and tip columns contain null values.
If both parameters are True, all_horizontal() returns True and that row is included in the output. However, if either of them returns False, all_horizontal() also returns False for that row, meaning it’s ignored in the output. That’s why no other rows are present: there are no other rows with both a total of null and a tip of null.
You used all_horizontal() to produce the same two rows as before. If you want to go one step further and instruct your LazyFrame to ignore these rows, you just need to make one small change to your code:
>>> (
... tips
... .filter(
... ~pl.all_horizontal(pl.col("total", "tip").is_null())
... )
... ).collect()
shape: (178, 7)
┌───────────┬───────┬──────┬────────┬────────┬─────┬────────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ bool ┆ str ┆ str │
╞═══════════╪═══════╪══════╪════════╪════════╪═════╪════════╡
│ 1 ┆ 28.97 ┆ 3.0 ┆ Male ┆ true ┆ Fri ┆ Dinner │
│ 2 ┆ 22.49 ┆ 3.5 ┆ Male ┆ false ┆ Fri ┆ Dinner │
│ 3 ┆ 5.75 ┆ 1.0 ┆ Female ┆ true ┆ Fri ┆ null │
│ 5 ┆ 22.75 ┆ 3.25 ┆ Female ┆ false ┆ Fri ┆ Dinner │
│ 6 ┆ 40.17 ┆ 4.73 ┆ Male ┆ true ┆ Fri ┆ Dinner │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 176 ┆ 40.55 ┆ 3.0 ┆ Male ┆ true ┆ Sun ┆ Dinner │
│ 177 ┆ 20.69 ┆ 5.0 ┆ Male ┆ false ┆ Sun ┆ Dinner │
│ 178 ┆ 20.9 ┆ 3.5 ┆ Female ┆ true ┆ Sun ┆ Dinner │
│ 179 ┆ 30.46 ┆ 2.0 ┆ Male ┆ true ┆ Sun ┆ Dinner │
│ 180 ┆ 18.15 ┆ 3.5 ┆ Female ┆ true ┆ Sun ┆ Dinner │
└───────────┴───────┴──────┴────────┴────────┴─────┴────────┘
Can you spot the difference? Take a close look at the filter expression. You added the NOT operator (~) to the all_horizontal() call. This has the effect of negating it.
In the earlier version of this code, you added instructions to your LazyFrame causing it to display only two rows. This time, the NOT operator ensures that the LazyFrame displays everything except those two rows. As a result, rows where both the total and tip columns contain null values are no longer included.
You could have used this code in place of the earlier version to remove these values immediately. The full version of your null-cleansing code now looks like this:
>>> import polars as pl
>>> tips = pl.scan_parquet("tips.parquet")
>>> (
... tips
... .filter(
... ~pl.all_horizontal(pl.col("total", "tip").is_null())
... )
... .with_columns(pl.col("tip").fill_null(0))
... .with_columns(pl.col("time").fill_null(strategy="forward"))
... ).null_count().collect()
shape: (1, 7)
┌───────────┬───────┬─────┬────────┬────────┬─────┬──────┐
│ record_id ┆ total ┆ tip ┆ gender ┆ smoker ┆ day ┆ time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═══════════╪═══════╪═════╪════════╪════════╪═════╪══════╡
│ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 │
└───────────┴───────┴─────┴────────┴────────┴─────┴──────┘
In a production environment, you’d probably use .collect() to view the cleansed version of your data, or .collect().write_parquet("tips_clean.parquet") to update your original data file. Here, you use .null_count().collect() so you can confirm that the original null values no longer exist.
Note: In addition to the all_horizontal() function, which gives you AND functionality, there’s also an any_horizontal() function that performs an OR if you need it. Similarly, in addition to .is_null(), there’s also .is_not_null(). These offer you many permutations for dealing with null values across multiple columns.
Congratulations! You’ve successfully dealt with all of those annoying null values, but there are other techniques that you may find handy to know. The final method you’ll learn about for removing null values involves using the column’s data type.
Dealing With Nulls by Column Data Type
In this section, you’ll learn about Polars selectors. Selectors allow you to select columns based not only on their names but also on their data types and other properties.
To begin with, you’ll need a new LazyFrame. The code below creates one named scientists:
>>> import polars as pl
>>> import polars.selectors as cs
>>> scientists = pl.LazyFrame(
... {
... "scientist_id": [1, 2, 3, 4, 5],
... "first_name": ["Isaac", "Louis", None, "Charles", "Marie"],
... "last_name": [None, "Pasteur", "Einstein", "Darwin", "Curie"],
... "birth_year": [1642, 1822, None, 1809, 1867],
... "death_year": [1726, 1895, 1955, None, 1934],
... }
... )
>>> scientists.collect()
shape: (5, 5)
┌──────────────┬────────────┬───────────┬────────────┬────────────┐
│ scientist_id ┆ first_name ┆ last_name ┆ birth_year ┆ death_year │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 ┆ i64 │
╞══════════════╪════════════╪═══════════╪════════════╪════════════╡
│ 1 ┆ Isaac ┆ null ┆ 1642 ┆ 1726 │
│ 2 ┆ Louis ┆ Pasteur ┆ 1822 ┆ 1895 │
│ 3 ┆ null ┆ Einstein ┆ null ┆ 1955 │
│ 4 ┆ Charles ┆ Darwin ┆ 1809 ┆ null │
│ 5 ┆ Marie ┆ Curie ┆ 1867 ┆ 1934 │
└──────────────┴────────────┴───────────┴────────────┴────────────┘
The LazyFrame contains details of five renowned scientists from various disciplines. While creating the LazyFrame, several null values were included in it. To create these null values, you used None instead of normal data values.
Suppose you wanted to update the null values to “Unknown” in the columns with a data type of str, and replace null values in the i64 columns with zero. To do this, you can use Polars selectors:
>>> import polars.selectors as cs
>>> (
... scientists
... .with_columns(cs.string().fill_null("Unknown"))
... .with_columns(cs.integer().fill_null(0))
... ).collect()
shape: (5, 5)
┌──────────────┬────────────┬───────────┬────────────┬────────────┐
│ scientist_id ┆ first_name ┆ last_name ┆ birth_year ┆ death_year │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 ┆ i64 │
╞══════════════╪════════════╪═══════════╪════════════╪════════════╡
│ 1 ┆ Isaac ┆ Unknown ┆ 1642 ┆ 1726 │
│ 2 ┆ Louis ┆ Pasteur ┆ 1822 ┆ 1895 │
│ 3 ┆ Unknown ┆ Einstein ┆ 0 ┆ 1955 │
│ 4 ┆ Charles ┆ Darwin ┆ 1809 ┆ 0 │
│ 5 ┆ Marie ┆ Curie ┆ 1867 ┆ 1934 │
└──────────────┴────────────┴───────────┴────────────┴────────────┘
To use selectors, you must first import polars.selectors. It’s standard practice to do so using the alias cs. Think of this as short for column selectors.
As before, because you’re going to change columns, you once again use .with_columns(). This time, you choose each of the string columns using the string data type selector cs.string(), and each of the integer columns using cs.integer(). To update the null values within these columns, you again use .fill_null().
In this case, you replace the nulls in each of the str columns with the string Unknown, and each null in the i64 columns (or i32 columns, if there were any) with a 0. If you look closely at the output, you’ll see that the first_name and last_name columns no longer contain any null values. Instead, you’ve replaced them with Unknown. Also, the null values that were in birth_year and death_year have all been replaced with 0.
Although nulls can be a nuisance, they’re not the only values that can cause you problems. Next, you’ll learn about another annoyance you may encounter.
Dealing With Those Pesky NaNs and Infs
Sometimes, the data you’re analyzing contains the results of calculations. If incorrect data enters a calculation, erroneous results may appear. The most common examples are not-a-number values (NaN), usually caused by trying to perform a numeric calculation on non-numeric data, and infinity values (inf, -inf), which are usually caused by division by zero.
As with null data, you should try to find out why the invalid data has occurred and see if you can resolve the situation. If not, you’ll probably want to convert the offending values to null values, then use the techniques you learned earlier to remove them or replace them with something better suited to your analysis.
The code below scans the sales_trends.csv file included in your downloads into a LazyFrame named sales_trends:
>>> import polars as pl
>>> sales_trends = pl.scan_csv("sales_trends.csv")
>>> sales_trends.collect()
shape: (5, 4)
┌─────────┬───────────┬──────────────┬───────────┐
│ product ┆ last_year ┆ current_year ┆ next_year │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 │
╞═════════╪═══════════╪══════════════╪═══════════╡
│ A ┆ 17 ┆ 19 ┆ 29.0 │
│ B ┆ 35 ┆ 35 ┆ NaN │
│ C ┆ 21 ┆ 19 ┆ null │
│ D ┆ 42 ┆ 50 ┆ -inf │
│ E ┆ 23 ┆ 25 ┆ inf │
└─────────┴───────────┴──────────────┴───────────┘
This LazyFrame contains sales volumes for five products labeled A through E. The sales volumes for last_year and the current_year are complete. However, the next_year figures show some erroneous data. Before going any further with this data, you decide to replace it with null values:
>>> (
... sales_trends
... .with_columns(
... pl.col("next_year").replace(
... [float("inf"), -float("inf"), float("NaN")], None
... )
... )
... ).collect()
shape: (5, 4)
┌─────────┬───────────┬──────────────┬───────────┐
│ product ┆ last_year ┆ current_year ┆ next_year │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 │
╞═════════╪═══════════╪══════════════╪═══════════╡
│ A ┆ 17 ┆ 19 ┆ 29.0 │
│ B ┆ 35 ┆ 35 ┆ null │
│ C ┆ 21 ┆ 19 ┆ null │
│ D ┆ 42 ┆ 50 ┆ null │
│ E ┆ 23 ┆ 25 ┆ null │
└─────────┴───────────┴──────────────┴───────────┘
To replace the various offending values in the next_year column, you passed a list of them as a parameter to the .replace() method. As you can see from the output, the erroneous data is now set to null.
Now that all of the erroneous data has been replaced with null values, you can use .fill_null() to update them all using the same logic. For example, in the absence of anything better, you decide to predict next year’s data by extrapolating it from the known data:
>>> (
... sales_trends
... .with_columns(
... pl.col("next_year").replace(
... [float("inf"), -float("inf"), float("NaN")], None
... )
... )
... .with_columns(
... pl.col("next_year").fill_null(
... pl.col("current_year")
... + (pl.col("current_year") - pl.col("last_year"))
... )
... )
... ).collect()
shape: (5, 4)
┌─────────┬───────────┬──────────────┬───────────┐
│ product ┆ last_year ┆ current_year ┆ next_year │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 │
╞═════════╪═══════════╪══════════════╪═══════════╡
│ A ┆ 17 ┆ 19 ┆ 29.0 │
│ B ┆ 35 ┆ 35 ┆ 35.0 │
│ C ┆ 21 ┆ 19 ┆ 17.0 │
│ D ┆ 42 ┆ 50 ┆ 58.0 │
│ E ┆ 23 ┆ 25 ┆ 27.0 │
└─────────┴───────────┴──────────────┴───────────┘
Once the erroneous data has been replaced with null values, you add some additional instructions telling your LazyFrame how to deal with them. You replace each null with an adjusted version of the current_year figure: you add the difference between the current_year and last_year sales figures to current_year to produce a next_year estimate. Your data has now been cleansed of those pesky NaNs and infs.
It’s time to show off your prowess with a short exercise.
Practicing Your Skills
Now it’s your turn to demonstrate your skills and have some fun handling null data in Polars. After leaving the Monty Python comedy troupe, from which the Python language takes its name, John Cleese wrote the successful television show Fawlty Towers. In this section, you’ll analyze some data relating to it.
The data you’ll need for this exercise is stored within the ft_exercise.parquet file that comes with your downloadables. It contains four columns of data relating to the episodes of Fawlty Towers:
Column Name | Polars Data Type | Description |
---|---|---|
episode | Int64 | Episode number (1 to 12) |
series | Int64 | Series number (1 or 2) |
title | String | Episode title |
original_date | Date | First transmission date |
You’ve received this data, but quickly notice that it contains some null values. You use the code shown below to reveal their volume:
>>> import polars as pl
>>> episodes = pl.scan_parquet("ft_exercise.parquet")
>>> episodes.null_count().collect()
shape: (1, 4)
┌─────────┬────────┬───────┬───────────────┐
│ episode ┆ series ┆ title ┆ original_date │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═════════╪════════╪═══════╪═══════════════╡
│ 0 ┆ 2 ┆ 2 ┆ 1 │
└─────────┴────────┴───────┴───────────────┘
After performing some further research, you decide to replace these null values using the following logic:
- There are two nulls in the series column, for episodes 6 and 7. Knowing that there were twelve episodes split evenly over two series, and that the data is in series order, see if you can replace the two null values with their correct data.
- Two episode title values are missing. Episode 4 is named The Hotel Inspectors, and episode 9 is named Waldorf Salad. Replace these nulls with the correct data.
- Episode 2 is missing its original_date value. See if you can work out how to insert the correct date using the existing information in the table. You may need to research this one.
You may also need to use your investigative skills to search the documentation for clues. Remember, that’s what it’s there for. Once you think you’ve solved each part, you might like to write it out to a CSV file to check that you’re on the right track. You can do this by calling .collect().write_csv("solution.csv") on your LazyFrame.
You’ll find one possible solution in the dropdown below. There’s also an ft_exercise_solution.csv file that contains the corrected records should you need some hints.
One possible solution is shown below:
>>> import polars as pl
>>> episodes = pl.scan_parquet("ft_exercise.parquet")
>>> (
... episodes
... .with_columns(
... pl.when(pl.col("episode") == 6)
... .then(pl.col("series").fill_null(strategy="forward"))
... .otherwise(pl.col("series").fill_null(strategy="backward"))
... )
... .with_columns(
... pl.when(pl.col("episode") == 4)
... .then(pl.col("title").fill_null("The Hotel Inspectors"))
... .otherwise(pl.col("title").fill_null("Waldorf Salad"))
... )
... .with_columns(
... pl.col("original_date").interpolate()
... )
... ).null_count().collect()
shape: (1, 4)
┌─────────┬────────┬───────┬───────────────┐
│ episode ┆ series ┆ title ┆ original_date │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 │
╞═════════╪════════╪═══════╪═══════════════╡
│ 0 ┆ 0 ┆ 0 ┆ 0 │
└─────────┴────────┴───────┴───────────────┘
The `when-then-otherwise` expression is used to apply different fills using conditional logic, similar to that provided by an `if` statement used with an `else` clause.
For the `series` column, the `forward` strategy is used to fill the `null` for episode `6`. This means the previous non-null value of `1` is used as a replacement. A `backward` strategy is used for the other `series` null, relating to episode `7`, meaning the next non-null value of `2` is used as a replacement. Given that there were two series of six episodes each, these are sensible options.
For the `title` column, the `null` relating to episode `4` is replaced with `The Hotel Inspectors`, while `Waldorf Salad` replaces the `null` of episode `9`. A little research will show these to be the correct episode titles.
Did you manage to get the missing date correct? Well done if you did. The update was made by selecting the `original_date` column and calling its `.interpolate()` method. This uses linear interpolation to estimate the missing value based on the surrounding values. Incidentally, if you thought it should be one of the `strategy` parameters, you’re not the only one.
Well done if you managed to complete that exercise—and indeed, well done for making it this far!
Conclusion
In this tutorial, you’ve learned various techniques for removing or replacing `null` values in Polars.

You also learned that while there are various ways to deal with `null` values, the best approach is to avoid them in the first place. In other words, see if you can find out what caused them and take steps to prevent it from happening again. You should also see if you can figure out what’s missing and try to recover it.
Congratulations on completing this tutorial. Enjoy applying these newfound skills to better prepare your LazyFrames for analysis.
Take the Quiz: Test your knowledge with our interactive “How to Deal With Missing Data in Polars” quiz. You’ll receive a score upon completion to help you track your learning progress:
Frequently Asked Questions
Now that you have some experience with handling missing data in Polars, you can use the questions and answers below to check your understanding and recap what you’ve learned.
These FAQs are related to the most important concepts you’ve covered in this tutorial. Click the Show/Hide toggle beside each question to reveal the answer.
**How do you check for null values in Polars?**

You can check for `null` values in Polars using the `.null_count()` method, which provides a count of the nulls in each column of your dataset.
**What's the difference between NaN and null in Polars?**

`NaN` represents non-numeric values resulting from invalid calculations, while `null` indicates missing data within your dataset.
**How do you replace NaN values in Polars?**

You replace `NaN` values in Polars by converting them to nulls and then using the `.fill_null()` method to replace them with more suitable values.
**How do you fix missing data in Polars?**

You can fix missing data by identifying `null` values with `.null_count()`, then either replacing them with `.fill_null()`, removing them with `.drop_nulls()`, or handling them with another strategy.
**What's the best way to handle missing data?**

To handle missing data, recover it if possible, replace it with a suitable value, or remove it if doing so doesn’t introduce bias or other issues into your analysis.