Working With LazyFrames

00:00 In the previous lesson, I showed you how to read CSV files and how to perform aggregate calculations. In this lesson, I’ll show you the ultimate Polars optimization trick, lazy evaluation.

00:12 Earlier I talked about filtering data before performing operations on it and how it can speed up your evaluation. Polars takes that even further if you wish, allowing you to chain expressions all the way back to the data read itself.

00:26 To do this, you use the DataFrame’s cousin, a LazyFrame. You still use the same contexts and expressions, but this time you chain them to the data read, meaning not all the data has to be in memory to perform the operation.

00:41 This can result in higher evaluation speeds and the ability to deal with larger datasets. Reading files this way is called scanning, and like with the regular read, scan supports a whole whack of formats.

00:54 Each call is named scan_ plus the format, similar to the read_ equivalents. One important difference, though: remember the columns argument to read_csv()?

01:04 Well, scanning doesn’t support that, but seeing as you’re doing lazy evaluation anyway, you can get the same result by chaining a select() call. One last time into the REPL.

01:14 Let’s go scan some stuff.

01:18 Importing a polar bear through customs was never this easy.

01:31 And there’s the scan equivalent of read_csv(). This time around I used the try_parse_dates argument, so I don’t have to do any of that pesky date casting.

01:41 Now let’s build a query. I’m using parentheses so that I can chain calls on separate lines for readability. First comes the frame. Instead of a DataFrame, this is a LazyFrame object, which is what scan_csv() returned.

02:00 Next, I select those columns I’m interested in.

02:06 Then like in the last lesson, I’m filtering on birth dates from the year 1776.

02:14 Then also filtering on senators, and now I’m going to do some calculations grouped by state,

02:25 counting them, excuse me, finding their length, and on second thought, I think I’ll stick to calling it counting.

02:36 Earliest birthday by state,

02:43 latest, then closing out the aggregate call, and closing the query. This is almost identical to the work done in the previous lesson, just with the additional filter by type, and of course all of this is lazy.

02:57 No evaluation has happened yet. To see what Polars is going to do, you can call the .explain() method. This is a little hard to read. Notice the \n inside.

03:09 I’m gonna switch to print() instead.

03:14 Still a bit of an eyeful, but a little better. I won’t expect you to understand this if you don’t expect me to understand this, but picking through it, you can see all the bits and pieces from our query.

03:26 Sharp eyes might notice the pi symbol. That has nothing to do with 3.14; it’s the Greek letter, which in relational algebra denotes projection (picking out columns), and relational algebra is what underlies all this fancy work.

03:38 To actually execute the query, you call its .collect() method,

03:47 And there’s the result. It’s kind of underwhelming, seeing as this is just the same kind of data as the previous lesson. Until you think about what actually happened here. Because of our filtering, Polars is able to throw out every line in the file that wasn’t a senator born in 1776.

04:05 That means the actual aggregation calculations were only done on eight rows. Without lazy evaluation, you’re reading all 11,975 rows into memory. With lazy evaluation, the rows have to be read in to be analyzed, but then they don’t have to be kept.

04:22 This is a hell of an optimization and is probably Polars’ strongest feature. It’s why I’ve switched to it from pandas personally.

04:32 Polars also lets you create a graphical representation of the information shown in the explain() call that I demonstrated in the REPL, but it requires Matplotlib and Graphviz to be installed.

04:45 Matplotlib is Python, so that’s just another pip install away. Unfortunately, Graphviz is not Python and so you’ve got to download and install it.

04:54 There are versions for Linux, Windows, and Mac, but it does mean extra stuff on your box to use this feature. And this is an example of the output. Personally, I’m not clear that it’s worth it. Once the query gets big, it starts to put ellipses in the boxes so you can’t see everything anyway, so if I really want to know what’s going on, I’m gonna stick with explain().

05:19 Polars integrates with other data science-y tools as well. You can convert to and from NumPy with the appropriately named functions. This allows you to go from a Polars DataFrame to a NumPy array and back again.

05:33 Additionally, most of the NumPy functions are supported, meaning you can use them in conjunction with your expressions. Likewise, you can also convert to and from pandas, allowing you to switch back and forth between either library’s DataFrame to do your work. Well, that’s the course. Last up, I’ll summarize what I’ve covered and point you at some places to get more information.

toigopaul on Aug. 10, 2025

It’s a bit unsettling that the 1776 senator query results don’t match mine or the supplied CSV.

shape: (5, 4)
┌───────┬───────┬────────────┬────────────┐
│ state ┆ count ┆ min_birth  ┆ max_birth  │
│ ---   ┆ ---   ┆ ---        ┆ ---        │
│ str   ┆ u32   ┆ date       ┆ date       │
╞═══════╪═══════╪════════════╪════════════╡
│ MA    ┆ 1     ┆ 1776-12-01 ┆ 1776-12-01 │
│ OH    ┆ 2     ┆ 1776-01-03 ┆ 1776-07-04 │
│ CT    ┆ 1     ┆ 1776-09-15 ┆ 1776-09-15 │
│ KY    ┆ 3     ┆ 1776-04-06 ┆ 1776-12-08 │
│ NC    ┆ 1     ┆ 1776-10-31 ┆ 1776-10-31 │
└───────┴───────┴────────────┴────────────┘

Bartosz Zaczyński RP Team on Aug. 11, 2025

@toigopaul I’m not sure where your data came from or whether you followed the lesson closely, but I’m getting exactly the same results on my end. The rows appear in different order, but the data itself matches what’s presented.

toigopaul on Aug. 11, 2025

@Bartosz Zaczyński I got the data from Supporting Material->Sample Code (.zip)

Bartosz Zaczyński RP Team on Aug. 11, 2025

@toigopaul Can you share your code? I’m pretty sure this must have been a pesky human error.

toigopaul on Aug. 11, 2025

import polars as pl

gov = pl.scan_csv("legislators.csv", try_parse_dates=True)
query = (
    gov
    .select("last_name", "type", "state", "birthday")
    .filter(pl.col("birthday").dt.year() == 1776)
    .filter(pl.col("type") == "sen")
    .group_by("state").agg(
        pl.len().alias("count"),
        pl.min("birthday").alias("min_birth"),
        pl.max("birthday").alias('max_birth')
    )
)
print(query.explain())
result = query.collect()
print(result)

With all due respect, it’s the data not the code as can be readily seen with an old-fashioned Excel filter. I’d like to add a sanity check by mentioning that California was added to the union in 1850. This means that whatever senator was born in 1776 would have to have been at least 74 when they first took office. Not impossible, but certainly improbable.

last_name  first_name  birthday    type  state
Bledsoe    Jesse       1776-04-06  sen   KY
Locke      Francis     1776-10-31  sen   NC
Logan      William     1776-12-08  sen   KY
Mills      Elijah      1776-12-01  sen   MA
Brown      Ethan       1776-07-04  sen   OH
Willey     Calvin      1776-09-15  sen   CT
Bibb       George      1776-10-30  sen   KY
Morris     Thomas      1776-01-03  sen   OH

Bartosz Zaczyński RP Team on Aug. 11, 2025

@toigopaul Thank you for sharing your code. The reason why you’re getting a different result is in this line:

-.select("last_name", "type", pl.col("state").sort(), "birthday")
+.select("last_name", "type", "state", "birthday")

In short, you didn’t call .sort() on the "state" column as shown in the video lesson. However, sorting inside select() reorders only the "state" column, breaking row alignment with the other columns. This looks to me like a mistake, so I’ll pass your comment along to the course author. Thanks again for the helpful feedback!

toigopaul on Aug. 11, 2025

Wow! Thanks for that priceless insight! There’s still the mystery, to me, of how you don’t have an issue with the varying formats of the birthday column that I commented on in the previous lesson, “Grouping and Aggregation”.

Christopher Trudeau RP Team on Aug. 11, 2025

Hey @toigopaul,

Yep, @Bartosz is correct, the sort is mucking it up. Sorry about that. To sort this kind of data, the sort should have been outside of the select, which the group_by would then ruin anyway, so it shouldn’t have been there. We’ll get to work on a correction.

As for your earlier problem with the data file, I have a guess. When I opened the CSV in Excel I noticed that some values are left aligned and some right. That is usually a hint that Excel is treating them differently. When I went to apply a filter, the dates from 1900 onward show up as little select-able trees, allowing you to select the whole year or months and days inside of them. The dates before that are showing up as full values in the selector, which makes me think Excel is treating them as text.

If I’m right, and you happened to have hit “save”, Excel would have written the things it sees as dates out in its default format for your region, whereas the “string dates” would have been left as is.

Now, why 1900 is the date boundary for Excel would be its own mystery. 1970 is a common boundary in computing, 1880 is another one, but 1900 is new to me.

Tappan Moore RP Team on Aug. 12, 2025

Hi @toigopaul – I’ve changed this lesson with the update that Chris Trudeau made, removing the sort call. Hopefully now others won’t suffer the same frustration.

toigopaul on Aug. 12, 2025

@Christopher Trudeau because Excel considers January 1, 1900 Day 1. I don’t know what January 0, 1900 is other than funny. Otherwise, I apologize for muddying the waters and appreciate the help and insight from the RP team. I’m all for letting the record show my misunderstandings and generally oppose censoring anything. That said, given the slight correction in the course I feel my comments and these responses could confuse future consumers of the course. Therefore, if someone makes a motion to delete all germane comments on this tutorial, I second.

-1  ########
0   1/0/1900
1   1/1/1900
2   1/2/1900
3   1/3/1900
4   1/4/1900
5   1/5/1900
6   1/6/1900
7   1/7/1900
8   1/8/1900
9   1/9/1900
10  1/10/1900
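For reference, the serial numbers in the table above can be mapped to real calendar dates with a small helper. This is a sketch under the assumption of Excel's default 1900 date system; the function name is made up, and it rejects serial 60 because Excel assigns that number to a nonexistent February 29, 1900.

```python
from datetime import date, timedelta


def excel_serial_to_date(serial: int) -> date:
    """Convert a whole-day serial from Excel's 1900 date system to a date.

    Hypothetical helper for illustration; it ignores time-of-day fractions.
    """
    if serial < 1:
        # Excel has no serial for dates before 1900-01-01, which is why
        # earlier dates can only be stored as text.
        raise ValueError("Excel serial dates start at 1 (1900-01-01)")
    if serial == 60:
        raise ValueError("serial 60 is Excel's nonexistent 1900-02-29")
    # Serials after 60 are shifted by one day because of that phantom date.
    offset = serial if serial < 60 else serial - 1
    return date(1899, 12, 31) + timedelta(days=offset)


print(excel_serial_to_date(1))
print(excel_serial_to_date(10))
```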
