Working With LazyFrames

Christopher Trudeau

Working With Python Polars Christopher Trudeau 06:05

00:00 In the previous lesson, I showed you how to read CSV files and how to perform aggregate calculations. In this lesson, I’ll show you the ultimate Polars optimization trick, lazy evaluation.

00:12 Earlier I talked about filtering data before performing operations on it and how that can speed up your evaluation. Polars lets you take that even further, allowing you to chain expressions onto the data read itself.

00:26 To do this, you use the DataFrame’s cousin, a LazyFrame. You still use the same contexts and expressions, but this time you chain them to the data read, meaning not all the data has to be in memory to perform the operation.

00:41 This can result in higher evaluation speeds and the ability to deal with larger datasets. Reading files this way is called scanning, and like with the regular read, scan supports a whole whack of formats.

00:54 Each call is named scan_ followed by the format, similar to the read equivalents. One important difference, though: remember the columns argument to read_csv()?

01:04 Well, scanning doesn’t support that, but seeing as you’re doing lazy evaluation anyway, you can get the same result by chaining a select() call. One last time into the REPL.

01:14 Let’s go scan some stuff.

01:18 Importing a polar bear through customs was never this easy.

01:31 And there’s the scan equivalent of read_csv(). This time around I used the try_parse_dates argument, so I don’t have to do any of that pesky date casting.

01:41 Now let’s build a query. I’m using parentheses so that I can chain calls on separate lines for readability. First comes the frame. Instead of a DataFrame, this is a LazyFrame object, which was returned by scan_csv().

02:00 Next, I select those columns I’m interested in. I’ve also added one new thing here, this .sort() expression on the state column. Does exactly what you think it might.

02:13 Then like in the last lesson, I’m filtering on birth dates from the year 1776.

02:22 Then also filtering on senators, and now I’m going to do some calculations grouped by state,

02:33 counting them, excuse me, finding their length, on second thought, I think I’ll stick to calling it counting.

02:43 Earliest birthday by state,

02:50 latest, then closing out the aggregate call, and closing the query. This is almost identical to the work done in the previous lesson, just with the additional filter by type and of course all of this is lazy.

03:04 No evaluation has happened yet. To see what Polars is going to do, you can call the .explain() method. This is a little hard to read. Notice the \n newline escapes inside.

03:15 I’m going to switch to print() instead.

03:20 Still a bit of an eyeful, but a little better. I won’t expect you to understand this if you don’t expect me to understand this, but picking through it, you can see all the bits and pieces from our query.

03:33 Sharp eyes might notice the pi symbol (π). It has nothing to do with 3.14 here. It’s the Greek letter that relational algebra uses for projection, meaning the selection of columns, and relational algebra is what underlies all this fancy work.

03:45 To actually execute the query, you call its .collect() method,

03:54 and there’s the result. It’s kind of underwhelming seeing as this is just the same kind of data as the previous lesson until you think about what actually happened here.

04:04 Because of our filtering, Polars is able to throw out every line in the file that wasn’t a senator born in 1776. That means the actual aggregation calculations were only done on eight rows. Without lazy evaluation, you’re reading all 11,975 rows into memory.

04:23 With lazy evaluation, the rows have to be read in to be analyzed, but then they don’t have to be kept. This is a hell of an optimization and is probably Polars’ strongest feature.

04:34 It’s why I’ve switched to it from pandas personally.

04:39 Polars also lets you create a graphical representation of the information shown in the explain() call that I demonstrated in the REPL, but it requires Matplotlib and Graphviz to be installed.

04:52 Matplotlib is Python, so that’s just another pip install away. Unfortunately, Graphviz is not Python and so you’ve got to download and install it.

05:00 There are versions for Linux, Windows, and Mac, but it does mean extra stuff on your box to use this feature. And this is an example of the output. Personally, I’m not convinced it’s worth it. Once the query gets big, it starts to put ellipses in the boxes, so you can’t see everything anyway. If I really want to know what’s going on, I’m going to stick with explain().

05:26 Polars integrates with other data science-y tools as well. You can convert to and from NumPy with the appropriately named functions. This allows you to go from a Polars DataFrame to a NumPy array and back again.

05:40 Additionally, most of the NumPy functions are supported, meaning you can use them in conjunction with your expressions. Likewise, you can also convert to and from pandas, allowing you to switch back and forth between either library’s DataFrame to do your work. Well, that’s the course. Last up, I’ll summarize what I’ve covered and point you at some places to get more information.
