Polars Expressions and Contexts

00:00 In the previous lesson, I introduced you to Polars DataFrames. In this lesson, I’ll get you started doing operations on your data. An expression in Polars is one or more operations that you can chain together to execute on your DataFrame.

00:15 This can be thought of as a mini-language or DSL. Expressions are encapsulated in the Expr object, which itself can contain more Expr objects.

00:26 An expression can be stored in a variable on its own, but it isn’t particularly useful until you act on some data. The data which you act upon is called a context, which I’ll get to in a second.

00:38 The big advantage of how Polars chains expressions together is that it can optimize them before executing them. For example, if you want to perform an operation conditionally, it’s likely to be faster to determine the subset of data and then perform the operation, rather than perform it on all the data and then cull it afterwards.

00:56 Polars is really, really good at this, and in fact, there’s a future lesson just on this topic. To evaluate an expression, you need a data space within the DataFrame to operate upon.

01:08 This could be the whole DataFrame, but often isn’t. There are a few common ways of getting a context. The first is the with_columns() call, which you already saw.

01:16 When you added a new column in the previous lesson, you were combining a DataFrame with a series and returning a new DataFrame, essentially using a context to create a new larger context.

01:28 select() chooses a subset of columns to operate on while filter() chooses a subset of rows. Sometimes you want to perform operations on interrelated rows, for example, counting the number of customers from each city in your database.

01:43 To do this, you call .group_by() to group the subsets together and then chain that with .agg(), which is short for aggregate. The .agg() call takes one or more operations to perform on the groups, like counting the number of things inside of it. Off to the REPL to run your first expressions. Above is the DataFrame from our previous lesson. I’ll be using it to demonstrate some expressions.

02:09 Before executing an expression, let’s start by just constructing one.

02:15 A col(), short for column, is an expression. Note that on its own, it’s kind of meaningless. It’s sort of like a tag. Although our DataFrame has a column named name, this expression is independent of it at the moment.

02:29 I use this expression within the select() context to return a subset of the DataFrame.

02:39 Here, I’ve selected just the name column. Note that what comes back is a single-column DataFrame. This is subtly different from a Series.

02:48 You can see this is a DataFrame because the shape has two values: 4, 1, whereas you’ll remember the Series in the previous lesson only had a single dimension.

02:57 Selecting a column is so common that the .select() method actually allows a shortcut.

03:06 Using a string containing the name of a column in select() gives you that column. I showed you the col() version first though, because you can get a lot fancier when you use it.

03:17 Let’s create some more complex expressions.

03:27 Here, I’ve done a couple of things. First, on the right-hand side, I’m referencing a column named height_m. Then I multiply that by 3.28. Named arguments to the .select() define a column.

03:41 The result is a new DataFrame containing the heights of the buildings in feet. There are roughly 3.28 feet in a meter. Let’s break this down a bit.

03:55 This is the same thing I passed into .select(). Multiplying a col() results in a new expression object. I can also store that away.

04:07 Then I can use that inside the .select().

04:15 This is the same result as before, but I’ve used a stored expression instead of it being in line. This means you can reuse your expressions in your code or create them dynamically if you need to. Remember, .select() returns a new DataFrame.

04:30 If I want to combine the feet data into the original frame, I could explicitly select all the existing columns, or I can use the with_columns() method instead.

04:44 You can think of with_columns() as a shortcut for .select(), so you don’t have to select all your existing columns and add a new one.

04:53 Remember when I said that .select() returns a new DataFrame? Well, all operations do that. The with_columns() method did not modify the existing buildings DataFrame.

05:03 It only output a new one.

05:15 There’s no way in Polars to do inline editing, but like I’ve done here, you can always overwrite your existing variable.

05:26 With apologies to Christopher Walken, this DataFrame needs more cowbell.
