
Working With Larger DataFrames

00:00 In the previous lesson, I showed you how to run an expression on a context to produce a new DataFrame. In this lesson, I’ll keep doing that, but with more data than a measly four rows.

00:12 So far, you’ve only been playing with toy data. It’s time to actually start doing some data processing. When you’re dealing with larger datasets, correctly choosing a context can have a big impact on performance.

00:23 If you only want to know how many feet tall the Shanghai Tower is, you don’t want to convert every building in your DataFrame. Selecting, filtering, and doing any grouping can mean far fewer calculations.

00:35 Let’s head back to the REPL, this time with some larger data. Importing the bear.

00:58 And here I’ve defined some data, which is a chart of the first 1000 squares and cubes. From a data processing perspective, this is still not a huge amount of data, but it’s more than our piddly four rows from before.

01:17 And there’s the new DataFrame. When dealing with larger sets, Polars doesn’t show you the whole thing when you display it in the REPL. Hence, the ... in the middle of the table here.

01:30 This is where the shape information at the top becomes handy. It tells you that there are 1000 rows, even though there are only 10 displayed on the screen.

01:39 The shape uses Python’s underscore numeric format, which is a nice touch. 1_000 is actually a valid Python integer. You can put underscores in a number where you would put commas to make it easier to read.

01:54 I guess that’s North American-biased. I won’t judge you if you use periods where commas should go. That’s not true. I’ll judge you; I’ll just be quiet about it.

02:04 When dealing with larger datasets, there are some methods that can be helpful for inspecting the contents. head() shows the top five rows while tail() shows the last five rows.

02:17 Alright, let’s do some data processing.

02:27 This is similar to a .select() done in the previous lesson. I’m selecting the num column as well as calculating a new column named double, which is created by multiplying the num column by two.

02:39 Note that in this .select(), I didn’t name the first column, I just selected it, while the second column I explicitly called double.

02:48 Want to see how it blows up if I hadn’t done it that way?

02:58 When an expression includes a column, Polars defaults to using the same name for the derived column. In our case, though, there were two expressions that both operated on the num column, and hence you get a duplicate error.

03:14 The default name of both of those expressions is num, so you have to rename one of them. You can do this using named arguments in the .select() like I did before, or you can also use the .alias() call as part of an expression.

03:37 This gives you the same result as before, but by calling .alias(), the name of the expression gets changed. In this particular case, it means more typing, but in other cases where you aren’t using .select(), this format means you can rename a resulting column.

03:52 You’ve seen me do simple multiplication. Polars lets you do this because it has overloaded Python’s math operations, allowing them to modify an expression.

04:01 When you multiply, you aren’t really multiplying. You’re adding a multiplication operation to the expression. This means if you want to do something like a log operation, you can’t just import math and use the functions there.

04:15 Instead, you have to use the functions built into the Polars expression mini-language.

04:24 Thankfully, most of what you need is there, including .log(). Up until now, I’ve been selecting columns. The .filter() call lets you choose a subset of rows instead.

04:40 Filter expressions work like Python comparison operations. Like with multiplication, comparison isn’t really happening here. Polars has overloaded the greater than operation to add that comparison to the expression.

04:53 If you tried to filter on something like 10 > 20, Python would convert that to False before passing it to the .filter() call.

05:01 But because you’re comparing a call expression, the correct thing happens. This is a little black magic-y, but it is how most data science libraries approach this problem.

05:11 You can chain filters with selects

05:21 or you can chain selects with filters.

05:31 Polars doesn’t care. The end result is actually the same, and under the covers Polars optimizes the query, doing its best to perform the operations in the order that produces the correct result in the fastest time.

05:43 If you want to filter between two values, you can use is_between().

05:54 This saves you from having to perform multiple filters, combining less than and greater than. Note that is_between() includes the values used as arguments, which is different from, say, the range() function or slices.

06:08 You can also specify multiple values to the filter.

06:20 This one doesn’t make a heck of a lot of sense, as I could have just changed the is_between(), but it proves my point. You can do multiple conditions.

06:36 This one is a little more realistic as it has two conditions based on two different columns, giving us all the cubes that are powers of three, where the cube root is between 10 and 50.

06:49 In the next lesson, I’ll show you how to create contexts based on subgroups of your data and do aggregate calculations across them.
