Grouping and Aggregation
00:00 In the previous lesson, I showed you how to play with larger datasets and introduced filtering. In this lesson, I’ll show you how to read in external data and do grouping and aggregation calls.
00:11 Up until now, all the calculations you’ve done have been across the entire dataset. Sometimes though, you want to perform calculations on groups of data within your dataset.
00:21 For example, counting how many cars there are of each color. To do this takes two steps. First, you need to group the subset together, and then second, you perform aggregate calculations on the subset.
00:34 If you’re familiar with SQL, this is similar to group by operations. Let’s go play with these.
00:45 Been there, done that, and this time around I’m going to use an external dataset, a CSV file containing historical data about the elected members of the House and Senate in the US.
00:57 If you’re coding along, I’ve provided the file in the sample code available in the supporting material dropdown just below this video. Polars comes with a whole bunch of functions for reading data in a variety of formats.
01:10
I’m going to use read_csv()
to read a CSV file, but there are equivalents for reading from Excel, JSON, databases, and much more. The
01:26
read_csv()
call takes a large number of optional arguments to control how the data gets read in. Here, the only two arguments I’m using are the name of the file and a list of the columns in the file that I’m interested in.
01:39 The CSV I’m using has over 30 columns, but I don’t want all of it, so the read allows me to pare that down. The result is a DataFrame as you might expect. This time, I’ve got 11,975 rows of data, still small enough to fit in memory, but getting more like a real-world situation.
02:00 This data has both the House and the Senate in it. If I want to count the number of senators and the number of members of Congress, I need to group by the type column.
02:13
The group_by()
call returns a group_by object, which on its own isn’t helpful. What you need to do is perform an operation on the groups. You do this with a chained call to .agg()
.
02:25
.agg()
takes an expression that specifies what operation to perform.
02:35
Here, I’ve used count()
. Until writing this line of code, I didn’t realize that had been deprecated.
02:46 And there’s the new equivalent in its non-deprecated format. Both cases give you a DataFrame containing a count of the number of reps and the number of senators.
02:56
The data type of the count call is u32
, that means unsigned integer 32. As counting always results in a positive number, it’s choosing to use an unsigned int which means bigger than zero, and in this case, 32 bits.
03:11 Let’s say you want to count just those who were born in 1776. That’s a filter operation. So first you have to filter. Polars calls can be quite lengthy, so it’s often a good idea to type in bits at a time and make sure they work before chaining on other stuff and proceeding.
03:28
Let’s start by doing the filter first. The birthday column has a year, month, and day. If I want to filter on just the year, I need to use the .dt
attribute of the column, which contains date operations.
03:41
Then I use the .year
method on that, which I’ll compare to 1776.
03:53
Well, that’s no fun. What happened here? The exception is telling me that the birthday column is a string, not a date. So when I tried to use .dt
, it blew up.
04:04 Yep, it’s right. Birthday’s a string. I can’t do date comparisons on it, except the data is a date. So to proceed, you can change the data into another format.
04:16
Remember when I said read_csv()
takes a lot of arguments? Well, the correct use of one of them could have prevented this fiasco. Setting try_parse_dates=True
would’ve caused Polars to convert the birthday column to date data.
04:29 But then you wouldn’t have learned about casting, which is how you can fix this issue.
04:44
Using .cast()
in an expression causes its type to change. Of course, this will only work if Polars can do the requested conversion, but seeing as our birthday date is actually a date, no problem.
04:55 It didn’t store it away. Let me store it away.
05:03 Now that I’ve stored the casted version, let’s try that filter again.
05:13 And there you go. 38 of our Congress creators were born in 1776. Let’s do some data processing on these gentlemen. There were no ladies in office back then.
05:34
I’ve started with the same filter and then initiated a group_by()
on the state column. Now I can do multiple aggregations at once.
05:53 find the earliest birthday in the state within our filter,
06:02 and then the latest. Our resulting DataFrame shows how many are in each state, and the youngest and oldest of those born in the year 1776. You’ve seen the basics of applying expressions to your DataFrames.
Become a Member to join the conversation.