Understanding Split-Apply-Combine
00:00
In the previous lesson, I gave a first demonstration of the groupby() method. In this lesson, I’m going to dive a little deeper into how that lazy evaluation of a split-apply-combine operation takes place.
00:12
You’ve already seen the setup in the previous lesson, so I’m just going to dive back into the REPL and show you the step-by-step parts of a groupby.
00:21 Like before, I’ve grabbed the DataFrame from the legislators module, and again I’m going to group by state.
00:32
This time, I’ve stored the result away so that I can introspect it. Just as a reminder, groupby() returns a DataFrameGroupBy object. No calculations have been done yet.
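As a rough sketch of this step, assuming a tiny made-up DataFrame in place of the legislators data (the names and rows below are hypothetical), storing the result of groupby() looks something like this:

```python
import pandas as pd

# Hypothetical stand-in for the legislators DataFrame used in the lesson.
df = pd.DataFrame(
    {
        "last_name": ["Adams", "Baker", "Clark", "Davis", "Evans", "Foster"],
        "state": ["PA", "AK", "PA", "NY", "AK", "PA"],
    }
)

# Store the grouped object instead of aggregating right away.
by_state = df.groupby("state")

# The result is a DataFrameGroupBy object, not a DataFrame of results.
print(type(by_state).__name__)
```

Nothing has been counted or summed at this point. The object only describes how the rows should be split.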
00:43
Instead of invoking an aggregation method like count(), as I did in the previous lesson, I’m going to iterate on the DataFrameGroupBy object itself.
00:55
Each iteration on the by_state object yields two things: the value being grouped on and a frame. For each of these, I’m going to print out some info.
01:11
This first line will act as a demarcation header, printing the name of the state. The exclamation mark followed by r (!r) in the f-string means to use the repr() version rather than the string version of a value.
01:24 In this case, that just results in the name of the state being inside of quotes.
01:33
The second print shows the first two entries in the frame, which is the group. I’ve used the end argument to print() to double-space between the iterations.
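A loop along these lines might look like the sketch below. It assumes a small made-up DataFrame rather than the real legislators data, and the header text is only illustrative:

```python
import pandas as pd

# Hypothetical stand-in for the legislators DataFrame used in the lesson.
df = pd.DataFrame(
    {
        "last_name": ["Adams", "Baker", "Clark", "Davis", "Evans", "Foster"],
        "state": ["PA", "AK", "PA", "NY", "AK", "PA"],
    }
)
by_state = df.groupby("state")

# Each iteration yields the group key and the sub-frame for that key.
for state, frame in by_state:
    # !r asks for repr(state), so the state name is printed inside quotes.
    print(f"First 2 entries for {state!r}")
    print("------------------------")
    # end="\n\n" leaves a blank line between groups.
    print(frame.head(2), end="\n\n")
```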
01:43 Okay, let’s make this happen. Yep, that’s a lot. Let me scroll back here.
01:53 Each iteration is a grouping by state. The frame contains each row in the DataFrame that belongs to this grouping. As there’s a lot of data, I’m only showing the first two rows in the frame. For Arkansas, that’s Representatives Waskey and Cale.
02:10 Back down to the bottom,
02:15
the DataFrameGroupBy object has a groups attribute, which you can use to get at the contents of each one of those frames. Here I looked at the Pennsylvania group and it shows that there are a whole bunch of things inside of it.
02:31 Note that what it returned is an index. The DataFrame was auto-indexed, so the 4, 19 and 21, etc., correspond to the equivalent rows in the CSV. There are actually 1053 items in this index.
02:48
You can see that by the length value on the end there. So pandas prints out the first 10, then dot dot dot, then the last 10. If you want more than just the index values, you can use the get_group() method instead.
03:05
Once again, I’m looking at the Pennsylvania data, but this time you can see the whole row. Note how the first column of data here is the index into the DataFrame and the values correspond to those shown in the groups attribute above.
03:18 Like before, it’s a summary showing the first and last five. Instead of showing the length at the bottom, this shows the shape, the shape being the number of rows and columns.
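To make the difference between the two concrete, here is a minimal sketch on a made-up DataFrame (hypothetical rows, not the legislators data). The groups attribute gives back index labels, while get_group() gives back the rows themselves:

```python
import pandas as pd

# Hypothetical stand-in for the legislators DataFrame used in the lesson.
df = pd.DataFrame(
    {
        "last_name": ["Adams", "Baker", "Clark", "Davis", "Evans", "Foster"],
        "state": ["PA", "AK", "PA", "NY", "AK", "PA"],
    }
)
by_state = df.groupby("state")

# groups maps each key to the index labels of the rows in that group.
pa_index = by_state.groups["PA"]
print(list(pa_index))  # [0, 2, 5]
print(len(pa_index))   # 3

# get_group() returns the full rows for one key as a DataFrame.
pa_frame = by_state.get_group("PA")
print(pa_frame)
print(pa_frame.shape)  # (3, 2): rows, columns
```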
03:29
This call is essentially doing a .loc operation, filtering the state column for PA. Let me show you that.
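On a small made-up DataFrame (hypothetical rows standing in for the legislators data), the equivalent filter could be sketched like this:

```python
import pandas as pd

# Hypothetical stand-in for the legislators DataFrame used in the lesson.
df = pd.DataFrame(
    {
        "last_name": ["Adams", "Baker", "Clark", "Davis", "Evans", "Foster"],
        "state": ["PA", "AK", "PA", "NY", "AK", "PA"],
    }
)

# Boolean-mask filtering with .loc: keep only rows where state is "PA".
pa_rows = df.loc[df["state"] == "PA"]
print(pa_rows)

# Compare against the frame that the grouped object hands back for "PA".
by_state = df.groupby("state")
print(pa_rows.equals(by_state.get_group("PA")))
```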
03:41
See, same results. If you only want the PA data and none of the other groups, this would be the faster way to do it. Let’s dig in a bit more by examining the first iteration of that for loop.
03:53
Instead of using for, I’m going to create an iterator from the by_state object with iter() and call next() on it to get the first iteration. This is actually what the for loop does behind the covers.
04:07
Now, state and frame contain the same things they did on the first iteration through the for loop from before. The first state is Arkansas, and the frame contains a bunch of stuff, so I’ve only printed the first three rows. There are Waskey and Cale, which you may remember from the for loop, this time joined by Grigsby.
04:29 The frame itself is a DataFrame so you can get at its parts. These are all the Congress members from Arkansas in our data. So far, all that has been done is splitting.
04:41 So let’s apply the aggregate.
04:48
Calling count() on the frame counts the number of things inside of it. Think back to the previous lesson, where I grouped by state, accessed the last_name column, and then aggregated with count().
04:59 That’s what I’ve just deconstructed. Now all that’s left is to do the same for the other 49 states, and you’ve got your data ready to be combined.
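Putting the three stages together by hand might look like the sketch below. It uses a made-up DataFrame with hypothetical rows, and the manual loop is only an illustration of the idea, not how pandas implements it internally:

```python
import pandas as pd

# Hypothetical stand-in for the legislators DataFrame used in the lesson.
df = pd.DataFrame(
    {
        "last_name": ["Adams", "Baker", "Clark", "Davis", "Evans", "Foster"],
        "state": ["PA", "AK", "PA", "NY", "AK", "PA"],
    }
)

# Split: iterate over the groups. Apply: count last_name in each frame.
counts = {}
for state, frame in df.groupby("state"):
    counts[state] = frame["last_name"].count()

# Combine: gather the per-group results into a single Series.
manual = pd.Series(counts, name="last_name")
print(manual)

# The one-line version from the previous lesson does all three steps.
builtin = df.groupby("state")["last_name"].count()
print(builtin)
```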
05:16 So you’ve come full circle. There you have it. The first count in the grouping is Arkansas 16. So far you’ve only grouped on columns. pandas allows you to do much more than that though. That’s next.