Grouping Data
00:00 In the previous lesson, I wrapped up my review of pandas by showing you how to access different parts of a DataFrame. This lesson starts what you came here for, learning how to group data in pandas.
00:12 Consider this table of data. If you wanted to count how many people liked apples using a Python program, you would loop over the data and increment an apple-counting tracker, or you might get fancier and use the Counter object from the collections library.
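As a rough sketch of those two plain-Python approaches, assuming a small made-up table of people and their favorite fruit (the names and rows below are hypothetical, not the data shown in the lesson):

```python
from collections import Counter

# Hypothetical survey rows: (person, favorite fruit). Made up for illustration.
rows = [
    ("Ann", "Apple"),
    ("Bob", "Pear"),
    ("Cal", "Apple"),
    ("Dee", "Pear"),
    ("Eve", "Apple"),
]

# Approach 1: loop over the data and increment a tracker.
apple_count = 0
for _person, fruit in rows:
    if fruit == "Apple":
        apple_count += 1

# Approach 2: let Counter tally every fruit in one pass.
fruit_counts = Counter(fruit for _person, fruit in rows)

print(apple_count)            # 3
print(fruit_counts["Apple"])  # 3
```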
00:27 If this data were in a pandas DataFrame, you could do the same thing, but it wouldn’t be the fastest way of accomplishing the task. pandas has been built for speed, which makes the slow-moving fuzzy animal it’s named after a little bit ironic. Most of pandas, the library that is, isn’t written in Python, but in a lower-level language that is closer to your hardware, thus giving better performance.
00:52 So instead of looping through all the rows, you could first group the rows together based on an attribute. In this example, I’m grouping on the Fruit column, and then I could count what was in each group, getting all of the totals, or just the ones that matched Apple, if I preferred.
01:07 The process of grouping to do calculations is called the split-apply-combine pattern. It’s a three-step process where the first step is to split your data into groups.
01:21 The second step is then to apply some function to the groups, and then finally combine the resulting data into a structure. Approaching the fruit table from before, you split the rows based on the type of fruit, apples or pears, apply the count function to the items in each group, and then get a combined DataFrame containing the names of the fruits as indices and the fruit counts as a column.
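The three steps might look something like this in pandas. This is a minimal sketch that assumes a hypothetical fruit DataFrame with Name and Fruit columns; the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical fruit table; the column names and rows are assumptions.
fruit = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cal", "Dee", "Eve"],
    "Fruit": ["Apple", "Pear", "Apple", "Pear", "Apple"],
})

# Split on the Fruit column, apply count to each group, and let pandas
# combine the results into a DataFrame indexed by fruit name.
counts = fruit.groupby("Fruit").count()
print(counts)

# Pull out just the apple total from the combined result.
apple_total = counts.loc["Apple", "Name"]
print(apple_total)  # 3
```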
01:49 When you perform a GroupBy split on a DataFrame, pandas uses the criteria you give it to create a series of subsets from the data.
01:58 You can split based on the name of a column, which is by far the most common way. This is like using Fruit in the example I just mentioned. Or you can split based on a list of column names.
02:10 Say our fruit example had another column called Variety. This would then treat Apple Granny Smith as one group and Apple Delicious as another.
02:20 In fact, the column names are just shortcuts for actual columns. You can use column accessors to pass an actual column in as your grouping mechanism.
02:30 Additionally, you can pass in a function, which gets called on each row’s index label, and then grouping is based on what gets returned. Or, you can pass in a Series that maps row index labels to group names.
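The different kinds of splitters could be sketched as follows. The fruit DataFrame, its row labels, the grouping function, and the mapping Series are all hypothetical and exist only to show each option side by side.

```python
import pandas as pd

# Hypothetical data with explicit row labels so the last two splitters
# have something meaningful to work with.
fruit = pd.DataFrame(
    {
        "Fruit": ["Apple", "Apple", "Pear", "Apple"],
        "Variety": ["Granny Smith", "Delicious", "Bartlett", "Granny Smith"],
    },
    index=["r1", "r2", "r3", "r4"],
)

# 1. A column name.
by_name = fruit.groupby("Fruit").size()

# 2. A list of column names: each Fruit/Variety pair is its own group.
by_list = fruit.groupby(["Fruit", "Variety"]).size()

# 3. An actual column, fetched with a column accessor.
by_column = fruit.groupby(fruit["Fruit"]).size()

# 4. A function: pandas calls it with each index label.
def half(label):
    return "early" if label in ("r1", "r2") else "late"

by_func = fruit.groupby(half).size()

# 5. A Series that maps row index labels to group names.
mapping = pd.Series({"r1": "A", "r2": "B", "r3": "A", "r4": "B"})
by_series = fruit.groupby(mapping).size()

print(by_list)
```

Options 1 and 3 produce the same groups here, which is the sense in which a column name is a shortcut for the column itself.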
02:41 Later in the course, you’ll see how carefully selecting the right Series (remember, that’s what all columns are) can make a performance difference in your calculation.
02:50 Once you’ve got things grouped, the applying phase performs some action on the groups.
02:58 pandas breaks the kinds of functions that you can perform in the applying stage into three types. Aggregates compute a value based on the data in a group.
03:07 These are things like counting the number of items in a group, summing the values in a group, or calculating their average. The next set of functions are transforms.
03:16 These use the values in a group to perform a calculation. For example, cumulative sum adds together the values within each group.
03:24 Back to the fruit example, still with the Variety column, now let’s add a Price column. If you grouped on Fruit, you would see a row for Granny Smith with its price and then Delicious with its price, and you could cumulative sum across this.
03:39 It’s a bit of a contrived example. I don’t know why you would do that with apples, but hopefully you get the idea.
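To make the difference between an aggregate and a transform concrete, here is a small sketch. The varieties and prices are made-up values, not data from the lesson.

```python
import pandas as pd

# Hypothetical fruit table with invented prices.
fruit = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Pear", "Pear"],
    "Variety": ["Granny Smith", "Delicious", "Bartlett", "Bosc"],
    "Price": [3, 2, 4, 1],
})

grouped = fruit.groupby("Fruit")

# Aggregate: one value per group.
totals = grouped["Price"].sum()

# Transform: one value per original row, computed within each group.
running = grouped["Price"].cumsum()

print(totals)   # Apple 5, Pear 5
print(running)  # 3, 5, 4, 5
```

The aggregate collapses each group to a single row, while the transform keeps the original row count and restarts the running total at each new fruit.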
03:46 The third set of functions are filters. These chop up the groups, dropping some of the data. You’ve already seen head applied to the whole DataFrame. Well, you can run head on a grouping as well, giving just, say, the first five rows of each group.
03:59 Rows within each group keep the order they had in the DataFrame. So if you sort by score first, your first five could be the five rows with the highest score in each group.
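As a hedged illustration of head on a grouping: in pandas, calling head(n) on a GroupBy object returns the first n rows of each group, in the order those rows appear in the DataFrame. The scores table below is hypothetical.

```python
import pandas as pd

# Hypothetical scores; team names and values are invented.
scores = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "red", "blue"],
    "score": [10, 40, 30, 20, 50, 60],
})

# Sort first so that "first rows" means "highest scores",
# then keep the first two rows of each team.
top_two = (
    scores.sort_values("score", ascending=False)
    .groupby("team")
    .head(2)
)

print(top_two)
```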
04:09 For a full list of functions available, see the GroupBy page in the pandas docs.
04:14 Although the pattern is split, apply, then combine, I don’t have much to say about combine. pandas just does it for you. The return result of the apply stage is the combined data, typically as a DataFrame.
04:30 So now that you’ve got the concept, let’s see how it actually happens in code. To perform a grouping operation on a DataFrame, you use its groupby() method.
04:40 The value you pass to the method is the splitter. In this case, I’m splitting on the “kind” column in the animals DataFrame. This operation returns an object, more on that in a second, and then the apply functions get called on that object.
04:56 Here I’m using the sum function, which in this case gives the total height and total weight for the rows whose kind column is cat, and then for the rows whose kind column is dog.
05:09 Finally, the combine portion of the operation is done for you when pandas returns a DataFrame containing the result.
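A sketch of what that call could look like. The column names kind, height, and weight come from the description above, but the rows and their values are assumptions, since the on-screen data is not part of the transcript.

```python
import pandas as pd

# Assumed contents for the animals DataFrame; the numbers are invented.
animals = pd.DataFrame({
    "kind": ["cat", "dog", "cat", "dog"],
    "height": [9, 6, 10, 34],
    "weight": [8, 7, 10, 198],
})

# Split on "kind", apply sum, and get back a combined DataFrame
# with one row per kind.
totals = animals.groupby("kind").sum()
print(totals)
```

With these made-up numbers, the cat row holds a height of 19 and a weight of 18, and the dog row holds 40 and 205.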
05:17 As I mentioned a moment ago, calling the groupby() method returns an object. For DataFrames, this is the DataFrameGroupBy object. This object is the one that contains the apply methods.
05:29 This kind of method chaining allows for the calculation to be lazy. This means that nothing gets calculated until you go to use it. By using it, I mean accessing some of its results, or in the case of the REPL, printing it out.
05:44 A big advantage of this laziness is that you can chain things together at almost no cost, and you’re pushing all the calculations down into pandas rather than doing it at the Python level.
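To see the intermediate object for yourself, you could hold on to the result of groupby() before calling an apply method. This sketch reuses an assumed animals DataFrame with invented values.

```python
import pandas as pd

# Assumed animals data; the values are invented for illustration.
animals = pd.DataFrame({
    "kind": ["cat", "dog", "cat", "dog"],
    "height": [9, 6, 10, 34],
    "weight": [8, 7, 10, 198],
})

# groupby() hands back a grouping object rather than a result table.
grouped = animals.groupby("kind")
print(type(grouped).__name__)  # DataFrameGroupBy

# Results only appear once an apply method is called on that object.
means = grouped.mean()
print(means)
```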
05:54 This keeps you speedy. pandas is a fast little bamboo-chewing, bichromatic fuzzball.