Watch Now This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: pandas GroupBy: Grouping Real World Data in Python
Whether you’ve just started working with pandas and want to master one of its core capabilities, or you’re looking to fill in some gaps in your understanding about .groupby(), this tutorial will help you break down and visualize a pandas GroupBy operation from start to finish.
This tutorial is meant to complement the official pandas documentation and the pandas Cookbook, where you’ll see self-contained, bite-sized examples. Here, however, you’ll focus on three more involved walkthroughs that use real-world datasets.
In this tutorial, you’ll cover:
- How to use pandas GroupBy operations on real-world data
- How the split-apply-combine chain of operations works
- How to decompose the split-apply-combine chain into steps
- How to categorize methods of a pandas GroupBy object based on their intent and result
This tutorial assumes that you have some experience with pandas itself, including how to read CSV files into memory as pandas objects with read_csv(). If you need a refresher, then check out Reading CSVs With pandas and pandas: How to Read and Write Files.
You can download the source code for all the examples in this tutorial by clicking on the link below:
Download Datasets: Click here to download the datasets that you’ll use to learn about pandas’ GroupBy in this tutorial.
Prerequisites
Before you proceed, make sure that you have the latest version of pandas available within a new virtual environment:
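The install commands themselves aren’t shown above. A typical setup, assuming Python 3 on a Unix-like shell (on Windows the activation command differs), might look like this:

```shell
$ python -m venv venv
$ source venv/bin/activate
(venv) $ python -m pip install pandas
```

This creates and activates a fresh virtual environment, then installs the latest pandas release into it.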
In this tutorial, you’ll focus on three datasets:
- The U.S. Congress dataset contains public information on historical members of Congress and illustrates several fundamental capabilities of .groupby().
- The air quality dataset contains periodic gas sensor readings. This will allow you to work with floats and time series data.
- The news aggregator dataset holds metadata on several hundred thousand news articles. You’ll be working with strings and doing text munging with .groupby().
Once you’ve downloaded the .zip file, unzip it to a folder called groupby-data/ in your current directory. Before you read on, ensure that your directory tree looks like this:
./
│
└── groupby-data/
│
├── legislators-historical.csv
├── airqual.csv
└── news.csv
With pandas installed, your virtual environment activated, and the datasets downloaded, you’re ready to jump in!
Example 1: U.S. Congress Dataset
You’ll jump right into things by dissecting a dataset of historical members of Congress. You can read the CSV file into a pandas DataFrame with read_csv():
# pandas_legislators.py

import pandas as pd

dtypes = {
    "first_name": "category",
    "gender": "category",
    "type": "category",
    "state": "category",
    "party": "category",
}
df = pd.read_csv(
    "groupby-data/legislators-historical.csv",
    dtype=dtypes,
    usecols=list(dtypes) + ["birthday", "last_name"],
    parse_dates=["birthday"],
)
The dataset contains members’ first and last names, birthday, gender, type ("rep" for House of Representatives or "sen" for Senate), U.S. state, and political party. You can use df.tail() to view the last few rows of the dataset:
>>> from pandas_legislators import df
>>> df.tail()
last_name first_name birthday gender type state party
11970 Garrett Thomas 1972-03-27 M rep VA Republican
11971 Handel Karen 1962-04-18 F rep GA Republican
11972 Jones Brenda 1959-10-24 F rep MI Democrat
11973 Marino Tom 1952-08-15 M rep PA Republican
11974 Jones Walter 1943-02-10 M rep NC Republican
The DataFrame uses categorical dtypes for space efficiency:
>>> df.dtypes
last_name object
first_name category
birthday datetime64[ns]
gender category
type category
state category
party category
dtype: object
You can see that most columns of the dataset have the type category, which reduces the memory load on your machine.
The Hello, World! of pandas GroupBy
Now that you’re familiar with the dataset, you’ll start with a Hello, World! for the pandas GroupBy operation. What is the count of Congressional members, on a state-by-state basis, over the entire history of the dataset? In SQL, you could find this answer with a SELECT statement:
SELECT state, count(name)
FROM df
GROUP BY state
ORDER BY state;
Here’s the near-equivalent in pandas:
>>> n_by_state = df.groupby("state")["last_name"].count()
>>> n_by_state.head(10)
state
AK 16
AL 206
AR 117
AS 2
AZ 48
CA 361
CO 90
CT 240
DC 2
DE 97
Name: last_name, dtype: int64
You call .groupby() and pass the name of the column that you want to group on, which is "state". Then, you use ["last_name"] to specify the column on which you want to perform the actual aggregation.
You can pass a lot more than just a single column name to .groupby() as the first argument. You can also specify any of the following:
- A list of multiple column names
- A dict or pandas Series
- A NumPy array or pandas Index, or an array-like iterable of these
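For instance, grouping with a dict works by mapping index labels to group names. Here’s a minimal sketch on a toy DataFrame (hypothetical data, not from the Congress dataset):

```python
import pandas as pd

# Toy frame: the dict maps each index label to a group name
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])
mapping = {"a": "first", "b": "first", "c": "second", "d": "second"}

# Rows "a" and "b" land in group "first", "c" and "d" in "second"
print(df.groupby(mapping)["value"].sum())
```

A pandas Series with those same index labels and values would behave identically as the grouping key.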
Here’s an example of grouping jointly on two columns, which finds the count of Congressional members broken out by state and then by gender:
>>> df.groupby(["state", "gender"])["last_name"].count()
state gender
AK F 0
M 16
AL F 3
M 203
AR F 5
...
WI M 196
WV F 1
M 119
WY F 2
M 38
Name: last_name, Length: 116, dtype: int64
The analogous SQL query would look like this:
SELECT state, gender, count(name)
FROM df
GROUP BY state, gender
ORDER BY state, gender;
As you’ll see next, .groupby() and the comparable SQL statements are close cousins, but they’re often not functionally identical.
pandas GroupBy vs SQL
This is a good time to introduce one prominent difference between the pandas GroupBy operation and the SQL query above. The result set of the SQL query contains three columns:
- state
- gender
- count
In the pandas version, the grouped-on columns are pushed into the MultiIndex of the resulting Series by default:
>>> n_by_state_gender = df.groupby(["state", "gender"])["last_name"].count()
>>> type(n_by_state_gender)
<class 'pandas.core.series.Series'>
>>> n_by_state_gender.index[:5]
MultiIndex([('AK', 'M'),
('AL', 'F'),
('AL', 'M'),
('AR', 'F'),
('AR', 'M')],
names=['state', 'gender'])
To more closely emulate the SQL result and push the grouped-on columns back into columns in the result, you can use as_index=False:
>>> df.groupby(["state", "gender"], as_index=False)["last_name"].count()
state gender last_name
0 AK F 0
1 AK M 16
2 AL F 3
3 AL M 203
4 AR F 5
.. ... ... ...
111 WI M 196
112 WV F 1
113 WV M 119
114 WY F 2
115 WY M 38
[116 rows x 3 columns]
This produces a DataFrame with three columns and a RangeIndex, rather than a Series with a MultiIndex. In short, using as_index=False will make your result more closely mimic the default SQL output for a similar operation.
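An alternative, if you already have the MultiIndex result in hand, is to call .reset_index() on it after the fact. A sketch on a toy frame (hypothetical rows, not the real dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["AK", "AK", "AL"],
    "gender": ["M", "M", "F"],
    "last_name": ["Waskey", "Cale", "Crowell"],
})

# Same flat, SQL-like shape as as_index=False, applied after aggregation
res = df.groupby(["state", "gender"])["last_name"].count().reset_index()
print(res.columns.tolist())
```

Both routes end with the grouped-on keys as ordinary columns; as_index=False just avoids building the MultiIndex in the first place.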
Note: In df.groupby(["state", "gender"])["last_name"].count(), you could also use .size() instead of .count(), since you know that there are no NaN last names. Using .count() excludes NaN values, while .size() includes everything, NaN or not.
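To see that difference in isolation, here’s a sketch with a toy frame containing one missing last name (hypothetical data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "state": ["AK", "AK", "AL"],
    "last_name": ["Waskey", np.nan, "Crowell"],
})

print(df.groupby("state")["last_name"].count())  # AK counts 1: NaN excluded
print(df.groupby("state")["last_name"].size())   # AK counts 2: NaN included
```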
Also note that the SQL queries above explicitly use ORDER BY, whereas .groupby() does not. That’s because .groupby() does this by default through its parameter sort, which is True unless you tell it otherwise:
>>> # Don't sort results by the sort keys
>>> df.groupby("state", sort=False)["last_name"].count()
state
DE 97
VA 432
SC 251
MD 305
PA 1053
...
AK 16
PI 13
VI 4
GU 4
AS 2
Name: last_name, dtype: int64
Next, you’ll dive into the object that .groupby() actually produces.
How pandas GroupBy Works
Before you get any further into the details, take a step back to look at .groupby() itself:
>>> by_state = df.groupby("state")
>>> print(by_state)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x107293278>
What is DataFrameGroupBy? Its .__str__() value that the print function shows doesn’t give you much information about what it actually is or how it works. The reason that a DataFrameGroupBy object can be difficult to wrap your head around is that it’s lazy in nature. It doesn’t really do any operations to produce a useful result until you tell it to.
Note: In this tutorial, the generic term pandas GroupBy object refers to both DataFrameGroupBy and SeriesGroupBy objects, which have a lot in common.
One term that’s frequently used alongside .groupby() is split-apply-combine. This refers to a chain of three steps:
- Split a table into groups.
- Apply some operations to each of those smaller tables.
- Combine the results.
It can be difficult to inspect df.groupby("state") because it does virtually none of these things until you do something with the resulting object. A pandas GroupBy object delays virtually every part of the split-apply-combine process until you invoke a method on it.
So, how can you mentally separate the split, apply, and combine stages if you can’t see any of them happening in isolation? One useful way to inspect a pandas GroupBy object and see the splitting in action is to iterate over it:
>>> for state, frame in by_state:
... print(f"First 2 entries for {state!r}")
... print("------------------------")
... print(frame.head(2), end="\n\n")
...
First 2 entries for 'AK'
------------------------
last_name first_name birthday gender type state party
6619 Waskey Frank 1875-04-20 M rep AK Democrat
6647 Cale Thomas 1848-09-17 M rep AK Independent
First 2 entries for 'AL'
------------------------
last_name first_name birthday gender type state party
912 Crowell John 1780-09-18 M rep AL Republican
991 Walker John 1783-08-12 M sen AL Republican
If you’re working on a challenging aggregation problem, then iterating over the pandas GroupBy object can be a great way to visualize the split part of split-apply-combine.
There are a few other methods and properties that let you look into the individual groups and their splits. The .groups attribute will give you a dictionary of {group name: group labels} pairs. For example, by_state.groups is a dict with states as keys. Here’s the value for the "PA" key:
>>> by_state.groups["PA"]
Int64Index([ 4, 19, 21, 27, 38, 57, 69, 76, 84,
88,
...
11842, 11866, 11875, 11877, 11887, 11891, 11932, 11945, 11959,
11973],
dtype='int64', length=1053)
Each value is a sequence of the index locations for the rows belonging to that particular group. In the output above, 4, 19, and 21 are the first indices in df at which the state equals "PA".
You can also use .get_group() as a way to drill down to the sub-table from a single group:
>>> by_state.get_group("PA")
last_name first_name birthday gender type state party
4 Clymer George 1739-03-16 M rep PA NaN
19 Maclay William 1737-07-20 M sen PA Anti-Administration
21 Morris Robert 1734-01-20 M sen PA Pro-Administration
27 Wynkoop Henry 1737-03-02 M rep PA NaN
38 Jacobs Israel 1726-06-09 M rep PA NaN
... ... ... ... ... ... ... ...
11891 Brady Robert 1945-04-07 M rep PA Democrat
11932 Shuster Bill 1961-01-10 M rep PA Republican
11945 Rothfus Keith 1962-04-25 M rep PA Republican
11959 Costello Ryan 1976-09-07 M rep PA Republican
11973 Marino Tom 1952-08-15 M rep PA Republican
This is virtually equivalent to using .loc[]. You could get the same output with something like df.loc[df["state"] == "PA"].
It’s also worth mentioning that .groupby() does do some, but not all, of the splitting work by building a Grouping class instance for each key that you pass. However, many of the methods of the BaseGrouper class that holds these groupings are called lazily rather than at .__init__(), and many also use a cached property design.
Next, what about the apply part? You can think of this step of the process as applying the same operation (or callable) to every sub-table that the splitting stage produces.
From the pandas GroupBy object by_state, you can grab the initial U.S. state and DataFrame with next(). When you iterate over a pandas GroupBy object, you’ll get pairs that you can unpack into two variables:
>>> state, frame = next(iter(by_state)) # First tuple from iterator
>>> state
'AK'
>>> frame.head(3)
last_name first_name birthday gender type state party
6619 Waskey Frank 1875-04-20 M rep AK Democrat
6647 Cale Thomas 1848-09-17 M rep AK Independent
7442 Grigsby George 1874-12-02 M rep AK NaN
Now, think back to your original, full operation:
>>> df.groupby("state")["last_name"].count()
state
AK 16
AL 206
AR 117
AS 2
AZ 48
...
The apply stage, when applied to your single, subsetted DataFrame, would look like this:
>>> frame["last_name"].count() # Count for state == 'AK'
16
You can see that the result, 16, matches the value for AK in the combined result.
The last step, combine, takes the results of all of the applied operations on all of the sub-tables and combines them back together in an intuitive way.
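To make the three stages concrete, here’s a sketch that performs the whole chain by hand on a toy frame (hypothetical rows); the hand-built Series matches what a single .groupby() call returns:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["AK", "AL", "AL"],
    "last_name": ["Waskey", "Crowell", "Walker"],
})

# Split (iterate the groups), apply (.count() per sub-table),
# and combine (collect the per-group results into one Series)
counts = pd.Series(
    {state: frame["last_name"].count() for state, frame in df.groupby("state")},
    name="last_name",
)
print(counts)
```

This is only for intuition: pandas performs the real combine step internally, and far more efficiently than a Python loop.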
Read on to explore more examples of the split-apply-combine process.
Example 2: Air Quality Dataset
The air quality dataset contains hourly readings from a gas sensor device in Italy. Missing values are denoted with -200 in the CSV file. You can use read_csv() to combine two columns into a timestamp while using a subset of the other columns:
# pandas_airqual.py

import pandas as pd

df = pd.read_csv(
    "groupby-data/airqual.csv",
    parse_dates=[["Date", "Time"]],
    na_values=[-200],
    usecols=["Date", "Time", "CO(GT)", "T", "RH", "AH"],
).rename(
    columns={
        "CO(GT)": "co",
        "Date_Time": "tstamp",
        "T": "temp_c",
        "RH": "rel_hum",
        "AH": "abs_hum",
    }
).set_index("tstamp")
This produces a DataFrame with a DatetimeIndex and four float columns:
>>> from pandas_airqual import df
>>> df.head()
co temp_c rel_hum abs_hum
tstamp
2004-03-10 18:00:00 2.6 13.6 48.9 0.758
2004-03-10 19:00:00 2.0 13.3 47.7 0.726
2004-03-10 20:00:00 2.2 11.9 54.0 0.750
2004-03-10 21:00:00 2.2 11.0 60.0 0.787
2004-03-10 22:00:00 1.6 11.2 59.6 0.789
Here, co is that hour’s average carbon monoxide reading, while temp_c, rel_hum, and abs_hum are the average Celsius temperature, relative humidity, and absolute humidity over that hour, respectively. The observations run from March 2004 through April 2005:
>>> df.index.min()
Timestamp('2004-03-10 18:00:00')
>>> df.index.max()
Timestamp('2005-04-04 14:00:00')
So far, you’ve grouped on columns by specifying their names as str, such as df.groupby("state"). But .groupby() is a whole lot more flexible than this! You’ll see how next.
Grouping on Derived Arrays
Earlier you saw that the first parameter to .groupby() can accept several different arguments:
- A column or list of columns
- A dict or pandas Series
- A NumPy array or pandas Index, or an array-like iterable of these
You can take advantage of the last option in order to group by the day of the week. Use the index’s .day_name() to produce a pandas Index of strings. Here are the first ten observations:
>>> day_names = df.index.day_name()
>>> type(day_names)
<class 'pandas.core.indexes.base.Index'>
>>> day_names[:10]
Index(['Wednesday', 'Wednesday', 'Wednesday', 'Wednesday', 'Wednesday',
'Wednesday', 'Thursday', 'Thursday', 'Thursday', 'Thursday'],
dtype='object', name='tstamp')
You can then take this object and use it as the .groupby() key. In pandas, day_names is array-like. It’s a one-dimensional sequence of labels.
Note: For a pandas Series, rather than an Index, you’ll need the .dt accessor to get access to methods like .day_name(). If ser is your Series, then you’d need ser.dt.day_name().
Now, pass that object to .groupby() to find the average carbon monoxide (co) reading by day of the week:
>>> df.groupby(day_names)["co"].mean()
tstamp
Friday 2.543
Monday 2.017
Saturday 1.861
Sunday 1.438
Thursday 2.456
Tuesday 2.382
Wednesday 2.401
Name: co, dtype: float64
The split-apply-combine process behaves largely the same as before, except that the splitting this time is done on an artificially created column. This column doesn’t exist in the DataFrame itself, but rather is derived from it.
What if you wanted to group not just by day of the week, but by hour of the day? That result should have 7 * 24 = 168 observations. To accomplish that, you can pass a list of array-like objects. In this case, you’ll pass pandas Int64Index objects:
>>> hr = df.index.hour
>>> df.groupby([day_names, hr])["co"].mean().rename_axis(["dow", "hr"])
dow hr
Friday 0 1.936
1 1.609
2 1.172
3 0.887
4 0.823
...
Wednesday 19 4.147
20 3.845
21 2.898
22 2.102
23 1.938
Name: co, Length: 168, dtype: float64
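If you’d rather view those stacked day-hour pairs as a grid, one option is to chain .unstack() onto the result, which pivots the innermost index level into columns. A sketch on a small synthetic hourly series (not the air quality data itself):

```python
import pandas as pd

rng = pd.date_range("2021-01-04", periods=96, freq="h")  # four full days
df = pd.DataFrame({"co": range(96)}, index=rng)

# Group on two derived arrays, then pivot the hour level into columns
table = df.groupby([df.index.day_name(), df.index.hour])["co"].mean().unstack()
print(table.shape)  # one row per day name, one column per hour
```

On the full dataset, the same chain would turn the 168-row Series into a 7-by-24 table.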
Here’s one more similar case that uses .cut() to bin the temperature values into discrete intervals:
>>> import pandas as pd
>>> bins = pd.cut(df["temp_c"], bins=3, labels=("cool", "warm", "hot"))
>>> df[["rel_hum", "abs_hum"]].groupby(bins).agg(["mean", "median"])
rel_hum abs_hum
mean median mean median
temp_c
cool 57.651 59.2 0.666 0.658
warm 49.383 49.3 1.183 1.145
hot 24.994 24.1 1.293 1.274
In this case, bins is actually a Series:
>>> type(bins)
<class 'pandas.core.series.Series'>
>>> bins.head()
tstamp
2004-03-10 18:00:00 cool
2004-03-10 19:00:00 cool
2004-03-10 20:00:00 cool
2004-03-10 21:00:00 cool
2004-03-10 22:00:00 cool
Name: temp_c, dtype: category
Categories (3, object): [cool < warm < hot]
Whether it’s a Series, NumPy array, or list doesn’t matter. What’s important is that bins still serves as a sequence of labels, comprising cool, warm, and hot. If you really wanted to, then you could also use a Categorical array or even a plain old list:
- Native Python list: df.groupby(bins.tolist())
- pandas Categorical array: df.groupby(bins.values)
As you can see, .groupby() is smart and can handle a lot of different input types. Any of these would produce the same result because all of them function as a sequence of labels on which to perform the grouping and splitting.
Resampling
You’ve grouped df by the day of the week with df.groupby(day_names)["co"].mean(). Now consider something different. What if you wanted to group by an observation’s year and quarter? Here’s one way to accomplish that:
>>> # See an easier alternative below
>>> df.groupby([df.index.year, df.index.quarter])["co"].agg(
... ["max", "min"]
... ).rename_axis(["year", "quarter"])
max min
year quarter
2004 1 8.1 0.3
2 7.3 0.1
3 7.5 0.1
4 11.9 0.1
2005 1 8.7 0.1
2 5.0 0.3
This whole operation can, alternatively, be expressed through resampling. One of the uses of resampling is as a time-based groupby. All that you need to do is pass a frequency string, such as "Q" for "quarterly", and pandas will do the rest:
>>> df.resample("Q")["co"].agg(["max", "min"])
max min
tstamp
2004-03-31 8.1 0.3
2004-06-30 7.3 0.1
2004-09-30 7.5 0.1
2004-12-31 11.9 0.1
2005-03-31 8.7 0.1
2005-06-30 5.0 0.3
Often, when you use .resample() you can express time-based grouping operations in a much more succinct manner. The result may be a tiny bit different than the more verbose .groupby() equivalent, but you’ll often find that .resample() gives you exactly what you’re looking for.
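A middle ground between the two is pd.Grouper, which plugs a time frequency directly into a .groupby() call. A sketch on a small synthetic monthly series (note that newer pandas versions prefer the "QE" alias over "Q"):

```python
import pandas as pd

rng = pd.date_range("2021-01-01", periods=6, freq="MS")
df = pd.DataFrame({"co": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=rng)

# pd.Grouper groups on the DatetimeIndex by calendar quarter
print(df.groupby(pd.Grouper(freq="Q"))["co"].agg(["max", "min"]))
```

One advantage of pd.Grouper over .resample() is that you can mix it with ordinary column keys in the same .groupby() call.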
Example 3: News Aggregator Dataset
Now you’ll work with the third and final dataset, which holds metadata on several hundred thousand news articles and groups them into topic clusters:
# pandas_news.py

import pandas as pd

def parse_millisecond_timestamp(ts):
    """Convert ms since Unix epoch to UTC datetime instance."""
    return pd.to_datetime(ts, unit="ms")

df = pd.read_csv(
    "groupby-data/news.csv",
    sep="\t",
    header=None,
    index_col=0,
    names=["title", "url", "outlet", "category", "cluster", "host", "tstamp"],
    parse_dates=["tstamp"],
    date_parser=parse_millisecond_timestamp,
    dtype={
        "outlet": "category",
        "category": "category",
        "cluster": "category",
        "host": "category",
    },
)
To read the data into memory with the proper dtype, you need a helper function to parse the timestamp column. This is because it’s expressed as the number of milliseconds since the Unix epoch, rather than fractional seconds. If you want to learn more about working with time in Python, check out Using Python datetime to Work With Dates and Times.
Similar to what you did before, you can use the categorical dtype to efficiently encode columns that have a relatively small number of unique values relative to the column length.
Each row of the dataset contains the title, URL, publishing outlet’s name, and domain, as well as the publication timestamp. cluster is a random ID for the topic cluster to which an article belongs. category is the news category and contains the following options:
- b for business
- t for science and technology
- e for entertainment
- m for health
Here’s the first row:
>>> from pandas_news import df
>>> df.iloc[0]
title Fed official says weak data caused by weather,...
url http://www.latimes.com/business/money/la-fi-mo...
outlet Los Angeles Times
category b
cluster ddUyU0VZz0BRneMioxUPQVP6sIxvM
host www.latimes.com
tstamp 2014-03-10 16:52:50.698000
Name: 1, dtype: object
Now that you’ve gotten a glimpse of the data, you can begin to ask more complex questions about it.
Using Lambda Functions in .groupby()
This dataset invites a lot more potentially involved questions. Here’s a random but meaningful one: which outlets talk most about the Federal Reserve? Assume for simplicity that this entails searching for case-sensitive mentions of "Fed". Bear in mind that this may generate some false positives with terms like "Federal government".
To count mentions by outlet, you can call .groupby() on the outlet, and then quite literally .apply() a function on each group using a Python lambda function:
>>> df.groupby("outlet", sort=False)["title"].apply(
... lambda ser: ser.str.contains("Fed").sum()
... ).nlargest(10)
outlet
Reuters 161
NASDAQ 103
Businessweek 93
Investing.com 66
Wall Street Journal (blog) 61
MarketWatch 56
Moneynews 55
Bloomberg 53
GlobalPost 51
Economic Times 44
Name: title, dtype: int64
Let’s break this down since there are several method calls made in succession. Like before, you can pull out the first group and its corresponding pandas object by taking the first tuple from the pandas GroupBy iterator:
>>> title, ser = next(iter(df.groupby("outlet", sort=False)["title"]))
>>> title
'Los Angeles Times'
>>> ser.head()
1 Fed official says weak data caused by weather,...
486 Stocks fall on discouraging news from Asia
1124 Clues to Genghis Khan's rise, written in the r...
1146 Elephants distinguish human voices by sex, age...
1237 Honda splits Acura into its own division to re...
Name: title, dtype: object
In this case, ser is a pandas Series rather than a DataFrame. That’s because you followed up the .groupby() call with ["title"]. This effectively selects that single column from each sub-table.
Next comes .str.contains("Fed"). This returns a Boolean Series that’s True when an article title registers a match on the search. Sure enough, the first row starts with "Fed official says weak data caused by weather,..." and lights up as True:
>>> ser.str.contains("Fed")
1 True
486 False
1124 False
1146 False
1237 False
...
421547 False
421584 False
421972 False
422226 False
422905 False
Name: title, Length: 1976, dtype: bool
The next step is to .sum() this Series. Since bool is technically just a specialized type of int, you can sum a Series of True and False just as you would sum a sequence of 1 and 0:
>>> ser.str.contains("Fed").sum()
17
The result is the number of mentions of "Fed" by the Los Angeles Times in the dataset. The same routine gets applied for Reuters, NASDAQ, Businessweek, and the rest of the lot.
Improving the Performance of .groupby()
Now backtrack again to .groupby().apply() to see why this pattern can be suboptimal. To get some background information, check out How to Speed Up Your pandas Projects. What may happen with .apply() is that it’ll effectively perform a Python loop over each group. While the .groupby().apply() pattern can provide some flexibility, it can also inhibit pandas from otherwise using its Cython-based optimizations.
All that is to say that whenever you find yourself thinking about using .apply(), ask yourself if there’s a way to express the operation in a vectorized way. In that case, you can take advantage of the fact that .groupby() accepts not just one or more column names, but also many array-like structures:
- A one-dimensional NumPy array
- A list
- A pandas Series or Index
Also note that .groupby() is a valid instance method for a Series, not just a DataFrame, so you can essentially invert the splitting logic. With that in mind, you can first construct a Series of Booleans that indicate whether or not the title contains "Fed":
>>> mentions_fed = df["title"].str.contains("Fed")
>>> type(mentions_fed)
<class 'pandas.core.series.Series'>
Now, .groupby() is also a method of Series, so you can group one Series on another:
>>> import numpy as np
>>> mentions_fed.groupby(
... df["outlet"], sort=False
... ).sum().nlargest(10).astype(np.uintc)
outlet
Reuters 161
NASDAQ 103
Businessweek 93
Investing.com 66
Wall Street Journal (blog) 61
MarketWatch 56
Moneynews 55
Bloomberg 53
GlobalPost 51
Economic Times 44
Name: title, dtype: uint32
The two Series don’t need to be columns of the same DataFrame object. They just need to be of the same shape:
>>> mentions_fed.shape
(422419,)
>>> df["outlet"].shape
(422419,)
Finally, you can cast the result back to an unsigned integer with np.uintc if you’re determined to get the most compact result possible. Here’s a head-to-head comparison of the two versions that’ll produce the same result:
# pandas_news_performance.py

import timeit

import numpy as np

from pandas_news import df

def test_apply():
    """Version 1: using `.apply()`"""
    df.groupby("outlet", sort=False)["title"].apply(
        lambda ser: ser.str.contains("Fed").sum()
    ).nlargest(10)

def test_vectorization():
    """Version 2: using vectorization"""
    mentions_fed = df["title"].str.contains("Fed")
    mentions_fed.groupby(
        df["outlet"], sort=False
    ).sum().nlargest(10).astype(np.uintc)

print(f"Version 1: {timeit.timeit(test_apply, number=3)}")
print(f"Version 2: {timeit.timeit(test_vectorization, number=3)}")
You use the timeit module to estimate the running time of both versions. If you want to learn more about testing the performance of your code, then Python Timer Functions: Three Ways to Monitor Your Code is worth a read.
Now, run the script to see how both versions perform:
(venv) $ python pandas_news_performance.py
Version 1: 2.5422707499965327
Version 2: 0.3260433749965159
When run three times, the test_apply() function takes 2.54 seconds, while test_vectorization() takes just 0.33 seconds. This is an impressive difference in CPU time for a few hundred thousand rows. Consider how dramatic the difference becomes when your dataset grows to a few million rows!
Note: This example glazes over a few details in the data for the sake of simplicity. Namely, the search term "Fed" might also find mentions of things like "Federal government".
Series.str.contains() also takes a compiled regular expression as an argument if you want to get fancy and use an expression involving a negative lookahead.
You may also want to count not just the raw number of mentions, but the proportion of mentions relative to all articles that a news outlet produced.
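For example, a compiled pattern with a negative lookahead can rule out "Federal" while still matching "Fed". A sketch with two hypothetical titles:

```python
import re

import pandas as pd

titles = pd.Series([
    "Fed official says weak data caused by weather",
    "Federal government shutdown looms",
])

# (?!eral) rejects any match where "Fed" is immediately followed by "eral"
print(titles.str.contains(re.compile(r"Fed(?!eral)")))
```

The first title matches, while the second doesn’t, even though both contain the literal characters "Fed".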
pandas GroupBy: Putting It All Together
If you call dir() on a pandas GroupBy object, then you’ll see enough methods there to make your head spin! It can be hard to keep track of all of the functionality of a pandas GroupBy object. One way to clear the fog is to compartmentalize the different methods into what they do and how they behave.
Broadly, methods of a pandas GroupBy object fall into a handful of categories:
- Aggregation methods (also called reduction methods) combine many data points into an aggregated statistic about those data points. An example is to take the sum, mean, or median of ten numbers, where the result is just a single number.
- Filter methods come back to you with a subset of the original DataFrame. This most commonly means using .filter() to drop entire groups based on some comparative statistic about that group and its sub-table. It also makes sense to include under this definition a number of methods that exclude particular rows from each group.
- Transformation methods return a DataFrame with the same shape and indices as the original, but with different values. With both aggregation and filter methods, the resulting DataFrame will commonly be smaller in size than the input DataFrame. This is not true of a transformation, which transforms individual values themselves but retains the shape of the original DataFrame.
- Meta methods are less concerned with the original object on which you called .groupby(), and more focused on giving you high-level information such as the number of groups and the indices of those groups.
- Plotting methods mimic the API of plotting for a pandas Series or DataFrame, but typically break the output into multiple subplots.
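As a quick taste of the transformation category, .transform() broadcasts a per-group statistic back to the shape of the original input. A sketch on a toy frame (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({"state": ["AK", "AK", "AL"], "n": [1.0, 3.0, 5.0]})

# Each row receives its own group's mean: same length as the input
df["state_mean"] = df.groupby("state")["n"].transform("mean")
print(df["state_mean"].tolist())  # [2.0, 2.0, 5.0]
```

Contrast this with an aggregation like .mean(), which would collapse the three rows down to one row per state.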
The official documentation has its own explanation of these categories. They are, to some degree, open to interpretation, and this tutorial might diverge in slight ways in classifying which method falls where.
There are a few methods of pandas GroupBy objects that don’t fall nicely into the categories above. These methods usually produce an intermediate object that’s not a DataFrame or Series. For instance, df.groupby().rolling() produces a RollingGroupby object, which you can then call aggregation, filter, or transformation methods on.
If you want to dive in deeper, then the API documentation for DataFrame.groupby(), DataFrame.resample(), and pandas.Grouper is a good resource for exploring methods and objects.
There’s also yet another separate table in the pandas docs with its own classification scheme. Pick whichever works for you and seems most intuitive!
Conclusion
In this tutorial, you’ve covered a ton of ground on .groupby(), including its design, its API, and how to chain methods together to get data into a structure that suits your purpose.
You’ve learned:
- How to use pandas GroupBy operations on real-world data
- How the split-apply-combine chain of operations works and how you can decompose it into steps
- How methods of a pandas GroupBy object can be categorized based on their intent and result
There’s much more to .groupby() than you can cover in one tutorial. But hopefully this tutorial was a good starting point for further exploration!