pandas GroupBy: Grouping Real World Data in Python (Summary)

In this course, you’ve covered a ton of ground on .groupby(), including its design, its API, and how to chain methods together to get data into a structure that suits your purpose.

You’ve learned:

  • How to use pandas GroupBy operations on real-world data
  • How the split-apply-combine chain of operations works and how you can decompose it into steps
  • How methods of a pandas GroupBy object can be categorized based on their intent and result

There’s much more to .groupby() than you can cover in one course. But hopefully this was a good starting point for further exploration!

You can download the source code for all the examples in this course by clicking on the link below:

  • Sample Code (.zip), 28.3 MB
  • Course Slides (.pdf), 3.3 MB

00:00 In the previous lesson, I showed you the consequence of leaving pandas-land and embracing the snake. In short, Python will squeeze the life out of your performance.

00:09 This final lesson is a summary of the course and includes some future material you might be interested in.

00:15 pandas is a powerful tool for dealing with table-based data in Python. A table in pandas lives inside an object called a DataFrame. Since a DataFrame is a table structure, it has rows and columns where the rows have an index you use to refer to them, and the columns have names.

00:35 A Series is a special kind of sequence in pandas that has a listing of data along with a corresponding index. The DataFrame is composed of a set of Series objects, one for each column.
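As a small illustration of the DataFrame and Series relationship described above, here is a minimal sketch. The column names, index labels, and values are made up for the example and are not from the course's sample data.

```python
import pandas as pd

# Hypothetical sample data: three rows with a custom index, two named columns.
df = pd.DataFrame(
    {"state": ["CA", "TX", "NY"], "population_m": [39.0, 30.0, 19.6]},
    index=["a", "b", "c"],
)

row = df.loc["b"]      # rows are looked up through the index
column = df["state"]   # each column is a Series that shares the row index

print(type(column).__name__)
print(list(column.index))
```

Selecting a column by name gives you back a Series, and that Series carries the same index as the DataFrame it came from.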

00:48 Although this course gave a refresher on pandas basics, it was really about grouping data for processing. A group-by operation uses a split-apply-combine pattern, where the split chunks up the data, the apply performs an action on the chunks, and pandas combines the results into a DataFrame or Series.
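To make the pattern concrete, the three steps can be spelled out by hand and compared with the one-line version. This is only a sketch with made-up data (the "region" and "sales" columns are assumptions), not how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {"region": ["West", "East", "West", "East"], "sales": [10, 20, 30, 40]}
)

# Split: iterating over a GroupBy object yields (key, sub-DataFrame) pairs.
chunks = {key: chunk for key, chunk in df.groupby("region")}

# Apply: run an action on each chunk.
totals = {key: chunk["sales"].sum() for key, chunk in chunks.items()}

# Combine: gather the per-chunk results into a single Series.
by_hand = pd.Series(totals, name="sales").rename_axis("region")

# The usual one-liner, where pandas handles all three steps for you.
built_in = df.groupby("region")["sales"].sum()
```

Both by_hand and built_in hold one total per region. The manual version is there to show the decomposition, not as something you would normally write.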

01:08 pandas provides lots of functions for the apply stage, including functions for aggregating data, like counts and sums, functions for transforming data, and functions for filtering the result by removing groups that aren’t of interest. The split part of the pattern is performed by the .groupby() method on a DataFrame object.

01:30 This method returns a DataFrameGroupBy object, which contains the methods for the apply stage of the operation. For example, the .count() method counts the items inside a grouping.
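Here is a short sketch of that object and its .count() method, using made-up data. Note that .count() counts non-null entries, while .size() counts rows, which is why the example includes a missing value.

```python
import pandas as pd

# Hypothetical sample data with one missing sales value.
df = pd.DataFrame(
    {"region": ["West", "East", "West"], "sales": [10.0, None, 30.0]}
)

grouped = df.groupby("region")
print(type(grouped).__name__)   # the class name of the grouped object

counts = grouped.count()        # non-null entries per column, per group
sizes = grouped.size()          # rows per group, nulls included
```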

01:42 The combine part of the process is done for you, with pandas returning a DataFrame or Series object containing the results.
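The three kinds of apply-stage methods mentioned above (aggregating, transforming, and filtering) differ mainly in the shape of what comes back. The sketch below uses made-up data and an arbitrary threshold of 40 to show one of each.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "region": ["West", "East", "West", "East", "North"],
        "sales": [10, 20, 30, 40, 5],
    }
)
grouped = df.groupby("region")["sales"]

# Aggregation: one value per group.
totals = grouped.sum()

# Transformation: one value per original row, aligned to df's index.
share = df["sales"] / grouped.transform("sum")

# Filtering: keep only the rows of groups that pass a test.
big_regions = df.groupby("region").filter(lambda g: g["sales"].sum() >= 40)
```

Here totals has three entries (one per region), share has five (one per row), and big_regions keeps the four rows belonging to the West and East groups.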

01:50 The most common way of grouping data is by finding matching values in a column, for example, all the states in a country. That’s not the only way, though. You can also group on a set of columns or on any Series that has the same length as the DataFrame being grouped.
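The three ways of choosing groups can be sketched side by side. The data and the is_large Series below are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "country": ["US", "US", "CA", "CA"],
        "state": ["NY", "NY", "ON", "BC"],
        "sales": [1, 2, 3, 4],
    }
)

# Group on matching values in a single column.
by_column = df.groupby("country")["sales"].sum()

# Group on a set of columns; the result gets a MultiIndex.
by_columns = df.groupby(["country", "state"])["sales"].sum()

# Group on a separate Series with one entry per row of df.
is_large = (df["sales"] >= 3).rename("is_large")
by_series = df.groupby(is_large)["sales"].sum()
```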

02:06 One of the methods on the DataFrameGroupBy object is the generic .apply() method, which takes a callable such as a lambda. If the built-in apply-stage methods aren’t enough for you, you can write your own.

02:17 Of course, using a lambda means leaving the low-level world of pandas and popping up into the higher-level abstraction of Python. This has performance consequences. As much as possible, try to do your data manipulation at the pandas level.
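As a rough sketch of that trade-off, the same per-group calculation can be written with a lambda passed to .apply() or with built-in methods only. The data is made up, and no timings are shown here; how big the difference is depends on your data.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {"region": ["West", "East", "West", "East"], "sales": [10, 20, 30, 40]}
)
grouped = df.groupby("region")["sales"]

# Custom Python callable: called once per group, flexible but runs in Python.
spread_lambda = grouped.apply(lambda s: s.max() - s.min())

# Same calculation using built-in aggregations that stay inside pandas.
spread_builtin = grouped.max() - grouped.min()
```

Both give the spread between the largest and smallest sale in each region. The second form is usually the one to reach for first when a built-in equivalent exists.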

02:33 The pandas documentation is a bit of a mixed bag. It is very detailed, but if you’re newer to pandas, you might find some of it opaque. These are three URLs all with the same host name.

02:45 The first is the general documentation, the second is the split-apply-combine user guide, and the third is the actual GroupBy API call. If you want to learn even more about GroupBy, especially all the different things you can do at the apply stage, the split-apply-combine guide is probably the place you want to start.

03:05 It includes a comprehensive list of all the aggregation, transformation, and filtering methods, and as it’s a guide rather than API docs, it’s easier to read than the third item there.

03:17 For a deeper dive into pandas in general, this is a good place to start. It is available as both a written tutorial and a video course. Note that for SEO reasons, the tutorial and courses have slightly different names.

03:28 Don’t worry if it doesn’t quite match what I’ve written here. For some practice with pandas, this guide shows you how to use the library in a real-world scenario.

03:38 pandas integrates well with Matplotlib. In fact, you can do some graphing things as part of the split-apply-combine pattern. To learn more about data visualization with pandas, see this tutorial or course.

03:50 And finally, for a deeper dive, this tutorial and course cover interesting little tidbits that can help you do better data processing.

03:59 That’s all from me. My 50 little pandas say thank you for your attention. I hope you enjoyed the course.
