Grouping Data With itertools.groupby()

Functional Programming in Python Dan Bader 03:20

Now that you know how to use the reduce() function and Python’s defaultdict class, which is defined in the collections module, it’s time to look at some useful helpers in the itertools module, such as itertools.groupby.

In the next section of this course, you’ll learn how to do parallel programming in Python using functional programming principles and the multiprocessing module. You’ll start by taking the example data set based on an immutable data structure that you previously transformed using the built-in map() function. But this time, you’ll process the data in parallel, across multiple CPU cores using the Python multiprocessing module available in the standard library.

00:00 All right. I want to end this reducer() example with another, well, arguably more Pythonic version of what we looked at previously. You can see I played with this a bunch because, well, this here is called scientists_by_field5. I was basically trying to come up with ways to do this grouping in better and more readable ways.

00:22 Now, this is based on a dictionary comprehension, and this kind of fits the theme that happened in the other videos in this series as well, where I showed you kind of the classical functional programming approach, and then showed you a more Pythonic version where we were often using list comprehensions or generator expressions to get to the same result, but kind of do it in a more Pythonic, more readable way.

00:43 I’m not sure if that’s the case here, like, I’m not sure if this is more readable, but you can do it. And there’s actually a helper function in Python that is the itertools.groupby() function. It does stuff like that.

00:57 It can group things by a keyfunc. So here, I’m grouping these items by their .field, and then you have to do some fiddling here to get the keys and the values set the right way.
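As a rough sketch of the grouping described here: the Scientist namedtuple and the data below are a stand-in reconstructed from the output quoted in the comments, and the order of the records is an assumption. The by_field helper is a hypothetical name. The input is sorted by the same key first, because groupby() only merges items that sit next to each other.

```python
import itertools
from collections import namedtuple

# Stand-in for the course's data set; the record order is an assumption.
Scientist = namedtuple("Scientist", ["name", "field", "born", "nobel"])

scientists = (
    Scientist(name="Ada Lovelace", field="math", born=1815, nobel=False),
    Scientist(name="Emmy Noether", field="math", born=1882, nobel=False),
    Scientist(name="Marie Curie", field="physics", born=1867, nobel=True),
    Scientist(name="Tu Youyou", field="chemistry", born=1930, nobel=True),
    Scientist(name="Ada Yonath", field="chemistry", born=1939, nobel=True),
    Scientist(name="Vera Rubin", field="astronomy", born=1928, nobel=False),
    Scientist(name="Sally Ride", field="physics", born=1951, nobel=False),
)


def by_field(scientist):
    """Key function (hypothetical name): group scientists by their field."""
    return scientist.field


# groupby() yields (key, group_iterator) pairs for runs of equal keys,
# so sort by the same key function before grouping.
ordered = sorted(scientists, key=by_field)

scientists_by_field = {
    field: [s.name for s in group]
    for field, group in itertools.groupby(ordered, key=by_field)
}

print(scientists_by_field)
```

Unpacking each pair into field and group avoids the item[0] and item[1] indexing, which may be part of the "fiddling" that makes the one-liner harder to read.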

01:14 So, I mean, arguably, this is more Pythonic because it uses a dictionary comprehension, but I’m not sure if this reads much better. But, you know, it gets around the need for the defaultdict. So, you know, I showed you a couple of ways to do it.

01:26 I’m sort of tempted actually to drop this crazy lambda expression here on you… you know what? The hell with it, I’ll just do it here. Okay. So, this is what I came up with. scientists_by_field…

01:42 has the same result and it uses a lambda function instead of a separately defined reducer() function. It also uses this dictionary merge syntax available in Python 3.5.
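The exact expression from the video is not reproduced in this transcript, so the following is only a guess at its general shape: a reduce() call whose lambda builds a new dictionary on every step with the {**mapping, key: value} unpacking syntax. The trimmed Scientist tuple and the sample records are assumptions made for illustration.

```python
from collections import namedtuple
from functools import reduce

# Trimmed, hypothetical record type: only the two fields the grouping needs.
Scientist = namedtuple("Scientist", ["name", "field"])

scientists = (
    Scientist("Ada Lovelace", "math"),
    Scientist("Marie Curie", "physics"),
    Scientist("Emmy Noether", "math"),
    Scientist("Sally Ride", "physics"),
)

# Each step copies the accumulator and replaces one key with an extended
# list, so no dictionary or list is ever mutated in place.
scientists_by_field = reduce(
    lambda acc, s: {**acc, s.field: acc.get(s.field, []) + [s.name]},
    scientists,
    {},
)

print(scientists_by_field)
```

Because every step copies the whole accumulator, this does far more work than a plain loop with a defaultdict, which is one more reason to treat it as an exercise rather than production code.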

01:54 But, this is pretty gnarly and crazy code. I mean, it works, but when you look at this, it gets very, very arcane, so please don’t write code like that when you’re working with other people.

02:05 Sometimes it’s fun to sit down and spend some time to try and come up with, I guess, like, a single-line solution for this problem, but this is more like a fun exercise rather than something you should do in practice and in production code. But anyway, I hope this gave you a better idea of what the reduce() function could be used for and maybe also some ideas on how it could be used in more creative ways to achieve that grouping, for example, and not just for the classical examples where, you know, you have this here, where we’re adding up a bunch of values and kind of boiling it down to a single integer, or something like that.

02:43 So, I hope we achieved that. I hope you learned a bunch of things about functional programming in Python here. And at this point, you should have a pretty good understanding of what functional programming is, what the filter(), map(), and reduce() functions are—which are kind of the core primitives of functional programming—how they work in Python, and how you should probably not use them in Python, or

03:08 use them in different ways—for example, by replacing them with list comprehensions or generator expressions. Happy Pythoning, and have a good one.
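To make that last point concrete, here is a small sketch that computes the same number twice: once with filter(), map(), and reduce(), and once with a generator expression passed to sum(). The birth years and the variable names are made up for illustration.

```python
from functools import reduce

# Hypothetical birth years used only for illustration.
birth_years = [1815, 1882, 1867, 1930, 1939, 1928, 1951]

# Classical functional style: filter, then map, then reduce to one integer.
total_functional = reduce(
    lambda total, span: total + span,
    map(
        lambda year: 2000 - year,
        filter(lambda year: year >= 1900, birth_years),
    ),
    0,
)

# The same computation with a generator expression and the built-in sum().
total_pythonic = sum(2000 - year for year in birth_years if year >= 1900)

print(total_functional, total_pythonic)
```

Both versions stay lazy until the final reduction, but the second one reads top to bottom without nested lambdas.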

andomar on April 2, 2020

The groupby example only works because your list is already sorted by field.

See “Generally, the iterable needs to already be sorted on the same key function.” docs.python.org/3.5/library/itertools.html#itertools.groupby
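A minimal sketch of that pitfall, using hypothetical (name, field) pairs rather than the course's Scientist objects: when the input is not sorted, groupby() starts a new group every time the key changes, and in a dictionary comprehension the later group replaces the earlier one that has the same key.

```python
import itertools

# Hypothetical (name, field) pairs, deliberately not sorted by field.
pairs = [
    ("Marie Curie", "physics"),
    ("Tu Youyou", "chemistry"),
    ("Sally Ride", "physics"),
]


def field_of(pair):
    """Key function (hypothetical name): the field is the second element."""
    return pair[1]


# Unsorted input: "physics" shows up as two separate runs, and the
# second run overwrites the first under the same dictionary key.
unsorted_groups = {
    field: [name for name, _ in group]
    for field, group in itertools.groupby(pairs, key=field_of)
}

# Sorting by the same key first puts equal keys next to each other.
sorted_groups = {
    field: [name for name, _ in group]
    for field, group in itertools.groupby(sorted(pairs, key=field_of), key=field_of)
}

print(unsorted_groups)
print(sorted_groups)
```

In the unsorted case Marie Curie is silently dropped, which matches the symptom reported further down in this thread.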

Chris James on April 20, 2020

It took me a little head scratching to figure out how to make the groupby version just display the names and not the whole Scientist object. This is what I came up with:

import itertools
scientists_by_field = {
    item[0]: list(x.name for x in item[1])
    for item in itertools.groupby(scientists, lambda x: x.field)
}
scientists_by_field

Because groupby returns a ‘grouper’ iterator, you can also make a dictionary of tuples like so:

import itertools
scientists_by_field = {
    item[0]: tuple(x.name for x in item[1])
    for item in itertools.groupby(scientists, lambda x: x.field)
}
scientists_by_field

Igor Conrado Alves de Lima on April 26, 2020

The usage of itertools.groupby in the video is actually not correct. As @andomar pointed out, in order to use itertools.groupby the iterable should already be sorted. That’s why we don’t see Marie Curie in the physics group.

Here is the appropriate code:

import itertools

scientists_sorted_by_field = sorted(scientists, key=lambda x: x.field)
scientists_by_field = {
    item[0]: tuple(item[1])
    for item in itertools.groupby(scientists_sorted_by_field,
        lambda x: x.field)
}
scientists_by_field

This will produce the following output:

{'astronomy': (Scientist(name='Vera Rubin', field='astronomy', born=1928, nobel=False),),
 'chemistry': (Scientist(name='Tu Youyou', field='chemistry', born=1930, nobel=True),
  Scientist(name='Ada Yonath', field='chemistry', born=1939, nobel=True)),
 'math': (Scientist(name='Ada Lovelace', field='math', born=1815, nobel=False),
  Scientist(name='Emmy Noether', field='math', born=1882, nobel=False)),
 'physics': (Scientist(name='Marie Curie', field='physics', born=1867, nobel=True),
  Scientist(name='Sally Ride', field='physics', born=1951, nobel=False))}

Hope it helps.

Dan Bader RP Team on April 27, 2020

Fantastic, thank you for the clarification andomar & Igor! Really appreciate it.

Tom R on July 18, 2021

Hi Dan, you have me curious about how you are getting automatic detail of any built-in function you type in the interpreter just below the command line?

Bartosz Zaczyński RP Team on July 18, 2021

@Tom R That question comes up very often. Dan uses bpython as an alternative Python interpreter. It comes with context-aware docstrings and other amenities out of the box 😊
