
Introducing GroupBy

00:00 In the previous lesson, I showed you the split-apply-combine pattern that pandas uses when doing a GroupBy operation. In this lesson, I’ll show you some actual code that does a GroupBy.

00:12 In the review lessons, I cobbled together a few lines of data. From here on in, I’m going to use some actual data files with real meat in them. All of the data files I use in the course are available along with the source code in the Supporting Material dropdown in the notes below this video. My first example GroupBy, which I’ll call the “Hello world” version, uses a dataset that contains info on historic US Congress members.

00:37 This slide shows a little sample. It’s been chopped in a couple ways. It doesn’t show all the columns and it definitely doesn’t show all the rows. Like with most real world data, it’s dirty.

00:48 Being a Canadian who’s somewhat naive about American history, I would expect the party column to have things like Democrat and Republican. It does, but it also has things like Anti-Administration, and some are just plain old empty.

01:00 Depending on what you plan on doing with the data, that may or may not be a problem. The CSV file that I’m using has 33 columns and almost 12,000 rows. I’m going to demonstrate primarily in the REPL, but first I’m going to write a short program that imports the CSV into a DataFrame.

01:18 I could do that in the REPL as well, but it’s easier to highlight certain parts of it if I do it as code. Let’s go look at that importer.

01:27 I’m inside legislators.py. My intent here is to create a DataFrame based on the US Congress members CSV file, and then I’ll import that DataFrame object into the REPL and play with it.

01:39 To start off, I import pandas and as is the norm, I alias it to pd. Next, I create a dict to contain information on the types of data that I’ll be using.

01:49 You don’t have to do this, but pandas has a data type called category, which when used looks for commonality. Without specifying a data type of category, gender would end up as a series of Python objects, but by specifying that gender is a category, pandas will whittle it down to just the distinct values in that column.

02:08 This can make a big difference on performance, especially with something like gender, which only has a small number of values.

02:16 Reading data from a CSV file is so common that pandas includes a utility for doing it and sensibly that’s named read_csv. The first argument is the file name of the data file to be read in and then I pass in the dtypes structure to tell it to replace the default data types of the objects that are being read with the ones that I specified in the dict above. And because I’m not interested in all of the columns, I’m specifying which of those I want in the DataFrame.

02:46 I want the things named in the dtype dict as well as the birthday and last_name columns, giving me a total of seven in the DataFrame.

02:54 And since birthday is a date, rather than store it as a Python object, I want it converted to a pandas timestamp. Using the parse_dates argument, I tell pandas to try to do something with the date data in that column.

03:07 The CSV stores the birthdays in year-month-day format, which is pandas’ default expectation, so I really don’t have to tell it anything else. I mentioned in the review portion that a pandas timestamp is restricted.

03:20 The minimum allowed date is 1677. Since the US didn’t get a Congress until almost a hundred years after that, our dates can be parsed without a problem.
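For reference, the importer described above looks roughly like this. It’s a sketch, not the lesson’s exact file: the CSV on disk is replaced with a two-row inline sample so the snippet is self-contained, and the exact contents of the dtype dict are an assumption based on the seven columns mentioned.

```python
import io

import pandas as pd

# Stand-in for the real CSV file; the column names mirror the ones the
# lesson mentions, but these two rows are made up for illustration.
csv_data = io.StringIO(
    "last_name,first_name,birthday,gender,type,state,party\n"
    "Adams,John,1735-10-30,M,rep,MA,Federalist\n"
    "Clay,Henry,1777-04-12,M,sen,KY,Whig\n"
)

# Columns whose values repeat a lot become pandas categories
dtypes = {
    "first_name": "category",
    "gender": "category",
    "type": "category",
    "state": "category",
    "party": "category",
}

df = pd.read_csv(
    csv_data,
    dtype=dtypes,                               # override the default object dtype
    usecols=list(dtypes) + ["birthday", "last_name"],  # seven columns total
    parse_dates=["birthday"],                   # store birthdays as timestamps
)
print(df.dtypes)
```

With the real file, the first argument would be the path to the Congress members CSV instead of the StringIO object.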

03:31 Okay, now that I’ve got this code, I can go off to the REPL and then import the DataFrame from there.

03:39 Importing, let’s take a look at the data. And there are a few things to look at here. First, the index for each row is auto-generated using an integer counter.

03:54 Second, empty values for a category (remember, party was specified as the category data type) get listed as NaN. No, that’s not tasty Indian bread, it’s short for “not a number”.

04:06 Yeah, I know, naan the bread is spelled with two As, but stop nit-picking my silly jokes. Speaking of data types, here you can see that the date parsing happened.

04:16 The birthday column is datetime64 with nanosecond resolution. Alright, now what you came here for. I’m gonna start grouping by calling the .groupby() method on the DataFrame.

04:28 I’m going to group the Congress members by the state that they represent. The output here’s a bit messy and hard to read. There really are only two lines.

04:40 Let’s start at the bottom, which is the DataFrameGroupBy object. Remember, .groupby() is lazy: it returns this object and doesn’t give you a resulting DataFrame until you’ve performed the apply step of split-apply-combine. The lines above aren’t part of the result, they’re a warning message. Pandas 2 has made a bunch of changes and is planning on doing even more.
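The split step on its own can be sketched like this, using a toy DataFrame in place of the Congress data:

```python
import pandas as pd

# Toy stand-in for the Congress DataFrame
df = pd.DataFrame({
    "last_name": ["Adams", "Baker", "Clark"],
    "state": ["MA", "PA", "PA"],
})

by_state = df.groupby("state")  # split only: no aggregation has happened yet
print(type(by_state).__name__)  # DataFrameGroupBy
```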

05:04 This message says that the default value for the observed argument to groupby is being changed. You can silence this warning by explicitly setting the value.

05:19 I find the idea of “we have a new default, but you’d better provide it as an argument or we’ll nag you” a little funny. Eventually this warning will go away, and here on the screen is what the same call looks like without the warning.

05:33 As I’m happy with the default, I’m not gonna muddy things from here on by providing the observed argument. I will also, through the magic of video, no longer show you this warning.
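For reference, the observed argument only changes behavior when a grouping column is categorical. A minimal sketch, with a made-up categorical state column:

```python
import pandas as pd

df = pd.DataFrame({
    "last_name": ["Adams", "Baker"],
    # a categorical column with one category ("KY") that never appears
    "state": pd.Categorical(["MA", "PA"], categories=["MA", "PA", "KY"]),
})

# observed=True keeps only the groups that actually occur in the data;
# passing the argument explicitly also silences the deprecation warning
counts = df.groupby("state", observed=True)["last_name"].count()
print(list(counts.index))  # ['MA', 'PA'] -- no empty 'KY' group
```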

05:42 If you are coding along, you may see it in a few places that aren’t on the screen. Don’t be concerned. The DataFrame has a lot of stuff in it and when grouping you might not want to see it all.

05:53 You can reference a column in the grouping just like you do with the DataFrame itself.

06:02 Note that what came back here is a Series rather than a DataFrame. That’s because I specified a column.

06:09 Alright, that’s the splitting: both grouping on state and using only a subset of the columns. Now let’s aggregate and count the size of each group.

06:24 Yeah, that’s a lot. That’s ’cause there are 50 states. Let me scroll back here. Like I showed in the previous lesson, there are three parts here: calling .groupby() does a split, which I’m further chopping by specifying the last_name column, then the count() method is an aggregator in the apply operation.

06:41 The combine is done by pandas giving you this output.

06:46 I’ve got 50 lines of output for the 50 states in the database. The output shows the name of the group, which is the state and the count of Congress members for each state.
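The whole split-apply-combine round trip just described can be sketched with a toy DataFrame standing in for the real data:

```python
import pandas as pd

df = pd.DataFrame({
    "last_name": ["Adams", "Baker", "Clark", "Davis"],
    "state": ["MA", "MA", "PA", "PA"],
})

# split on state, narrow to one column, apply the count aggregator, and
# pandas combines the per-group results into a single Series
n_by_state = df.groupby("state")["last_name"].count()
print(n_by_state)
# state
# MA    2
# PA    2
# Name: last_name, dtype: int64
```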

06:58 Back down at the bottom here you see what this result consists of, and it shows you that it used the last name and the count is an integer. If I hadn’t included the last_name value for the grouping, I would get count on each of the columns, so there would be a count for last name, a count for first name, and counts for birthday, gender, type, and party, all the other columns in the DataFrame.

07:19 As some of the data is missing, you’d get different counts for each of those columns. For example, Massachusetts has 426 last names, 426 first names, but only 419 birthdays and 410 parties.

07:35 The reason I picked last_name to slice the columns was I know from looking at the data that that value is filled in for every row. Seeing as I only want to count the number of Congress members per state, picking a field that is always populated does the job. Let me do that again.

07:51 This time I’m going to store the result away so I can use it later.

08:00 As I’ve only asked for one column back, I’m actually getting a Series rather than a DataFrame, but like with DataFrames, I can call .head() on a Series.

08:12 Note that the returned results were sorted. This is default behavior, but you can turn it off.

08:25 You might want it off if it isn’t important and you’ve got a lot of data, so not sorting would save you some time.
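A small sketch of turning the sorting off, again on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "last_name": ["Adams", "Baker", "Clark"],
    "state": ["PA", "MA", "PA"],
})

# sort=False keeps the groups in first-seen order instead of sorting the
# keys, which can save time on large datasets when order doesn't matter
counts = df.groupby("state", sort=False)["last_name"].count()
print(list(counts.index))  # ['PA', 'MA']
```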

08:32 You can call .groupby() on more than one column at a time. To do that, you pass a list of column names.

08:50 The result here shows what looks like two pieces of data for each group: a count of each gender for each state.

08:55 Let’s stick this in a result object,
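On toy data, the multi-column grouping looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    "last_name": ["Adams", "Baker", "Clark", "Davis"],
    "state": ["MA", "MA", "PA", "PA"],
    "gender": ["M", "F", "M", "M"],
})

# grouping on a list of columns counts last names per (state, gender) pair
n_by_state_gender = df.groupby(["state", "gender"])["last_name"].count()
print(n_by_state_gender)
```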

09:06 and before I go to the next step, what kind of object do you think n_by_state_gender is?

09:18 Were you right? My guess would’ve been DataFrame because I’ve got a gender and a count and that’s like two columns, but that’s not actually what pandas does when you GroupBy with multiple columns.

09:30 Instead it’s creating a compound index. In this case, each index entry is a combined state-gender value. You can examine this by looking at the index.

09:45 The printout of the grouping above where it had the state and then two genders is a little tricky because it hides the second instance of the state. Doing this makes the table slightly easier to read, but it’s hiding the fact that the index of the second row is actually the compound Arkansas male, rather than being two different columns.

10:05 You see the result of this compounding in the MultiIndex object, with Arkansas and female, and Arkansas and male, being separate tuples. But what, I hear you say, if you don’t want to do this? Well, the as_index argument allows you to change that behavior.
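You can inspect the compound index directly on a toy example:

```python
import pandas as pd

df = pd.DataFrame({
    "last_name": ["Adams", "Baker", "Clark"],
    "state": ["AR", "AR", "AR"],
    "gender": ["F", "M", "M"],
})

result = df.groupby(["state", "gender"])["last_name"].count()

# the Series is indexed by a MultiIndex of (state, gender) tuples
print(type(result.index).__name__)  # MultiIndex
print(list(result.index))           # [('AR', 'F'), ('AR', 'M')]
```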

10:32 By providing as_index=False you instead get back a DataFrame. Note that this means the index is not compound and it also means it isn’t the state or the gender because those are columns.

10:44 Now the index is auto-generated and thus you get the incrementing integer on the left-hand side.
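A sketch of the as_index=False behavior on the same kind of toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "last_name": ["Adams", "Baker", "Clark"],
    "state": ["MA", "PA", "PA"],
    "gender": ["M", "F", "M"],
})

# as_index=False keeps state and gender as ordinary columns, and the
# result gets an auto-generated integer index instead of a MultiIndex
flat = df.groupby(["state", "gender"], as_index=False)["last_name"].count()
print(list(flat.columns))  # ['state', 'gender', 'last_name']
print(list(flat.index))    # [0, 1, 2]
```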

10:53 Now that you’ve seen the .groupby() call, let’s do a deeper dive into what pandas is doing with the split-apply-combine operation.
