Grouping Options

00:00 In the previous lesson, I did a deep dive on how pandas implements the split-apply-combine pattern. In this lesson, I’m going to look at more complex ways of grouping things.

00:11 Up until now, the grouping you’ve done has been based on one or more columns in the DataFrame. There are other ways of deriving values to form the groups, which I’ll be covering here.

00:21 To do this, I’m going to use a new dataset. This one contains air quality measurements from a field in Italy. As the data is scientific, all the units are in metric.

00:32 Like with the Congress members’ data, there are more columns than you’ll need and some of the data is dirty. Two things of note. First, the date format is month day year, and the time is in its own column.

00:45 Second, some of the values for carbon monoxide content are negative. That’s not actually physically possible, and negative 200 is being used as a placeholder for missing data.

00:57 Like before, I’m gonna write a short program that creates the DataFrame object, then import that into the REPL to play with the data.

01:06 This is airqual.py, my short program that creates a DataFrame based on the Italian sensor data. To bend the data into the shape I want, there are a few things that I have to do. Like with the legislators program, I’m using read_csv() to create the initial DataFrame.

01:23 Also like the legislators’ data, there is date and time information that I want to store as a pandas timestamp. The parse_dates argument does support specifying multiple columns, but this feature is deprecated, so although convenient, I’m going to do the date parsing in a different way so that you can see how it should be done in the future.

01:43 I mentioned that some of the values are a negative 200 to indicate missing data. pandas uses the NumPy not-a-number (NaN) indicator for missing data, and converting the negative 200 values to NaN allows you to treat them properly. Mmmm, naan, that and a nice mango lassi.

02:00 Yeah, I know I made the joke before, but it’s definitely yummier than bamboo. Like with the congressional data, I don’t want all the columns, so the usecols argument limits what will be in the DataFrame.
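
Pieced together from the description so far, the top of airqual.py might look something like this. The file name and the exact column labels are my assumptions, patterned after typical sensor CSV headers, not copied from the screen:

```python
import pandas as pd

# Read the sensor data, treating the -200 placeholder as missing (NaN)
# and keeping only the columns this lesson needs.
df = pd.read_csv(
    "airqual.csv",
    na_values=[-200],
    usecols=["Date", "Time", "CO(GT)", "T", "RH", "AH"],
)
```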

02:14 Alright, let’s deal with the timestamp conversion. The data is all in 2004 and 2005, so converting it to a timestamp is fine. It’s within range. I would like a single timestamp with the date and time of each measurement, but the CSV file stores the date and time as separate columns.

02:33 Fortunately, you can do operations on columns. Here I’m using .pop() on the date and time columns to remove them from the DataFrame. Then I’m concatenating them together with a space.

02:45 This produces a column with a date-time string that the to_datetime() function can parse. As the date format isn’t in the ISO standard, I’ve used the format argument to specify that the new data is in month, day, two-digit year, space, hour, minute, second format.

03:04 Remember that to_datetime() returns a series, so the result needs to be stored into the DataFrame. I’m doing that in a new column called tstamp. Since I used pop to get at the date and time columns, those have already been removed from the DataFrame.
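
Continuing the sketch, the pop-and-parse step might read as follows. The separators in the format string are guesses consistent with the month, day, two-digit year description above:

```python
# Remove the two string columns and join them into one date-time string.
date_time = df.pop("Date") + " " + df.pop("Time")

# The format isn't ISO, so spell it out explicitly.
df["tstamp"] = pd.to_datetime(date_time, format="%m/%d/%y %H:%M:%S")
```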

03:21 The column names in the CSV file are a bit ugly: some of them can’t be used as Python identifiers, and I don’t want to have to type all those bracket things.

03:29 And besides that, I also want to show off the rename feature. Calling rename on a DataFrame allows you to rename things.

03:36 In this case, I’m calling it on the columns. Note that the default version of rename returns a new DataFrame. As that means twice the memory, I’m using the inplace=True argument here to tell it to operate on the existing DataFrame instead.
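
A sketch of that rename step, with short identifier-friendly names standing in for whatever appears on screen:

```python
# Map the awkward CSV headers to names usable as Python identifiers.
df.rename(
    columns={"CO(GT)": "co", "T": "temp_c", "RH": "rel_hum", "AH": "abs_hum"},
    inplace=True,
)
```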

03:54 Up until now, I’ve done nothing with the index. The kind of data that I’m playing with here is called time series data, and that means that there is a timestamp associated with each row.

04:05 There are interesting things you can do with this kind of data and it’s easier if you use the timestamp as the index of the DataFrame. As I didn’t specify one in the read_csv(), I got an auto-generated one.

04:17 Instead, by calling set_index, I can change the index to be the tstamp column that I created when I parsed the date and time columns. Like with the rename, the default behavior here is to create a whole new DataFrame.
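
The last line of the sketch would then be:

```python
# Use the parsed timestamps as the index, again without copying the data.
df.set_index("tstamp", inplace=True)
```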

04:33 Once again, I’m using the inplace=True argument so I don’t double the amount of memory used. Alright, all that’s ready to go, off to the REPL and I’ll use this DataFrame.

04:49 Importing it, and there it is. Note the shape. There are 9,357 rows of measurements. The default output is showing the first five and last five rows of the DataFrame.
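
In the REPL, that might look like the following. The module and variable names come from the sketch above; the row count is from the video, and the column count of four follows from the six usecols columns minus the two that were popped:

```python
>>> from airqual import df
>>> df.shape
(9357, 4)
```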

05:04 Recall that there were rows with negative 200 as the CO measurement. Let me look for one of those. The 11th entry had a negative 200, which our import process converted into NaN.
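
One way to spot-check that row, assuming zero-based positional indexing:

```python
>>> df.iloc[10]["co"]  # the 11th row; -200 in the CSV shows up as NaN
nan
```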

05:17 I had a programmer friend named Nan once. He hated floating-point jokes. As this is time-series data, let’s see what times are involved.

05:28 Since I set the timestamp as the index, you can see the earliest time in the data by calling min on the index. Same goes for max. Okay, let’s do some grouping.
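
Those two calls are simply:

```python
>>> df.index.min()  # earliest measurement timestamp
>>> df.index.max()  # latest measurement timestamp
```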

05:42 I may grumble a bit about the timestamp data type’s restrictions, but there are some beautiful features. For example, it has a whole bunch of methods that allow you to get at information about the date.

05:54 Let’s create a sequence with the names of the days of the week for each row.
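
That might look like this; the variable name day_names is mine:

```python
>>> day_names = df.index.day_name()  # an Index of weekday names, one per row
```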

06:02 Nice, huh? No parsing, no strftime(). Just call day_name() on the data and you’re good. And since the data it was based on was an index, so is the output.

06:14 Let’s take a look at it. Since the index has the same shape as the DataFrame, you can use this to do a grouping.
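
Averaging the carbon monoxide readings per weekday, the call might be:

```python
>>> df.groupby(day_names)["co"].mean()
```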

06:26 And there you go, grouping by days of the week. This feels kind of simple, but think for a second about what is going on here. You’ve created a new index based on a datetime operation on the existing index.

06:38 Then, because the new index is the same shape as the original, you’ve used it to form the grouping buckets on the DataFrame. It took me a bit to wrap my head around this.

06:47 The thing used to group the DataFrame doesn’t have to be from the DataFrame. As long as it’s the same shape, you’re fine. Let’s try that again with hours instead.
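
Again, the variable name here is my own:

```python
>>> hours = df.index.hour  # an Index of integer hours, one per row
```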

07:00 I’ve stored the hours, and that’s what it looks like. And now I can derive a new grouping based on our two derived sequences, combining them.
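
Combining the two derived sequences into one grouping:

```python
>>> df.groupby([day_names, hours])["co"].mean()
```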

07:17 I’m passing in a list with the weekday and hour. That forms a multi-index grouping, giving the carbon monoxide values for each hour of each day of the week.

07:26 There’s still too much data here, so pandas is chopping it for display, showing just the first five and the last five values. Note that the two index levels in the result are both named tstamp.

07:38 That’s not too pretty. Let’s fix it.
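
A sketch of the fix, with dow and hr as my guesses for the level names used on screen:

```python
>>> df.groupby([day_names, hours])["co"].mean().rename_axis(["dow", "hr"])
```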

07:50 The rename_axis() call allows the result to have more meaningful names. Sometimes you want to group data into categories. For example, splitting the temperature data up into three buckets.

08:02 The pandas module has a cut() function to do this kind of classification. I did import pandas in the program, but I haven’t done it in the REPL, so I have to do that here first.
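
The import and the call might look like this, using the temp_c name from the rename sketch earlier. The bin labels cool, warm, and hot are placeholders for whatever labels appear in the video:

```python
>>> import pandas as pd
>>> bins = pd.cut(df["temp_c"], bins=3, labels=("cool", "warm", "hot"))
```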

08:24 The cut() call takes a thing to cut. That’s the temperature column in this case. Then a number of bins, which is three, and optionally you can label the bins.

08:36 The end result is a series,

08:40 and remember that a series isn’t just a sequence, it’s a sequence with an index. Since this series was derived from a column of the DataFrame, it has the same index as the DataFrame, and since it has the same index as the DataFrame,

09:03 you can use it to do groupings. There’s a bit more going on here than groupings you’ve seen in the past. First, instead of grouping on the DataFrame, I’m grouping on a set of columns, the relative and absolute humidity columns.

09:16 That’s more complicated than the DataFrame on its own, but the result still uses the same index. As such, I can group by using the derived temperature buckets series.

09:27 That’s the split part, and just because it’s fun, I’ve also made the apply part a bit more complicated. The agg function is short for aggregate, and it lets you apply multiple aggregation functions to the data at a time.

09:41 Here I’m calculating both the mean and the median values.
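
Putting the split and the double aggregation together, with the humidity column names from the rename sketch:

```python
>>> df[["rel_hum", "abs_hum"]].groupby(bins).agg(["mean", "median"])
```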

09:46 The end result is a new DataFrame grouped using the temperature buckets, having the mean and median for each of the relative and absolute humidities. I wonder if we could get Absolut to sponsor us because I keep saying absolute.

09:59 I’ve gone from Indian food to vodka. Maybe I should go get something to eat.

10:05 Let’s try that again, but this time with some time stuff.
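
A sketch of what gets typed here; I’m assuming the renaming applies to the two grouping levels so they read year and quarter rather than tstamp and tstamp:

```python
>>> df.groupby([df.index.year, df.index.quarter])["co"].agg(
...     ["min", "max"]
... ).rename_axis(["year", "quarter"])
```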

10:23 It’s a lot of typing. This example’s similar to the previous one, but this time I’m operating on the whole DataFrame. I’m grouping using a derived multi-index based on the year and quarter from the timestamp.

10:35 Then using agg to get the min and max, and then for readability, I’ve renamed the columns. Because this is timestamp data, there is actually a faster way of doing what I just did.

10:47 The resample() method only works on time series data, and it uses the fact that there are time values to do a frequency conversion to arrive at new data. resample() takes a date offset or a string representing a date offset.

11:02 QE is the date offset that means data at the end of the quarter.
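
The equivalent resample call is just:

```python
>>> df.resample("QE")["co"].agg(["min", "max"])
```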

11:14 You end up with almost the same result, the only difference being that the timestamp shows the date at the end of the quarter instead of there being two columns. For large amounts of data, resampling will likely be significantly faster, and it sure is a lot less typing.

11:31 That’s it for the time series stuff. Next up, I’ll show you how to use lambdas in the apply stage of GroupBy.
