
Applying With Lambdas

00:00 In the previous lesson, I showed you how to do a variety of derived groupings with time series data. In this lesson, I’ll show you how to do derived groupings with lambdas.

00:10 New lesson, new set of data to play with. This time, the dataset is from a web scraper that gathered information from news sites. For a variety’s sake, this data is messy, but in a different way than the other data was messy. The CSV file doesn’t have a header and the time information is stored as a Unix epoch number.

00:30 That’s a count of time since January 1st, 1970, stored here in milliseconds, although a classic Unix epoch value is in seconds. Aren’t dates and times fun? And by the way, this technically isn’t a CSV file. It’s a TSV, a tab-separated values file.

00:43 Even if the creator named it .csv. I’m going to stick with my approach from before: I’ll write a short program to create the DataFrame, then import that into the REPL to play with it.

00:55 Here I’m grabbing the data from the CSV file. This argument changes the separation character from a comma to a tab. header=None tells pandas that this file has no header line and since the file already contains a value that can be used as an index, I’m telling pandas which column to use as the index value.

01:16 Since there is no header in the file, the names argument gets used to name the columns. You’ve seen the dtype argument before.

01:26 Several of the columns here have repeating content, so this switches them to categories for efficiency. And finally, there’s that pesky Unix epoch. The to_datetime parser knows how to parse these values.

01:39 The unit argument says what time units to use. ms tells it to expect a millisecond-resolution Unix epoch value. Now that I’ve got the DataFrame, let me go to the REPL,

01:54 import it and here it is. Lots of dot dot dots going on here. There are too many columns to show and the title column is too wide. And of course, this is showing the first five and the last five rows, so dot, dot, dot cubed.
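The loading steps just described can be sketched as follows. The real file and its columns aren’t shown in the lesson, so the column names and the inline TSV data here are assumptions for illustration; in practice you’d pass the file’s path to read_csv.

```python
import io

import pandas as pd

# Stand-in TSV content; the column names below are hypothetical
tsv_data = (
    "1\tFed raises rates\tReuters\t1579044084000\n"
    "2\tMarkets rally\tLA Times\t1579130484000\n"
)

df = pd.read_csv(
    io.StringIO(tsv_data),
    sep="\t",                      # tab-separated, despite any .csv name
    header=None,                   # the file has no header line
    index_col=0,                   # the first column already works as an index
    names=["id", "title", "outlet", "timestamp"],
    dtype={"outlet": "category"},  # repeating values: category for efficiency
)

# The timestamp column holds millisecond-resolution Unix epoch values
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
```

Note that names includes a name for the index column itself, since index_col picks it out of the same list.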

02:12 It’s a good habit to double-check the data types after you’ve created a DataFrame to make sure that they are what you expect them to be. In this case, I wanted to check that some of the columns were categories and that the timestamp is actually a datetime.
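That double-check is a one-liner. Here’s a minimal sketch on a toy frame standing in for the news data:

```python
import pandas as pd

# Toy frame mimicking the columns being checked above
df = pd.DataFrame(
    {
        "outlet": pd.Categorical(["Reuters", "LA Times"]),
        "timestamp": pd.to_datetime([1579044084000, 1579130484000], unit="ms"),
    }
)

# df.dtypes lists every column's type in one shot
print(df.dtypes)
```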

02:26 Let’s look at a single row. I’m going to use .iloc to do that.

02:32 Remember, .iloc uses Python-style indices, so .iloc[0] gives the first row. The actual index value for this row is one, so that’s not confusing at all.
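The positional-versus-label distinction looks like this on a hypothetical frame whose index starts at 1, like the news data:

```python
import pandas as pd

# Hypothetical frame with a 1-based index
df = pd.DataFrame(
    {"title": ["Fed raises rates", "Markets rally"]},
    index=[1, 2],
)

first = df.iloc[0]  # positional: the first row, whatever its label
same = df.loc[1]    # label-based: the row whose index value is 1
```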

02:44 Alright, that’s the data. Let’s do some grouping. So far you’ve only used built-in methods to do the apply stage. The general apply method allows you to write your own lambda for the apply operation.

02:56 Say you wanted to know which news outlets mentioned the Federal Reserve the most.

03:07 I’m grouping on outlet, selecting the title column, then calling apply.

03:18 The lambda provided to apply gets called with each subgroup’s series. This one checks which titles contain the word “Fed”, then applies sum() to count the matches in the subgrouping.

03:28 Of course, this might give some false positives, for example, Roger Federer, but it’s good enough for our purposes.

03:38 Then last, I chain the result to nlargest, which shows the 10 biggest counts of “Fed” mentions by outlet. Not surprisingly, Reuters, which is one of the largest contributors to the data, also has the most mentions of “Fed”.
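The whole chain can be sketched on a hypothetical miniature of the news data. The outlet names and titles here are made up, and I’m assuming a case-sensitive match on “Fed”:

```python
import pandas as pd

# A hypothetical miniature of the news data
df = pd.DataFrame(
    {
        "outlet": ["Reuters", "Reuters", "LA Times", "LA Times", "CNBC"],
        "title": [
            "Fed raises rates",
            "Fed chair speaks",
            "Markets rally",
            "Fed watchers react",
            "Tech stocks slide",
        ],
    }
)

# Count titles mentioning "Fed" per outlet, then keep the biggest counts
mentions = (
    df.groupby("outlet")["title"]
    .apply(lambda ser: ser.str.contains("Fed").sum())
    .nlargest(10)
)
print(mentions)
```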

03:53 There was a lot going on in that call. Let’s go through it stepwise.

04:06 Remember doing this with the Congress critters? I’m doing the grouping, iterating on the result, and using next() to get the first group. That group has two things: the outlet being grouped upon and the series with the subgroup.
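Peeking at the first group looks like this, again on a made-up miniature of the data:

```python
import pandas as pd

# Hypothetical miniature of the news data
df = pd.DataFrame(
    {
        "outlet": ["LA Times", "LA Times", "Reuters"],
        "title": ["Markets rally", "Fed watchers react", "Fed raises rates"],
    }
)

# groupby yields (key, subseries) pairs; next() grabs the first one
outlet, titles = next(iter(df.groupby("outlet")["title"]))
```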

04:18 The first group is the LA Times, and here are the first five article titles in that group. The lambda I used before did a comparison, so let’s do that next on this series.

04:36 contains() returns a Boolean Series, so the result here is the index paired with True or False, where True means the title had “Fed” in it and False means it didn’t.

04:46 Since the series contains Booleans, sum() counts the number of True items.

04:54 This works because sum() casts each value to a number, and Booleans cast to one for True and zero for False. The rest of the call repeats this process for each of the outlets, and then nlargest sorts it and gives the results you saw before.
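That contains-then-sum step in isolation, with a few made-up titles:

```python
import pandas as pd

titles = pd.Series(["Fed raises rates", "Markets rally", "Fed chair speaks"])

# str.contains() yields a Boolean Series...
flags = titles.str.contains("Fed")

# ...and sum() casts True to 1 and False to 0, counting the matches
count = flags.sum()
```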

05:10 Although this code isn’t necessarily the easiest thing to read, pandas is pretty powerful. Think of how much Python you’d have to write to do this same calculation, and of course, it’d be significantly slower as well.

05:24 Speaking of slower, let’s talk performance.
