Working With groupby() in Pandas
In this lesson you’ll meet the groupby()
method, which groups rows by the values in a column.
00:00
Next, we’re going to cover grouping. .groupby()
is a powerful tool that you can use to split a dataset into groups, work on each group individually, and then aggregate the results again. You can also use it to look at specific subsets of the data and see how things perform while one value is held constant. So, for example, with Kevin Durant’s data, we can see how he performs against various teams.
00:23
So, taking a look at the data from the original CSV import, we have a column called Opp
(opponent). So we can take a look at how many games he plays against each opponent and how he does against those opponents.
00:34
So, the number of field goals attempted, field goals scored, all sorts of data. What we’re going to do is use .groupby(): we’re going to take our data and group by our 'Opp' column, which holds a three-letter code that designates each team, and assign the result to group_by_opp.
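As a rough sketch of the grouping step just described: the small DataFrame below is a hypothetical stand-in for the CSV import, and every value in it is invented for illustration. The 'FG' column name is also an assumption; only 'Opp', 'FGA', data, and group_by_opp come from the lesson.

```python
import pandas as pd

# Hypothetical stand-in for the imported CSV; all values are invented.
data = pd.DataFrame({
    "Opp": ["ATL", "BOS", "ATL", "DEN"],
    "FG": [7, 9, 14, 10],      # field goals scored (assumed column name)
    "FGA": [18, 20, 22, 19],   # field goals attempted
})

# Split the rows into one group per opponent code.
group_by_opp = data.groupby("Opp")

# Iterating over the GroupBy object yields (key, sub-DataFrame) pairs.
for opp, games in group_by_opp:
    print(opp, len(games))
```

Nothing is computed at this point beyond the split itself. The aggregation happens only once a method such as .size() or .sum() is called on the grouped object.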
00:56
And then to see how many times Kevin Durant played against each particular team, all we have to do is call .size(), and that gives us the number of games played against each team.
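A minimal sketch of the .size() call, again on invented data (the values and the 'FG' column name are assumptions, not the lesson's dataset):

```python
import pandas as pd

# Hypothetical game log; values are invented for illustration.
data = pd.DataFrame({
    "Opp": ["ATL", "BOS", "ATL", "DEN", "BOS"],
    "FG": [7, 9, 14, 10, 8],
})

group_by_opp = data.groupby("Opp")

# .size() returns a Series indexed by opponent with the row count per group,
# which here is the number of games played against each team.
games_per_opp = group_by_opp.size()
print(games_per_opp)
```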
01:06
So Atlanta has played two, Boston has played two, Denver has played four, Dallas has played four, and those are probably the teams in his team’s conference, which they play against most frequently. Now, if you wanted to figure out how many shots he took against a particular team across the entire season, you’d simply change this to .sum(), which adds up all the values for each team. So as you can see, he’s scored 21 field goals against Atlanta over 40 attempts.
01:30 The percentage values are no longer meaningful, because percentages are not additive: summing per-game percentages doesn’t give you a season percentage. But all the values which are counts, such as offensive rebounds and defensive rebounds, make sense and aggregate well across a particular team in this dataset.
01:45 If you want to see how Kevin Durant performs against a particular team, all you do is group by the team and then aggregate all the sums for each of the columns, which is perfect.
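The point about percentages can be sketched as follows. The data is invented, the 'FG' and 'FG%' column names are assumptions, and recomputing the percentage from the summed counts is one reasonable fix rather than something the lesson itself does:

```python
import pandas as pd

# Hypothetical game log; values are invented for illustration.
data = pd.DataFrame({
    "Opp": ["ATL", "BOS", "ATL"],
    "FG": [7, 9, 14],
    "FGA": [18, 20, 22],
})
# Per-game shooting percentage, stored as a fraction.
data["FG%"] = (data["FG"] / data["FGA"]).round(3)

# Summing works for counts, but it also adds the per-game percentages.
totals = data.groupby("Opp")[["FG", "FGA", "FG%"]].sum()
summed_pct = totals["FG%"].copy()   # not a real percentage any more

# Recompute the percentage from the aggregated counts instead.
totals["FG%"] = totals["FG"] / totals["FGA"]
print(totals)
```

With these made-up numbers the summed percentage for ATL comes out above 1, which is a quick sign that the column cannot be read as a percentage after .sum().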
01:54
Let’s just double-check this result. We’re going to take our original dataset and write data.Opp == 'ATL'.
02:09
That is the Boolean condition that defines which rows to select, so we’re going to simply wrap it in data[],
02:16
and we’re going to select every row with the opponent as "ATL". We’re then going to print those values out onto the screen, so we’re simply going to return that dataset.
02:25 And that should give us the two times that Atlanta played against Kevin Durant, which are game 2 and game 24. And we can see that he scored 7 field goals in the first game and 14 in the second, which adds up to the 21 field goals we found in the grouped sums, and the attempts likewise add up to 40.
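The cross-check described above might look like this. The two ATL field goal counts (7 and 14) come from the lesson; the 'FG' column name, the split of the 40 attempts, and the BOS row are assumptions made up for the sketch:

```python
import pandas as pd

# Hypothetical game log; only the ATL field goal counts come from the lesson.
data = pd.DataFrame({
    "Opp": ["ATL", "BOS", "ATL"],
    "FG": [7, 9, 14],
    "FGA": [18, 20, 22],   # assumed split of the 40 attempts
})

# Boolean mask: True for every row where the opponent is Atlanta.
mask = data.Opp == "ATL"

# Indexing with the mask keeps only the matching rows.
atl_games = data[mask]
print(atl_games)

# Summing the slice by hand should agree with the grouped sum.
grouped = data.groupby("Opp")[["FG", "FGA"]].sum()
print(atl_games[["FG", "FGA"]].sum())
print(grouped.loc["ATL"])
```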
02:43 So, you can see that this is an excellent way to go about collecting data. This is another way to slice a particular dataset out so you can get the smaller chunks and then do your aggregation yourself.
02:56
Now, let’s see if we can graph field goals attempted and field goals scored for each team. We’re going to slice these out of our original dataset, so we’re going to create field_goal_per_team, and we’re going to call that our dataset,
03:19 and we’re going to slice out
03:23
'FGA'
(field goals attempted)
03:40
We’ll have an opponent column and we’ll have all these particular things. Excellent. So now we’re ready to start setting up a vincent
table to represent that.
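One way the field_goal_per_team table could be built is sketched below. The transcript skips part of this step, so the approach (selecting two columns from the grouped data and summing them) and the 'FG' column name are assumptions, and the data is invented:

```python
import pandas as pd

# Hypothetical game log; values are invented for illustration.
data = pd.DataFrame({
    "Opp": ["ATL", "BOS", "ATL", "DEN"],
    "FG": [7, 9, 14, 10],
    "FGA": [18, 20, 22, 19],
    "ORB": [1, 0, 2, 1],   # an extra column we do not want in the chart
})

# Keep only the two field goal columns and total them per opponent.
# The opponent code becomes the index of the resulting DataFrame.
field_goal_per_team = data.groupby("Opp")[["FG", "FGA"]].sum()
print(field_goal_per_team)
```

A table shaped like this, with one row per team and one column per series, is the usual input for a stacked bar chart.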
03:54
So let’s say we wanted a nice visual of a particular player’s field goal attempts against each team, in a bar graph that we could send to our friends so they can easily consume the data. What we’ll need to do is use vincent
to create a graph.
04:12
The type of graph we’re looking for is a stacked bar graph. It takes a DataFrame
like this one, which has two very similar sets of data (both are field goal counts), and stacks the values on top of one another, using the opponent as the key that determines which bar each value belongs to.
04:30
So, this DataFrame
that we have here isn’t easy to consume. You’d have to compare each team one by one to look at it. But if you’re interested in comparing how the player plays against a particular team across the season, what you want is a nice visualization to represent that so it’s easy to consume.