Getting to Know DataFrame Objects
01:04 Here, you can see that the average number of points scored in a game for this data set is 102 and that the highest scoring game had 186 points. But for some of the columns, it doesn’t make sense to take these stats.
.value_counts() method will count the number of times each value occurs in a particular column,
team_id in this example. So the team
BOS played the most games with
5997, and the team
SDS played the fewest with
6024 games and the franchise
60. Now, when the team franchise Lakers is mentioned, most people will think of the Los Angeles Lakers. In this data set, the team ID for the Los Angeles Lakers is
LAL, and the team
LAL only has 5,078 games.
So, who are the other 1,000 games played by? This can be done in several steps that can be expressed with a single line of code. First, you want to find all of the rows where the
You’ll also learn more about
.loc later in the course. But notice the number of rows returned. There are 6,024 rows returned and 6,024 values with
'Lakers' in the
'fran_id' column. Now, you don’t want all of the columns, just the
LAL team, the Los Angeles Lakers, played 5,078 games, but there was also a team called the Minnesota Lakers that played 946 games, for a total of 6,024 games played by the Lakers franchise.
It’s unlikely you have heard of the Minnesota Lakers, and here’s why. Get all of the rows that have a
'MNL'—that’s the ID for the Minnesota Lakers—and then get only the
'date_game' column, which is the date that the game was played.
As you can see, it’s been a while since the Minnesota Lakers have played basketball. But just how long? Intuitively, you would just get the most recent date or maximum value, and that’s stored in the
The problem is that the dates are stored as strings or objects in the
DataFrame. This causes some issues. First, here’s the code to get the maximum value from the
'date_game' column with the team ID of
This code works, and it works correctly. It returns the maximum value in the column, and the maximum value in the column is the string
'4/9/1959'. However, strings and dates are interpreted differently by Python, and Pandas does not implicitly cast a string in date format to a date.
So that you can compare the strings and dates, store the dates in a new column. Creating a new column on a
DataFrame is as simple as assigning to the column name. Now if you run the same code again but on the
'date_played' column, it returns a date of March 26, 1960. However, since the string representation of March 26, 1960, which would be
'3/26/1960', precedes the string
'4/9/1959', the latter is incorrectly returned as the latest date in the
The total number of points scored is 626,484 across the team history. But what was the average number of points per year? First, get the dates of the games played by the Boston Celtics, and you already know how to do this. You’re only interested in the years, so you can apply a
lambda function to each date.
07:28 Now you just saw another powerful feature of the Jupyter Notebook. You were able to modify a cell without retyping it. This lets you experiment and iterate rapidly, and that is a big part of exploratory data analysis.
Become a Member to join the conversation.