Getting to Know DataFrame Objects

Explore Your Dataset With pandas Douglas Starnes 07:55

Transcript
Discussion (3)

00:00 Start off by calling the .info() method on the nba DataFrame. This reveals some interesting data about the dataset.

00:10 First, it lists all 23 columns. For each one, it provides the number of non-null values and the data type. So, gameorder is an integer and game_id is an object.

00:24 More on that in a second. At the bottom, you can see that of the 23 columns, 6 are floats, 7 are integers, and 10 are objects. Also, none of the columns have null values except for notes.

00:40 So, what is this object? Notice that some of the values in the DataFrame are strings. In Pandas, the raw values in a DataFrame are stored using NumPy data types.

00:53 There is no string data type in NumPy, so it uses the generic object. Going further, Pandas can compute statistics about the DataFrame with the .describe() method.

01:04 Here, you can see that the average number of points scored in a game for this data set is 102 and that the highest scoring game had 186 points. But for some of the columns, it doesn’t make sense to take these stats.

01:19 For example, the average year is not going to be useful, although you can see that the year range is from 1947 to 2015. And notice that the stats are computed only for the numeric columns.

01:31 The objects were omitted. Pandas is a great tool for exploratory data analysis. This means looking around the dataset and navigating it to answer questions.

01:43 Take a look at the number of games played by each team.

01:48 The .value_counts() method will count the number of times each value occurs in a particular column, team_id in this example. So the team BOS played the most games with 5997, and the team SDS played the fewest with 11.

02:06 Do the same with the 'fran_id' (franchise ID) column.

02:11 The franchise Lakers played 6024 games and the franchise Huskies played 60. Now, when the team franchise Lakers is mentioned, most people will think of the Los Angeles Lakers. In this data set, the team ID for the Los Angeles Lakers is LAL, and the team LAL only has 5,078 games.

02:34 So, who are the other 1,000 games played by? This can be done in several steps that can be expressed with a single line of code. First, you want to find all of the rows where the 'fran_id' is 'Lakers'.

02:48 This will simply return a Series of bools. You’ll learn more about Series later in the course, but for now, think of it as a list with an index.

02:57 The values indicate if that row had a value of 'Lakers' in the 'fran_id' column. You can now use those bools to filter the DataFrame with the .loc attribute.

03:10 You’ll also learn more about .loc later in the course. But notice the number of rows returned. There are 6,024 rows returned and 6,024 values with 'Lakers' in the 'fran_id' column. Now, you don’t want all of the columns, just the 'team_id'.

03:29 And while you can already see that there is more than one team associated with the Lakers franchise, use the .value_counts() method to count them.

03:39 Again, the LAL team, the Los Angeles Lakers, played 5,078 games, but there was also a team called the Minnesota Lakers that played 946 games, for a total of 6,024 games played by the Lakers franchise.

03:57 It’s unlikely you have heard of the Minnesota Lakers, and here’s why. Get all of the rows that have a 'team_id' of 'MNL'—that’s the ID for the Minnesota Lakers—and then get only the 'date_game' column, which is the date that the game was played.

04:15 As you can see, it’s been a while since the Minnesota Lakers have played basketball. But just how long? Intuitively, you would just get the most recent date or maximum value, and that’s stored in the 'date_game' column.

04:29 The problem is that the dates are stored as strings or objects in the DataFrame. This causes some issues. First, here’s the code to get the maximum value from the 'date_game' column with the team ID of 'MNL'.

04:44 This code works, and it works correctly. It returns the maximum value in the column, and the maximum value in the column is the string '4/9/1959'. However, strings and dates are interpreted differently by Python, and Pandas does not implicitly cast a string in date format to a date.

05:07 However, it provides a helper function to do just that. The pd.to_datetime() function will convert a column of strings into dates, assuming the strings are valid date formats.

05:21 So that you can compare the strings and dates, store the dates in a new column. Creating a new column on a DataFrame is as simple as assigning to the column name. Now if you run the same code again but on the 'date_played' column, it returns a date of March 26, 1960. However, since the string representation of March 26, 1960, which would be '3/26/1960', precedes the string '4/9/1959', the latter is incorrectly returned as the latest date in the 'date_game' column.

06:03 Also, notice that if you call nba.info() again, the date_played column uses the NumPy datetime64 data type.

06:14 Let’s try one more exercise. You can easily find the total number of points scored by the Boston Celtics by getting only rows where the 'team_id ' is 'BOS' and the sum of the 'pts' column.

06:29 The total number of points scored is 626,484 across the team history. But what was the average number of points per year? First, get the dates of the games played by the Boston Celtics, and you already know how to do this. You’re only interested in the years, so you can apply a lambda function to each date.

06:52 The .apply() function will pass each value to the lambda and then all you need to do is extract the .year attribute. To remove the duplicates, call the .unique() method,

07:06 take the length, and you’re almost finished.

07:11 To get the total number of points again and store it, you can just modify the cell that calls the .sum() method.

07:19 Now you can calculate the average points per year.

07:24 It comes up to about 9,000 points each year.

07:28 Now you just saw another powerful feature of the Jupyter Notebook. You were able to modify a cell without retyping it. This lets you experiment and iterate rapidly, and that is a big part of exploratory data analysis.

07:42 So keep your eyes open for ways to use this and save yourself some trouble. In the next lesson, you’ll dive under the hood and see how Pandas DataFrames are assembled.

Anil verma on Dec. 7, 2021

Why the dataset is used for NBA, not everyone in the world knows NBA. I am struggling right now to understand the columns meanings and the Douglas is going in full speed. What why how, no need… I don’t get it. I am trying to pay attention, but ends up in searching pandas docs…

Aldir on Jan. 7, 2022

Dear Douglas and Real Python team,

Between ~ 03:57 and 06:00 of this lesson (‘Getting to Know DataFrame Objects’), Douglas warns about a date conversion issue from string object to NumPy datetime64 when using the pd.to_datetime().

Per transcription of the video at 05:21 Douglas says: “string representation of March 26, 1960, which would be ‘3/26/1960’, PRECEDES the string ‘4/9/1959’, the latter is incorrectly returned as the latest date in the ‘date_game’ column”.

As date transformation is a very important subject in the life of a Data Scientist work,

a) I would appreciate if Douglas or the Real Python team could clarify what he means by ‘PRECEDE’ and ‘the latter is incorrectly returned’ (at 04:44 he said the code that delivered ‘the latter’ work correctly) in the phrase above, or at least point to a good quality explanation on the matter.

After all, a tool that delivers an error of 273 days after conversion is very concerning and it is not a very good selling point for a package.

b) Also, I would like to take to the attention that in the column “date_game”, date was shown as (Month/Day/Year) and after transformation to “date_played’ dates are shown as (Year-Month-Day). That can be easily verified by the reader with the following code. I think that is a very important information that should be taken to the reader attention in this introductory Pandas course.

date_game_X_date_played = nba[[‘date_game’, ‘date_played’]] date_game_X_date_played

Thank you in advance for the attention.

torrepreciado on Jan. 18, 2022

@Aldir I would say that, if the column ‘date_game’ is interpreted as a string, 4 is evaluated to be after 3, which is why 4/9/1959 is evaluated as greater than 3/26/1960. pd.datetime has no bug whatsoever, not sure where you took that from.

Become a Member to join the conversation.