Specifying Data Types

00:00 When you imported the nba DataFrame, Pandas attempted to infer the data type for each column based on its values. Take a look at the column data types again.

00:11 Like you’ve seen before, there are a number of columns with the data type of object. Generally, this is a catchall when Pandas can’t figure out a data type.
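If you don't have the nba dataset at hand, you can watch the inference happen on a tiny stand-in. This is only a sketch: the two rows below are made up for illustration, and only the column names come from the lesson. Depending on your Pandas version, the text columns are reported as object or as a dedicated string type.

```python
import io

import pandas as pd

# Made-up sample rows; the real nba DataFrame has far more rows and columns.
csv_text = (
    "date_game,game_location,game_result,pts\n"
    "11/1/1946,H,W,68\n"
    "11/2/1946,A,L,63\n"
)

sample = pd.read_csv(io.StringIO(csv_text))

# The numeric column is inferred as an integer type, while the
# date-like strings are left as text rather than parsed as dates.
print(sample.dtypes)
```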

00:22 You already saw an example of this earlier in the course with the 'date_game' column. Right now, it’s an object,

00:30 but the values look like dates. Since dates can be represented in many ways, Pandas played it safe: rather than assuming that anything that walks and talks like a date must be a date, it used object. However, Pandas provides the pd.to_datetime() function, which accepts a Series, such as a DataFrame column, and converts the values to Pandas datetimes, stored with a datetime64 data type.
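Here is a minimal sketch of that conversion. The date strings and the month/day/year format are assumptions made for illustration, so check how your own data is written before reusing the format argument.

```python
import pandas as pd

# Illustrative values only; the format of the real column is assumed here.
nba = pd.DataFrame(
    {"date_game": pd.Series(["11/1/1946", "11/2/1946", "11/2/1946"], dtype="object")}
)

# Passing an explicit format avoids any ambiguity between month and day.
nba["date_game"] = pd.to_datetime(nba["date_game"], format="%m/%d/%Y")

print(nba["date_game"].dtype)
print(nba["date_game"].dt.year.tolist())
```

Once the column holds datetimes, the .dt accessor gives you parts such as the year or month without any string handling.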

00:56 You’ve seen that some columns only have a few distinct values—for example, the 'game_location' column. Again, this column is of type object. However, the values represent categories or classes, as 'H' is home, 'A' is away, and 'N' is neutral. Thus, you can use the Categorical data type, which is specific to Pandas, to represent those values more efficiently.

01:22 After the conversion, if you look at the 'game_location' column, you’ll see that the data type is now category. This has two advantages. First, look at the memory usage of the nba DataFrame with the category data type. You’ll see that it is lower than it was with the object data type.
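You can try the conversion and the memory comparison on a small stand-in DataFrame. This is a sketch under assumptions: the rows are invented, and pd.Categorical() is one of several ways to make the conversion (calling .astype("category") is another).

```python
import pandas as pd

# Hypothetical stand-in for the nba DataFrame: one column, 10,000 rows.
nba = pd.DataFrame(
    {"game_location": pd.Series(["H", "A", "N", "H", "A"] * 2_000, dtype="object")}
)

# deep=True also counts the memory held by the Python string objects.
object_bytes = nba.memory_usage(deep=True).sum()

nba["game_location"] = pd.Categorical(nba["game_location"])
category_bytes = nba.memory_usage(deep=True).sum()

print(nba["game_location"].dtype)
print(object_bytes, category_bytes)
```

The saving comes from storing each row as a small integer code that points into a short list of categories, instead of one Python string per row.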

01:39 Pandas can make certain assumptions about the values now and can therefore optimize storage. Another potential benefit of categories is that they can be ordered. Now, the nba dataset won’t use this because no location ranks above another, and you can see this by looking at the dtype attribute of the 'game_location' column.
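A short sketch of that check, using a few invented values in place of the real column:

```python
import pandas as pd

# Invented values standing in for the 'game_location' column.
game_location = pd.Series(["H", "A", "N", "H"], dtype="object").astype("category")

# The dtype lists the categories and says whether they are ordered.
print(game_location.dtype)
print(game_location.dtype.ordered)
```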

01:59 Notice the ordered attribute is False. But suppose you had categories small, medium, and large, represented as ['S', 'M', 'L'].

02:08 If these were strings and you tried to compare them, then medium would be smaller than small, as 'M' precedes 'S' in the alphabet.

02:16 But you could tell Pandas that these are ordered. There’s another column that is currently an object and would benefit from being a category.
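Before moving on to that column, here is a sketch of the size example. The sizes Series is hypothetical, and pd.CategoricalDtype with ordered=True is one way to declare the order, shown here as an illustration.

```python
import pandas as pd

# Hypothetical data for the small/medium/large example.
sizes = pd.Series(["M", "S", "L", "M"], dtype="object")

# As plain strings, the comparison follows the alphabet.
print("M" < "S")

# List the categories from smallest to largest and mark them as ordered.
size_dtype = pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
ordered_sizes = sizes.astype(size_dtype)

print(ordered_sizes.dtype.ordered)
print((ordered_sizes > "S").tolist())
print(ordered_sizes.min(), ordered_sizes.max())
```

With the order declared, comparisons, min(), max(), and sorting follow the size ranking instead of the alphabet.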

02:24 The 'game_result' column stores if the game was a win or a loss, represented as either a 'W' or an 'L'. This could be stored as a category instead.
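The same idea can be sketched for 'game_result'. The DataFrame below is a made-up stand-in in which 'game_location' is already a category, so the comparison isolates the effect of converting the second column.

```python
import pandas as pd

rows = 5_000

# Made-up stand-in for nba; 'game_location' is already categorical.
nba = pd.DataFrame(
    {
        "game_location": pd.Categorical(["H", "A"] * rows),
        "game_result": pd.Series(["W", "L"] * rows, dtype="object"),
    }
)

before = nba.memory_usage(deep=True).sum()
nba["game_result"] = nba["game_result"].astype("category")
after = nba.memory_usage(deep=True).sum()

print(nba.dtypes)
print(before, after)
```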

02:36 Notice that the memory usage has decreased even more. While these savings might not seem like a lot right now, keep in mind that Pandas can handle much more data.

02:46 As you work with larger datasets, these small improvements add up fast. In a perfect world, data would be ready to use when we import it into Pandas. In the next lesson, you’ll see that this isn’t the case, and you’ll learn how to clean the data up and get it ready for exploratory data analysis.
