Specifying Data Types
but the values look like dates. Dates can be represented in many ways, so Pandas played it safe: rather than assuming that anything that walks and talks like a date must be a date, it used the object type instead. However, Pandas provides the .to_datetime() function, which accepts a Series, such as a DataFrame column, and converts the values to datetimes (the datetime64 data type).
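As a rough sketch of that conversion, the snippet below builds a tiny stand-in DataFrame and converts its date strings. The column name "date_game", the sample values, and the date format are assumptions made for illustration and are not taken from the lesson's dataset.

```python
import pandas as pd

# Hypothetical stand-in data: the column name "date_game" and these
# values are assumptions for illustration only.
nba = pd.DataFrame({"date_game": ["11/01/1946", "11/02/1946", "11/02/1946"]})

# Before conversion, the column holds plain strings.
print(nba["date_game"].dtype)

# Passing an explicit format removes the month-first vs. day-first ambiguity.
nba["date_game"] = pd.to_datetime(nba["date_game"], format="%m/%d/%Y")

# After conversion, the column has a datetime64 dtype.
print(nba["date_game"].dtype)
```

Once the column holds real datetimes, date-specific operations such as nba["date_game"].dt.year become available.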
You’ve seen that some columns have only a few distinct values, for example the 'game_location' column. Again, this column is of type object. However, the values represent categories or classes: 'H' is home, 'A' is away, and 'N' is neutral. Thus, you can use the Categorical data type, which is specific to Pandas, to represent those values more efficiently.
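One way to make that change is sketched below. It uses a small made-up DataFrame rather than the real nba data, so the values are illustrative assumptions. Only the 'game_location' column name comes from the lesson.

```python
import pandas as pd

# Small made-up sample; only the column name comes from the lesson.
nba = pd.DataFrame({"game_location": ["H", "A", "A", "N", "H"]})

# Wrap the existing values in a Categorical and assign them back.
nba["game_location"] = pd.Categorical(nba["game_location"])

# The dtype is now category, and the distinct labels are stored once.
print(nba["game_location"].dtype)
print(nba["game_location"].cat.categories)
```

Calling nba["game_location"].astype("category") should give an equivalent result if you prefer a method call.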
And if you look at the 'game_location' column, you’ll see that the data type is now category. This has two advantages. First, look at the memory usage of the DataFrame with the category data type. You’ll see that it is lower than with the object data type.
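To see the memory difference for yourself, a comparison along these lines may help. The data is synthetic (three labels repeated many times), so the exact byte counts are not the lesson's numbers and will vary between Pandas versions.

```python
import pandas as pd

# Synthetic column: three labels repeated many times, stored as object.
as_object = pd.Series(["H", "A", "N"] * 10_000, dtype="object", name="game_location")

# The same values stored with the category dtype.
as_category = as_object.astype("category")

# deep=True also counts the memory held by the Python strings themselves.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

The saving comes from storing each distinct label once and keeping only a small integer code per row.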
Pandas can now make certain assumptions about the values and can therefore optimize storage. Another potential benefit of categories is that they can be ranked. The nba dataset won’t use this, because no one location is more important than another, and you can see this by looking at the dtype attribute of the column, which reports the categories as unordered.
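The short sketch below checks the ordering flag on a category column and, for contrast, builds an ordered one. The "sizes" series and its labels are hypothetical and are not part of the nba dataset. They are there only to show what ranking looks like when it does apply.

```python
import pandas as pd

# Made-up location values; categories are unordered by default.
game_location = pd.Series(["H", "A", "N", "H"], dtype="category")
print(game_location.dtype.ordered)

# Hypothetical example of categories that do have a natural ranking.
sizes = pd.Series(
    pd.Categorical(
        ["small", "large", "medium"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)

# With ordered categories, comparisons and min/max follow the declared order.
print(sizes.min(), sizes.max())
print((sizes > "small").tolist())
```

For unordered categories such as the game locations, calling min() or max() raises a TypeError instead, which is the behavior you want when no ranking exists.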
As you work with larger data sets, these small improvements add up fast. In a perfect world, data would be ready to use when we import it into Pandas. In the next lesson, you’ll see that this isn’t the case, and you’ll learn how to clean that data up and make it ready for exploratory data analysis.