Specifying Data Types

Explore Your Dataset With pandas Douglas Starnes 03:06

00:00 When you imported the nba DataFrame, Pandas attempted to infer the data type for each column based on its values. Take a look at the column data types again.

00:11 Like you’ve seen before, there are a number of columns with the data type of object. Generally, this is a catchall when Pandas can’t figure out a data type.
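As a rough illustration of that inference, here is a minimal sketch that uses a tiny, made-up stand-in for the nba DataFrame (the rows and the 'pts' values are assumptions, not the course data). Depending on the pandas version, text columns may be reported as object or as a dedicated string data type, but either way the date-like strings are not inferred as dates.

```python
import pandas as pd

# Hypothetical three-row stand-in for the nba DataFrame; values are made up.
games = pd.DataFrame({
    "date_game": ["11/01/1946", "11/02/1946", "11/05/1946"],
    "game_location": ["H", "A", "N"],
    "pts": [66, 68, 63],
})

# One inferred data type per column. The numeric column gets an integer
# dtype, while the two text columns fall back to a generic text/object dtype.
print(games.dtypes)
```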

00:22 You already saw an example of this earlier in the course with the 'date_game' column. Right now, it’s an object,

00:30 but the values look like dates. Since dates can be represented in many ways, Pandas played it safe: instead of assuming that if it talks like a date and walks like a date, it must be a date, it used an object instead. However, Pandas provides the pd.to_datetime() function, which accepts a Series, or a DataFrame column, and converts the values to datetime values stored with the datetime64 data type.
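The conversion could look something like the sketch below. The DataFrame is a made-up stand-in for nba, and the month/day/year format string is an assumption about how the dates are written, so adjust it to match the real data.

```python
import pandas as pd

# Hypothetical stand-in for the nba DataFrame; the dates are made up.
games = pd.DataFrame({"date_game": ["11/01/1946", "11/02/1946", "11/05/1946"]})

# pd.to_datetime() returns a new Series, so assign it back to the column.
# The explicit format is an assumption; it avoids ambiguous day/month guesses.
games["date_game"] = pd.to_datetime(games["date_game"], format="%m/%d/%Y")

print(games["date_game"].dtype)

# Datetime columns expose the .dt accessor for date parts.
print(games["date_game"].dt.year.tolist())
```

Once the column holds datetimes, you can filter by date range or pull out parts such as the year without any string handling.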

00:56 You’ve seen that some columns only have a few distinct values—for example, the 'game_location' column. Again, this column is of type object. However, the values represent categories or classes, as 'H' is home, 'A' is away, and 'N' is neutral. Thus, you can use the Categorical data type, which is specific to Pandas, to represent those values more efficiently.

01:22 And if you look at the 'game_location' column, you’ll see that the data type is now category. This has two advantages. First, look at the memory usage of the nba DataFrame with the category data type. You’ll see that it is lower than it was with the object data type.
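A minimal sketch of that conversion and the memory comparison, assuming a made-up 3,000-row stand-in for the nba DataFrame (the exact byte counts will differ from the course and between pandas versions):

```python
import pandas as pd

# Hypothetical stand-in for the nba DataFrame: 3,000 made-up rows.
games = pd.DataFrame({"game_location": ["H", "A", "N"] * 1000})

# deep=True counts the memory held by the string values themselves.
bytes_before = games["game_location"].memory_usage(deep=True)

# Replace the text column with a categorical version of itself.
games["game_location"] = pd.Categorical(games["game_location"])
bytes_after = games["game_location"].memory_usage(deep=True)

print(games["game_location"].dtype)
print(bytes_before, bytes_after)
```

The saving comes from storing each distinct value once and keeping only a small integer code per row.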

01:39 Pandas can make certain assumptions about the values now and can therefore optimize storage. Another potential benefit of categories is ranking them. Now, the nba dataset won’t use this because no one location is more important than the other, and you can see this by looking at the dtype attribute of the 'game_location' column.

01:59 Notice the ordered attribute is False. But suppose you had categories small, medium, and large, represented as ['S', 'M', 'L'].

02:08 If these were strings and you tried to compare them, then medium would be smaller than small, as 'M' precedes 'S' in the alphabet.
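Here is a small sketch of that string comparison next to an ordered categorical for contrast. The sample sizes are made up, and passing ordered=True to pd.Categorical is one way to declare the order, not necessarily the one used in the course.

```python
import pandas as pd

# Plain strings compare alphabetically, so "M" sorts before "S".
print("M" < "S")

# An ordered categorical compares by the declared order instead.
# The sample values below are made up for illustration.
sizes = pd.Series(
    pd.Categorical(
        ["M", "S", "L", "S"],
        categories=["S", "M", "L"],  # smallest to largest
        ordered=True,
    )
)

print(sizes.min(), sizes.max())
print((sizes > "S").tolist())
print(sizes.sort_values().tolist())
```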

02:16 But you could tell Pandas that these are ordered. There’s another column which is currently an object, and it could benefit from being a category.

02:24 The 'game_result' column stores whether the game was a win or a loss, represented as either a 'W' or an 'L'. This could be stored as a category instead.

02:36 Notice that the memory usage has decreased even more. While these savings might not seem like a lot right now, keep in mind that Pandas can handle much more data.
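To see the savings stack up, the sketch below converts two columns in turn and measures the whole DataFrame after each step. The data is a made-up 2,000-row stand-in for nba, and total_bytes() is a hypothetical helper written for this example.

```python
import pandas as pd

# Hypothetical stand-in for the nba DataFrame: 2,000 made-up rows.
games = pd.DataFrame({
    "game_location": ["H", "A"] * 1000,
    "game_result": ["W", "L"] * 1000,
})


def total_bytes(df):
    """Return the deep memory footprint of a DataFrame in bytes."""
    return int(df.memory_usage(deep=True).sum())


start = total_bytes(games)

games["game_location"] = pd.Categorical(games["game_location"])
after_location = total_bytes(games)

games["game_result"] = pd.Categorical(games["game_result"])
after_result = total_bytes(games)

# Each conversion should shrink the total a little further.
print(start, after_location, after_result)
```

Writing games["game_result"].astype("category") is an equivalent spelling when you don't need to specify the categories or their order.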

02:46 As you work with larger datasets, these small improvements add up fast. In a perfect world, data would be ready to use when we import it into Pandas. In the next lesson, you’ll see that that’s not the case, and you’ll learn how to clean that data up and make it ready for exploratory data analysis.

pnmcdos on April 7, 2022

So as I followed along, I discovered the pd.Categorical call is case sensitive while the pd.to_datetime is not. Any specific reason why that is? Will we just have to learn which ones are and are not case sensitive with time and practice? Or is there a more defined reason behind the case sensitivity other than just happenstance?

Bartosz Zaczyński RP Team on April 7, 2022

@pnmcdos Pandas follows the standard Python naming conventions. When you look closely at what pd.Categorical and pd.to_datetime are, then you’ll find out that one is a class while the other is a function:

>>> import pandas as pd
>>> pd.Categorical
<class 'pandas.core.arrays.categorical.Categorical'>
>>> pd.to_datetime
<function to_datetime at 0x7fc1db32a170>

Almost all names in Python, including function and variable names, will usually be written in lower case. Additionally, you will typically write compound names comprised of multiple words with snake_case. The two exceptions are class names, which use Pascal case (a variant of camelCase), and constants, which are in all upper case:

Variable: call_counter = 0
Constant: PI = 3.14
Function: def add_values(): ...
Class: class SingletonBeanFactoryLocator: ...

arthur55 on Feb. 10, 2024

When converting from the object to category datatype (c.1:16), how come you do not have to include the ‘inplace’ keyword argument, as for example when removing columns? This seemed to me like another time when you’d need to make the distinction between what is being called and what is being modified…
