Specifying Data Types
00:00
When you imported the nba DataFrame, Pandas attempted to infer the data type for each column based on its values. Take a look at the column data types again.
00:11
As you’ve seen before, a number of columns have the data type object. Generally, this is a catchall that Pandas uses when it can’t figure out a more specific data type.
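As a rough illustration of that inference, here is a minimal sketch using a tiny stand-in DataFrame. The rows are made up for this example and are not the real nba data. Depending on your Pandas version, text columns may be reported as object or as a dedicated string type.

```python
import pandas as pd

# Hypothetical stand-in rows; the course loads the real nba DataFrame from a CSV file.
sample = pd.DataFrame(
    {
        "date_game": ["11/1/1946", "11/2/1946", "11/2/1946"],
        "game_location": ["H", "A", "H"],
        "pts": [66, 68, 63],
    }
)

# Numeric columns get a numeric dtype. Text columns, including the
# date-like strings, fall back to a generic text/object dtype.
print(sample.dtypes)
```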
00:22
You already saw an example of this earlier in the course with the 'date_game' column. Right now, it’s an object,
00:30
but the values look like dates. Since dates can be represented in many ways, Pandas played it safe: rather than assuming that anything that walks and talks like a date must be a date, it used object instead. However, Pandas provides the pd.to_datetime() function, which accepts a Series, such as a DataFrame column, and converts the values to Pandas datetime values (the datetime64 data type).
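A minimal sketch of that conversion, assuming date strings in month/day/year form. The sample values and the explicit format string are assumptions for this example, not taken from the course.

```python
import pandas as pd

# Hypothetical date strings standing in for the 'date_game' column.
dates = pd.Series(["11/1/1946", "11/2/1946", "11/3/1946"])

# An explicit format removes the ambiguity between month-first and day-first dates.
converted = pd.to_datetime(dates, format="%m/%d/%Y")

print(converted.dtype)             # a datetime64 dtype
print(converted.dt.year.tolist())  # date parts are now available through .dt
```

In the course's DataFrame, the result would typically be assigned back to the column so that the change sticks.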
00:56
You’ve seen that some columns only have a few distinct values, for example, the 'game_location' column. Again, this column is of type object. However, the values represent categories or classes: 'H' is home, 'A' is away, and 'N' is neutral. Thus, you can use the Categorical data type, which is specific to Pandas, to represent those values more efficiently.
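The conversion could look like the following sketch, which uses pd.Categorical on a small made-up column rather than the real nba data.

```python
import pandas as pd

# Hypothetical stand-in for the 'game_location' column.
df = pd.DataFrame({"game_location": ["H", "A", "N", "H", "A"]})

# Assigning the result back to the column is what changes the DataFrame.
df["game_location"] = pd.Categorical(df["game_location"])

print(df["game_location"].dtype)
print(df["game_location"].cat.categories.tolist())
```

An equivalent spelling is df["game_location"].astype("category"), which likewise returns a new object that has to be assigned back.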
01:22
And if you look at the 'game_location' column, you’ll see that the data type is now category. This has two advantages. First, look at the memory usage of the nba DataFrame with the category data type. You’ll see that it is lower than it was with the object data type.
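To see the difference without the full dataset, here is a small sketch that compares the two representations of the same made-up values. The exact byte counts depend on your platform and Pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical column with only three distinct values, repeated many times.
locations = pd.Series(["H", "A", "N"] * 10_000, dtype=object)

# deep=True also counts the memory of the Python string objects themselves.
object_bytes = locations.memory_usage(deep=True)
category_bytes = locations.astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

The category version stores each distinct value once and keeps a small integer code per row, which is where the savings come from.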
01:39
Pandas can make certain assumptions about the values now and can therefore optimize storage. Another potential benefit of categories is ranking them. Now, the nba dataset won’t use this because no one location is more important than another, and you can see this by looking at the dtype attribute of the 'game_location' column.
01:59
Notice the ordered attribute is False. But suppose you had categories small, medium, and large, represented as ['S', 'M', 'L'].
02:08
If these were strings and you tried to compare them, then medium would be smaller than small, as 'M' precedes 'S' in the alphabet.
02:16
But you could tell Pandas that these are ordered. There’s another column that is currently an object and could benefit from being a category.
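Returning to the small, medium, and large example for a moment, one way to declare the order is sketched below. The shirts Series is made up for illustration, and pd.CategoricalDtype is one of several ways to declare ordered categories.

```python
import pandas as pd

# Plain strings compare alphabetically, so "medium" sorts before "small".
print("M" < "S")  # True

# Declare the categories in their logical order and mark them as ordered.
size_dtype = pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
shirts = pd.Series(["M", "S", "L", "S"], dtype=size_dtype)

# Comparisons and sorting now follow the declared order, not the alphabet.
print(shirts.min(), shirts.max())
print((shirts > "S").tolist())
print(shirts.sort_values().tolist())
```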
02:24
The 'game_result' column stores whether the game was a win or a loss, represented as either a 'W' or an 'L'. This could be stored as a category instead.
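A sketch of how the savings accumulate when a second column is converted, again on made-up data rather than the real nba DataFrame:

```python
import pandas as pd

# Hypothetical stand-in with two low-cardinality text columns.
df = pd.DataFrame(
    {
        "game_location": pd.Series(["H", "A", "N", "H"] * 5_000, dtype=object),
        "game_result": pd.Series(["W", "L", "L", "W"] * 5_000, dtype=object),
    }
)

before = df.memory_usage(deep=True).sum()

df["game_location"] = pd.Categorical(df["game_location"])
after_location = df.memory_usage(deep=True).sum()

df["game_result"] = pd.Categorical(df["game_result"])
after_result = df.memory_usage(deep=True).sum()

# Each conversion lowers the total further.
print(before, after_location, after_result)
```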
02:36
Notice that the memory usage has decreased even more. While these savings might not seem like a lot right now, keep in mind that Pandas can handle much more data.
02:46
As you work with larger datasets, these small improvements add up fast. In a perfect world, data would be ready to use when you import it into Pandas. In the next lesson, you’ll see that that’s not the case, and you’ll learn how to clean that data up and make it ready for exploratory data analysis.
Bartosz Zaczyński RP Team on April 7, 2022
@pnmcdos Pandas follows the standard Python naming conventions. When you look closely at what pd.Categorical and pd.to_datetime are, you’ll find out that one is a class while the other is a function:
>>> import pandas as pd
>>> pd.Categorical
<class 'pandas.core.arrays.categorical.Categorical'>
>>> pd.to_datetime
<function to_datetime at 0x7fc1db32a170>
Most names in Python, including function and variable names, are written in lowercase. Additionally, you will typically write compound names made up of multiple words in snake_case. The two exceptions are class names, which use PascalCase (a variant of camelCase), and constants, which are in all uppercase:
- Variable: call_counter = 0
- Constant: PI = 3.14
- Function: def add_values(): ...
- Class: class SingletonBeanFactoryLocator: ...
arthur55 on Feb. 10, 2024
When converting from the object to category datatype (c.1:16), how come you do not have to include the ‘inplace’ keyword argument, as for example when removing columns? This seemed to me like another time when you’d need to make the distinction between what is being called and what is being modified…
pnmcdos on April 7, 2022
So as I followed along, I discovered the pd.Categorical call is case sensitive while the pd.to_datetime is not. Any specific reason why that is? Will we just have to learn which ones are and are not case sensitive with time and practice? Or is there a more defined reason behind the case sensitivity other than just happenstance?