Use Categorical Data to Save Time and Space
Have you ever tried to process a large DataFrame, only to have an operation hang for more than a few seconds?
Then you may want to try one of pandas’ powerful features: the Categorical dtype.
In this lesson you’ll learn how to use the Categorical dtype to save time and space.
00:00 Now it’s time to learn what categorical data is and how you can take advantage of it to make your programs run better. In an earlier video, you learned how to use the .str (string) and .dt (datetime) accessor methods to work with Pandas DataFrames and Series.
00:15 I mentioned that the categorical version would get its own video, and here it is. If you find yourself working with large DataFrames, you may have noticed that functions take a while to run on them. This can be due to the number of items that need to be passed into the function, or just the sheer size of the DataFrame itself in memory. In certain cases, categorical data can cut this down considerably.
00:37 So let’s take a look at this Series here. I’ve already typed it out. If you notice, these are pretty long strings here, and there are a couple of repeats. 'burnt orange' appears a couple of times, 'rose' appears a couple of times. But as it’s written, Pandas will have to treat each of these individually and store each of them individually.
00:57 Let’s open up the interpreter and see if there are some things we can do to fix this. Open up Python,
01:07 import pandas as pd, and I’m just going to paste that Series in.
01:15 So, if I call colors, everything works. So, what if we wanted to replace each of these names here with something like an integer that would take up a lot less space?
01:27 Give 'periwinkle' a value of 0 (that doesn’t appear anymore). 'mint green', a value of 1, and it shows up again here, so then this would be 1 as well.
01:37 And so on. We can try that out. One way is to say mapper = {v: k for k, v in enumerate(colors.unique())}. So, there’s a lot going on in this line. Let’s break it down.
02:01 Let me run it real quick. For each unique color in that Series, enumerate() is going to assign an integer, in order. That integer will become k and the color will be v. mapper will now be a dictionary that has the colors as keys and returns the integer that we’re looking for as the value.
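The mapper line can be tried on its own. This is a minimal sketch that assumes the Series is the one posted in the comments under this lesson; the Series typed in the video may differ slightly.

```python
import pandas as pd

# Assumption: this is the Series posted in the comments below the lesson.
colors = pd.Series([
    'periwinkle', 'mint green', 'burnt orange', 'periwinkle',
    'burnt orange', 'rose', 'rose', 'mint green', 'rose', 'navy',
])

# colors.unique() returns each color once, in order of first appearance.
# enumerate() pairs each one with a position (k), and the comprehension
# flips the pair so that the color (v) becomes the dictionary key.
mapper = {v: k for k, v in enumerate(colors.unique())}

print(mapper)
# {'periwinkle': 0, 'mint green': 1, 'burnt orange': 2, 'rose': 3, 'navy': 4}
```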
02:23 So you can just call mapper and see what you get. So if you put 'periwinkle' in, you’ll get a 0. 'mint green', you get 1, 'burnt orange', you get 2, and so on.
02:33 So now if you want to make a new Series, you could say something like as_int = colors.map(mapper) to map that mapper dictionary onto it. And now if you take a look at as_int, rather than having the color names, you have the color integers that you assigned with mapper.
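Here is a small sketch of that mapping step, again assuming the Series from the comments under this lesson. It rebuilds the mapper dictionary so the snippet runs by itself.

```python
import pandas as pd

# Assumption: this is the Series posted in the comments below the lesson.
colors = pd.Series([
    'periwinkle', 'mint green', 'burnt orange', 'periwinkle',
    'burnt orange', 'rose', 'rose', 'mint green', 'rose', 'navy',
])

# Color name -> integer, numbered in order of first appearance.
mapper = {v: k for k, v in enumerate(colors.unique())}

# Series.map() looks up every element in the dictionary and
# returns a new Series holding the integers instead of the names.
as_int = colors.map(mapper)

print(as_int.tolist())
# [0, 1, 2, 0, 2, 3, 3, 1, 3, 4]
```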
02:52 So if you can imagine, this is going to take up a lot less space than that other Series would. Unfortunately, you have to remember to make this mapper dictionary and then call that on the Series. If you want to make a new Series, we’ll just call this ccolors, for categorized colors, and set this equal to colors, cast this .astype('category'), like so. Now you can actually call that .cat accessor function on this, so if you did something like ccolors.cat and then wanted to return the .categories,
03:31 you’d get an Index here. Note that our mapper ordered everything by the order it appeared in the Series, and this is alphabetical. If you wanted to return something that was very similar to what we did with mapper, though, you could just say ccolors.cat, and then return the .codes.
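The cast and the two .cat attributes mentioned here can be sketched as follows, assuming the Series from the comments under this lesson. Note how the codes follow the alphabetical category order rather than the order of first appearance.

```python
import pandas as pd

# Assumption: this is the Series posted in the comments below the lesson.
colors = pd.Series([
    'periwinkle', 'mint green', 'burnt orange', 'periwinkle',
    'burnt orange', 'rose', 'rose', 'mint green', 'rose', 'navy',
])

# Cast the string Series to the categorical dtype.
ccolors = colors.astype('category')

# .cat.categories holds each unique value once, sorted alphabetically.
print(list(ccolors.cat.categories))
# ['burnt orange', 'mint green', 'navy', 'periwinkle', 'rose']

# .cat.codes holds one small integer per row, pointing into the categories.
print(ccolors.cat.codes.tolist())
# [3, 1, 0, 3, 0, 4, 4, 1, 4, 2]
```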
03:49 And now you have these integer values that match up with the categories up here. This can be a good way to have columns in your DataFrame or your Series that take up a lot less memory. A nice thing, too, is that a lot of the methods that you’ll use will work on these underlying categories themselves, so rather than changing these values in every single row, you’re just working with integers.
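To get a rough feel for the memory claim, one option is to compare memory_usage(deep=True) before and after the cast. This sketch assumes the Series from the comments under this lesson and repeats it 1,000 times, an arbitrary factor chosen only to make the Series large enough for the difference to show. The exact byte counts depend on your pandas version, so none are quoted here.

```python
import pandas as pd

# Assumption: this is the Series posted in the comments below the lesson.
colors = pd.Series([
    'periwinkle', 'mint green', 'burnt orange', 'periwinkle',
    'burnt orange', 'rose', 'rose', 'mint green', 'rose', 'navy',
])

# Repeat the ten rows 1,000 times (arbitrary factor for illustration).
big = pd.concat([colors] * 1_000, ignore_index=True)
big_cat = big.astype('category')

# deep=True also counts the memory held by the strings themselves.
bytes_as_text = big.memory_usage(deep=True)
bytes_as_category = big_cat.memory_usage(deep=True)

print(f'as text:     {bytes_as_text:>9,} bytes')
print(f'as category: {bytes_as_category:>9,} bytes')
```

The categorical version stores each distinct string only once and keeps a small integer code per row, which is where the saving comes from when there are few unique values.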
04:14 The only problem with turning these into categories is that you lose some flexibility. Let’s say you had something like ccolors and you wanted to insert a new item that didn’t exist.
04:33 You can see that doing things the normal Pandas way is not as friendly and gives you quite the error. Instead, you’ll have to take that Series,
04:46 then actually call .cat and then .add_categories(),
04:51 then pass in a list of what you’d like.
04:58 And now if you wanted to insert that value like we did before,
05:09 there’s no error. So, categories might not be the best solution if you’re going to be changing the number of unique values in your Series or DataFrame, but if you’re working with a large data set and you have columns that only have a few unique values, they can be very helpful.
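The failed insert and the add_categories() fix can be sketched like this. The Series is assumed to be the one from the comments under this lesson, and both the new value 'a new color' and the row position 5 are hypothetical, since the transcript does not say which value or row was used in the video.

```python
import pandas as pd

# Assumption: this is the Series posted in the comments below the lesson.
colors = pd.Series([
    'periwinkle', 'mint green', 'burnt orange', 'periwinkle',
    'burnt orange', 'rose', 'rose', 'mint green', 'rose', 'navy',
])
ccolors = colors.astype('category')

new_color = 'a new color'  # hypothetical value, not taken from the video

# Assigning a value that is not an existing category raises an error.
# Recent pandas versions raise TypeError; ValueError is caught as well
# in case an older version behaves differently.
try:
    ccolors.iloc[5] = new_color
    assignment_failed = False
except (TypeError, ValueError) as err:
    assignment_failed = True
    print(f'{type(err).__name__}: {err}')

# Register the new category first, then the same assignment works.
ccolors = ccolors.cat.add_categories([new_color])
ccolors.iloc[5] = new_color

print(ccolors.iloc[5])
print(list(ccolors.cat.categories))
```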
05:27 Like the other accessor methods, this is a somewhat confusing topic, so try to find places you can use these and get as much practice as you can. Thanks for watching.
ᴙɘɘᴙgYmɘᴙɘj on Sept. 14, 2021
colors = pd.Series([
    'periwinkle',
    'mint green',
    'burnt orange',
    'periwinkle',
    'burnt orange',
    'rose',
    'rose',
    'mint green',
    'rose',
    'navy',
])
Matt Williams on Aug. 18, 2020
Seems like the to_datetime method, which was used to re-index the DataFrame in the 5th video, was able to parse the YYYY-MM-DD format on its own. Is this explicitly true, or does that method require data to be supplied in that order?