Performing Better
00:00 In the previous lesson, I showed you how you can use a lambda at the apply stage of the GroupBy operation. In this lesson, I’ll cover how the wrong choice of grouping can cost you performance-wise. pandas is a speedy little bear, and for the most part it’s faster than the code you’d have to write in Python to get similar results.
00:20 That being said, there are fast and then there are faster ways in pandas. And as pandas is often used with larger datasets, the right choice for your code can make a significant difference.
00:31 Let’s dive back into the REPL with the news data to see what I mean.
00:36 Alright, let me import the DataFrame
00:40 and now I’ll redo the lambda-based apply call I did in the previous lesson.
00:47 As a reminder, this is grouping by the outlet column, using the title column in the output, then applying the lambda to that series. The lambda looks for string matches on ‘Fed’, then counts the True values using sum. The output then gets chopped down using nlargest.
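For reference, a sketch of that call could look something like this. The loading step and the exact column names are assumptions about the news dataset rather than code shown on screen:

```python
import pandas as pd

# Assumed: the news headlines live in a CSV with "outlet" and "title"
# columns; swap in however you actually load the DataFrame.
df = pd.read_csv("news.csv")

# Group by outlet, take the title column, then apply a lambda that
# counts how many titles in each group mention "Fed".
fed_counts = df.groupby("outlet")["title"].apply(
    lambda ser: ser.str.contains("Fed").sum()
)

# Chop the output down to the ten outlets with the most mentions.
print(fed_counts.nlargest(10))
```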
01:06 And like the comment says, because the lambda is Python code, pandas has to pop up into the realm of Python to run it, leaving the lower-level library behind, and that costs you performance.
01:16 Perhaps there’s a more panda-centric way to do this same query. Let’s start by creating a column with the ‘Fed’ data.
01:30 This column does exactly what the lambda was doing, and if you recall from deconstructing it in the previous lesson, it results in a Series of True and False values.
01:42 This time, though, I’m still operating down at the pandas level.
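As a sketch, still assuming the df and column names from the snippet above, that new column-shaped Series might be built like this:

```python
# Build a Boolean Series: True wherever the headline mentions "Fed".
# str.contains is vectorized, so this stays down at the pandas level.
mentions_fed = df["title"].str.contains("Fed")
```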
01:48 This should look familiar from the last lesson. Remember, a Series is more than just a sequence: it has an index. That means columns and derived Series from the same data will have the same shape.
02:06 And if they have the same index values in the same shape,
02:17 then you can use them to do the GroupBy. This code does the exact same thing as the previous use of apply, but this time without the need for a lambda. I mentioned in a previous lesson that I sometimes struggle with this.
02:31 My mental model of a DataFrame as a spreadsheet means I should only be doing things on the spreadsheet. But the mental model is only a model and it’s not quite right.
02:41 The GroupBy doesn’t care what you’re grouping on, as long as it corresponds to the shape and index of the DataFrame. The result here will be significantly faster because you stay in pandas-land.
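A sketch of that pandas-level GroupBy, reusing the mentions_fed Series from above (column names still assumed):

```python
# Group the Boolean Series by the outlet column of the same DataFrame.
# Both share the same index, so pandas lines them up row by row.
fed_counts = mentions_fed.groupby(df["outlet"]).sum()

# Same top-ten chop as before; the counts come out as int64.
print(fed_counts.nlargest(10))
```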
02:52 Notice at the bottom of the results that the data type of the counter is int64. For efficiency, I can shrink that down, as none of the numbers are that big, and this will give you some more speed and less memory.
03:08 The smaller type I’m going to use is an unsigned int from NumPy, so I have to import it. pandas depends on NumPy, so it’s already installed.
03:25 And this is the same call as before, but with the addition of the astype method. This method converts the Series to whatever type you give it, which in this case is uintc.
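Roughly, that change might look like the following; np.uintc maps to a 32-bit unsigned integer on most platforms:

```python
import numpy as np

# Same query as before, but shrink the counts from int64 down to an
# unsigned C int (uint32 on most platforms) to save some memory.
fed_counts = mentions_fed.groupby(df["outlet"]).sum().astype(np.uintc)
print(fed_counts.nlargest(10))
```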
03:36 At the bottom, you can see that pandas refers to that as an unsigned 32-bit integer, which results in some memory savings for you. How much difference does staying in pandas-land give you?
03:48 Well, let’s write some code and figure that out.
03:54 I’ve written a short program here called news_perf.py. In it, I’m using the timeit module to time the two approaches that I’ve just shown you.
04:04 So first off, I import timeit so I can use it. Then I grab the DataFrame from the news program, same as I did in the REPL. In fact, this is one of the reasons I use the approach that I do with a small program.
04:15 Because that means I can then use the DataFrame in a bunch of different places and not have to write that code again. This first function uses the lambda apply approach that I showed you in the previous lesson.
04:26 It’s the exact same code as before, except I’ve added the observed=True argument to groupby in order to squash those problematic warnings.
04:36 And the same goes here for the second approach that I just showed you, using the mentions of ‘Fed’. Nested inside the f-string in this print call, I invoke timeit, running the run_apply function three times, then printing the result. And then I do the same, but with the run_vectorization function.
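The file itself isn’t reproduced in the transcript, but a minimal sketch along these lines is below. The way the DataFrame gets loaded, the column names, and the output formatting are assumptions; only the function names run_apply and run_vectorization, the observed=True argument, and the three timed runs come from the lesson:

```python
# news_perf.py -- a minimal sketch, not the exact file from the lesson.
import timeit

import pandas as pd

# Assumed loading step: the lesson grabs the DataFrame from its news
# program; substitute however you load your own copy of the data.
df = pd.read_csv("news.csv")


def run_apply():
    # Lambda-based apply: pandas pops up into Python for every group.
    return (
        df.groupby("outlet", observed=True)["title"]
        .apply(lambda ser: ser.str.contains("Fed").sum())
        .nlargest(10)
    )


def run_vectorization():
    # pandas-level version: group a Boolean Series by the outlet column.
    mentions_fed = df["title"].str.contains("Fed")
    return mentions_fed.groupby(df["outlet"], observed=True).sum().nlargest(10)


if __name__ == "__main__":
    runs = 3
    # timeit.timeit returns the total time for `runs` calls, so divide
    # to report an average per call, as described in the lesson.
    print(f"apply:         {timeit.timeit(run_apply, number=runs) / runs:.2f}s")
    print(f"vectorization: {timeit.timeit(run_vectorization, number=runs) / runs:.2f}s")
```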
04:55 Got it. Good. Let’s run this puppy. Bear, bear, I meant to say bear. Calling the program. Waiting.
05:08 Still waiting. Oh, and there you go. The values there are the average times in seconds. Seeing as I called timeit three times, the first one took just shy of six seconds and the second one took about half a second.
05:23 So the lambda version is roughly an order of magnitude slower than the pandas-land one. The moral of the story: stay in pandas-land if you can. I have a sudden urge to play Takenoko.
05:36 Ooh, deep cut, board gaming panda reference for the win.