How to Use multiprocessing.Pool()
In this lesson, you’ll dive deeper into how you can use
multiprocessing.Pool. It creates multiple Python processes in the background and spreads out your computations for you across multiple CPU cores so that they all happen in parallel without you needing to do anything.
You’ll import the
os module in order to add some more logging to your
transform() function so you can see what’s going on behind the scenes. You’ll use
os.getpid() to see which process is working on which record, so you’ll be able to see that the same processes are working on different data in different batches. You can even set the number of processes you want working at once.
Now, what is going on here? This is the magic of the
multiprocessing.Pool, because what it does is fan out: it creates multiple Python processes in the background and spreads this computation out for us across the different CPU cores, so it all happens in parallel and we don’t have to do anything. The result we get is exactly the same. The
multiprocessing.Pool fans out and does all these computations for us, applies the
transform() function, and then brings back the results and assembles the output data structure, so we get exactly the same result, which is pretty cool!
Now, there are a couple more parameters we can tweak here, and I really want to make sure that you see what’s going on behind the scenes. The first thing we’re going to do is add some more logging to this transform() function. When we run it with that logging, you can see several worker processes spin up, and they’re working on stuff in parallel. And then, they’re being reused. So, in the second batch, the same processes are again working on other data. And we can influence this behavior, so we can actually put a limit and say, “Well, I only want
1 process in this
Pool here.” And when I run this again, you can see here, well, now we have a single process that’s doing all of the computations. And again, we’re going to end up with a seven-second runtime.
If you look at the log here, you can see that it’s exactly the same process handling each record one by one, and there’s no parallel computation happening. And now, if I crank this up, I can say, “Well, we want
2 processes.” Now, if we run this again, we have two processes working in parallel and they’re processing these records in parallel, and now we get a little bit of a speedup.
And, of course, I can go crazy here and actually say, okay, I want a process for each record. The number of processes should be the number of records, and that way we can process all of them in parallel and really cut our time to complete down to about a second, which is about as long as it should take to process a single element.
But if you imagine that this was a call that was waiting on I/O, or it was waiting for a website download to complete, if this was a web scraper of sorts, you could do all of these things in parallel, and with the
multiprocessing.Pool you can really influence how these operations are performed across multiple CPUs,
or multiple CPU cores, and you have a lot of flexibility in configuring it. There’s other options here. For example, we could say, “Okay, I want
2 processes,” and there’s another setting that’s called
maxtasksperchild (max tasks per child), and I can say, “Well, I want
2 processes in parallel and I want each process to restart after it has completed a certain number of tasks.” So,
if you run this, we’re going to get a slightly different output here. Again, if we look at the process IDs, you can see that we start out with two worker processes doing some processing, and once each of them has completed its allotted number of tasks, it gets replaced by a fresh process with a new ID.
So, with the multiprocessing.Pool you can really influence how it’s distributing your calculations across multiple CPU cores. Of course, you can also leave out these settings, and what it’s going to do then is spawn as many processes as you have CPU cores.