How to Create a multiprocessing.Pool() Object
In this lesson, you’ll create a multiprocessing.Pool object. This is an interface that you can use to run your transform() function on your input data in parallel, spread out over multiple CPU cores. This Pool instance has a map() method, so you can map the transform() function over scientists.
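As a rough sketch of what this looks like in code, the example below maps a simplified transform() over a shortened scientists tuple, first with the built-in map() and then with Pool.map(). The three-record data set, the two-field Scientist tuple, and the removal of the sleep and print calls are assumptions made for brevity, not the lesson's exact script.

```python
import collections
import multiprocessing

# Defined at module level so worker processes can find it by name.
Scientist = collections.namedtuple('Scientist', ['name', 'born'])

# Shortened, illustrative data set (the lesson uses seven records).
scientists = (
    Scientist(name='Ada Lovelace', born=1815),
    Scientist(name='Emmy Noether', born=1882),
    Scientist(name='Marie Curie', born=1867),
)


def transform(x):
    # Simplified stand-in for the lesson's transform(): no sleep, no printing.
    return {'name': x.name, 'age': 2017 - x.born}


if __name__ == '__main__':
    # Sequential version: one record after another in a single process.
    sequential = list(map(transform, scientists))

    # Parallel version: Pool() starts worker processes and
    # pool.map() spreads the records across them.
    with multiprocessing.Pool() as pool:
        parallel = pool.map(transform, scientists)

    # pool.map() returns a list in the same order as the input,
    # so the two results can be compared directly.
    print(sequential == parallel)
```

Keeping the pool creation under the if __name__ == '__main__': guard matters on platforms where worker processes re-import the main module.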
Now, when you run your program, you’ll see that you get the same result, but you get it a lot faster. This happens because the records are processed in two parallel batches instead of one at a time. In the next lesson, you’ll keep working with multiprocessing.Pool().
00:00 So, what I’m going to do now is, we’re going to replace the sequential step here. We’re going to replace it with some multiprocessing code. What we need to do here, first, is we need to create a multiprocessing.Pool object and we need to store that somewhere. A multiprocessing.Pool, it’s basically an interface that we can use to run our transformation, or our transform() function, on this input
00:29 data in parallel, spread out across multiple CPU cores. This Pool instance, it has a .map() function. I can say, “Okay, we’re going to map the transform() function over the scientists,
00:46 and this is our result.” This corresponds exactly to the sequential map() function call, here.
00:54 I’m just going to clean it up a little bit and maybe bump up the font size again for you to see. And now, if we run this, what do you think is going to happen?
01:03 So remember, before, this took about seven seconds to execute. If we run this again, now—well, we’re getting a way different output, right? It looks like we’re actually starting the processing here for four records all at once, and then they all complete as a batch of four, and we have another three—I guess that’s the remaining ones—and then those complete as well.
01:26 We get the same result, but we get it a lot faster. Previously, it took us seven seconds, now we did it in two seconds. That happens because, well, we have these two batches here, essentially.
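The timing described above can be approximated with a little arithmetic. The sketch below assumes four worker processes (the output suggests four records start at once), one second of work per record as in transform(), and no start-up or communication overhead. The function name estimate_seconds is hypothetical and exists only for this illustration.

```python
import math


def estimate_seconds(records, workers, seconds_per_record=1.0):
    # Each worker handles one record at a time, so the records are
    # processed in ceil(records / workers) rounds.
    rounds = math.ceil(records / workers)
    return rounds * seconds_per_record


# Sequential run: a single worker, so seven rounds of one second each.
print(estimate_seconds(records=7, workers=1))  # 7.0

# Four workers: a round of four records, then a round of three.
print(estimate_seconds(records=7, workers=4))  # 2.0
```

Real runs take a little longer than this estimate, because starting worker processes and passing data between them adds overhead.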
cdrr930725 on Dec. 1, 2019
I fixed it by putting all the code inside an if __name__ == '__main__': block.
linusblady on March 27, 2020
cdrr930725 thanks for the tip. With if __name__ == '__main__': it runs. If you use the standard IDLE, the print statements in the function will not be printed.
dorellaurent on April 8, 2020
Hello, I’m on Windows 7. I work with IDLE. When I run the script, nothing is printed in the IDLE shell window. I tried with the if __name__ == '__main__': part and the issue was the same…
renatoamreis on April 27, 2020
I have exactly the same problem as reported above. I also tried with if __name__ == '__main__':
Dan Bader RP Team on April 27, 2020
Quick update on running these examples with IDLE (or presumably also other REPL environments):
You’ll probably run into issues if you don’t run the examples with python your_script_name.py from the command line, like I do in the video.
There are known issues with multiprocessing and IDLE (see this StackOverflow discussion for example).
norcal618 on May 28, 2020
I was able to get the multiprocessing stuff to run by putting some of the code into a function, like so:
def run():
    start = time.time()
    pool = multiprocessing.Pool()
    result = pool.map(transform, scientists)
    end = time.time()
    print(f"\nTime to complete: {end - start:.2f}\n")
    pprint(result)

if __name__ == '__main__':
    run()
And the remaining code is left as shown in the video.
Arif Zuhairi on Oct. 9, 2020
I think that’s because Dan runs it on a Mac and we run it on Windows, where the terminal/command prompt goes crazy. Thank you for the fix.
Lucy on Oct. 13, 2020
Hi. I’m following your comments and fixes very closely, but I’m still getting this error when using multiprocessing:
pickle.PicklingError: Can't pickle <class '__main__.Scientist'>: it's not the same object as __main__.Scientist
I also tried norcal618’s code, but got the same error. I need to understand what is happening to continue learning and advancing. Thanks in advance for any help.
Daniel on April 12, 2021
Hi Lucy. I was getting a similar error to yours when working with Python 3.8 on macOS.
To solve it, you need to wrap almost all of the tutor’s code within an if __name__ == '__main__': clause.
The only thing you need to leave outside of that if __name__ ... clause is the line where we define the Scientist namedtuple.
It’s important to do so. Otherwise, you’ll get the pickling error.
Here’s a working script. Note that I used a with statement to wrap the multiprocessing.Pool() stage. It’s not mandatory to do that, but I like it better that way.
Daniel
PS: Here’s an explanation of why you need to put the namedtuple declaration outside of the if __name__ ... clause: stackoverflow.com/a/16377267/8909331
And if you have some experience programming, you might be able to follow this explanation: codefying.com/2019/05/04/dont-get-in-a-pickle-with-a-namedtuple/
import collections
import multiprocessing
import time
from pprint import pprint

Scientist = collections.namedtuple('Scientist', [
    'name',
    'field',
    'born',
    'nobel'
])

def transform(x):
    print(f'Processing record {x.name}')
    time.sleep(1)
    result = {'name': x.name, 'age': 2017 - x.born}
    print(f'Done processing {x.name}')
    return result

if __name__ == '__main__':
    scientists = (
        Scientist(name='Ada Lovelace', field='math', born=1815, nobel=False),
        Scientist(name='Emmy Noether', field='math', born=1882, nobel=False),
        Scientist(name='Marie Curie', field='physics', born=1867, nobel=True),
        Scientist(name='Tu Youyou', field='chemistry', born=1930, nobel=True),
        Scientist(name='Ada Yonath', field='chemistry', born=1939, nobel=True),
        Scientist(name='Vera Rubin', field='astronomy', born=1928, nobel=False),
        Scientist(name='Sally Ride', field='physics', born=1951, nobel=False),
    )
    pprint(scientists)
    print()
    start = time.time()
    with multiprocessing.Pool() as pool:
        result = pool.map(transform, scientists)
    end = time.time()
    print(f'\nTime to complete: {end - start:.2f}s\n')
    pprint(result)
Anand on June 8, 2021
This never gets executed in a Jupyter notebook. Interpreter: Python 3.x
from immutable_data import scientists
import time
import multiprocessing

def transform(x):
    print(f"Processing record {x.name}")
    time.sleep(1)
    result = {"name": x.name, "age": 2021 - x.born}
    print(f"Done processing record {x.name}")
    return result

if __name__ == '__main__':
    start = time.time()
    pool = multiprocessing.Pool()
    pool.map(transform, scientists)
    end = time.time()
    print(f'\nTime to complete: {end - start:.2f}')
Any solution?
Dan Bader RP Team on June 9, 2021
@Anand: What happens when you run this in a standalone script? Looks like using multiprocessing from within a Jupyter notebook is generally bug-prone, e.g. see this thread here (plus related links): github.com/microsoft/vscode-jupyter/issues/941
Anand on June 9, 2021
@Dan: It worked perfectly outside Jupyter.
Processing record Ada Lovelace
Processing record Emmy Noether
Processing record Marie Curie
Processing record Tu Youyou
Done processing record Ada Lovelace
Processing record Ada Yonath
Done processing record Emmy Noether
Processing record Vera Rubin
Done processing record Marie Curie
Processing record Sally Ride
Done processing record Tu Youyou
Done processing record Ada Yonath
Done processing record Vera Rubin
Done processing record Sally Ride
Time to complete: 2.22
Thanks for the clarification.
cdrr930725 on Dec. 1, 2019
When I run the code, my terminal starts going crazy, and I never get the desired output. My code is a replica of the lecture code. Please check the link below:
codeshare.io/2B4mxy