CPU Bound Workloads

00:00 In the previous lesson, I introduced you to the multiprocessing library in Python. In this lesson, I’m going to show you the difference between I/O-bound and CPU-bound workloads.

00:11 So far, you’ve seen how to use the threading, asyncio, and multiprocessing libraries. All of the examples up until now have been I/O-bound. In an I/O-bound program, the vast majority of the time the program spends running is in waiting for input and output. Downloading content from the web is a perfect example.

00:31 The amount of latency involved in network transactions far outweighs the computation involved in downloading it. Not all software is I/O-bound. If you’re doing heavy mathematical computations, then most of the time is spent in computation.

00:47 The threading and asyncio libraries only see speed-up in I/O-bound cases, and that’s because they are scheduling computation while the other threads are waiting for the I/O.

00:59 You get speed-up because you’re not waiting for one piece of I/O at a time—you’re waiting for multiple pieces of I/O at a time. In the examples so far, you see about a seven times improvement against the best case scenario of a synchronous program when downloading 160 items from the internet.

01:17 You may have noticed that this speed-up is about the same across threading, asyncio, and even the multiprocessing library.

01:26 If your program instead is doing something that requires a lot of work inside of the CPU rather than waiting on a peripheral, then threading and asyncio are not going to give you that kind of improvement. To get improvement in CPU-bound software, you need multiple CPUs working simultaneously.

01:44 The multiprocessing library does allow you to do this. In the next examples, I’ll show you the differences.

01:52 This is the synchronous version of a CPU-bound program. It’s a bit of a convoluted example. It calculates the sum of the squares of the numbers from zero up to a large limit, and in this case, it does that 20 times for the numbers 5,000,000 through 5,000,020. Line 4 defines the function where the calculation of the sum of the squares happens.

02:15 Line 7 is the wrapper function that uses a loop to call the calculation once for each of the numbers being passed in. And then inside of the __main__ section, line 12 defines a list—the numbers 5,000,000 through 5,000,020 to run this calculation on.
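
The code itself isn’t reproduced in the transcript, but a minimal sketch of the synchronous version, using the calculate() and find_sums() names mentioned in the lesson, might look like this (the exact code on screen may differ in detail):

import time


def calculate(limit):
    # CPU-bound work: the sum of the squares from zero up to the limit.
    return sum(i * i for i in range(limit))


def find_sums(numbers):
    # Plain loop: one calculation at a time, one after another.
    for number in numbers:
        calculate(number)


if __name__ == "__main__":
    # Twenty large numbers, starting at 5,000,000.
    numbers = [5_000_000 + x for x in range(20)]

    start = time.time()
    find_sums(numbers)
    print(f"Duration {time.time() - start} seconds")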

02:32 Let’s see this in practice.

02:39 I sped that up so you didn’t have to sit there and watch it, but it took just under 9 and a half seconds. You may recall from the synchronous version of the download program that there was a lot of variation in how long it took the program to run, and that was because you’re dependent on the network, the response of the server, how many other people are hitting the server at the same time, and all sorts of things outside of your computer.

03:02 By contrast, CPU-bound programs tend to be consistent. Let me run this again so you can see it.

03:13 Again, I sped that up but it took about 9.4 seconds, so roughly the same amount of time. Because this program is spending all of its time exercising the CPU, the only thing that might slow it down is if the operating system is swapping it out for something else. It’s not having to wait on any I/O.

03:33 And now, here’s a threaded version of the software. The calculate() method does the same thing, the sum of squares. find_sums() has changed slightly. In this case, instead of just looping through all of the numbers, it creates a thread—one for each of the 20 numbers to be executed on. It then maps that number to the calculate() method, so the numbers 5,000,000 through 5,000,020 will each get their own thread for the calculation. Let’s see what happens.
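
Again, as a rough sketch rather than the exact code on screen, the threaded version being described might look something like this, with find_sums() creating one worker thread per number:

import time
from concurrent.futures import ThreadPoolExecutor


def calculate(limit):
    # Same CPU-bound work as the synchronous version.
    return sum(i * i for i in range(limit))


def find_sums(numbers):
    # One worker thread per number, each mapped onto calculate().
    with ThreadPoolExecutor(max_workers=len(numbers)) as executor:
        executor.map(calculate, numbers)


if __name__ == "__main__":
    numbers = [5_000_000 + x for x in range(20)]

    start = time.time()
    find_sums(numbers)
    print(f"Duration {time.time() - start} seconds")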

04:06 Well, that’s not good. Because this program really isn’t I/O-bound, once it’s loaded into memory, all it’s going to do is do calculations on the CPU. All of the speed-up before was by taking advantage of the latencies of going to the peripherals. That’s not happening in this case.

04:25 The reason this is taking longer is because there’s overhead in creating and managing the threads. In this case, you’re spending about half a second to manage the threads and getting no advantage whatsoever. To speed up something that’s CPU-bound, you’ll actually need more CPUs. Threads or async I/O—which are essentially the same kind of concept—will not give you any speed-up, and in fact can cost you time.

04:52 This is the multiprocessing version. The code is essentially the same as the threading version, but lines 9 and 10 replace the threading Pool with the multiprocessing.Pool.
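
As a sketch of that change, only find_sums() needs to differ, swapping the thread pool for a multiprocessing.Pool (again, the on-screen code may differ slightly):

import multiprocessing
import time


def calculate(limit):
    return sum(i * i for i in range(limit))


def find_sums(numbers):
    # A pool of worker processes, one per CPU core by default,
    # so the calculations can run on multiple CPUs at once.
    with multiprocessing.Pool() as pool:
        pool.map(calculate, numbers)


if __name__ == "__main__":
    numbers = [5_000_000 + x for x in range(20)]

    start = time.time()
    find_sums(numbers)
    print(f"Duration {time.time() - start} seconds")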

05:03 Here I go.

05:08 That’s significantly better. It’s a little better than a three times improvement over the synchronous version. Because this is CPU-bound, throwing more CPUs at it does speed things up. The computer that I’m running this on has four CPUs. Notice that it’s not four times better—it’s only three times better. Creating and managing the processes takes time.

05:31 There’s overhead involved in this. For this calculation, it’s about a 25% cost: the ideal on four CPUs would be a four times speed-up, and getting roughly three times instead means about a quarter of that ideal is lost to creating and managing the processes.

05:39 Donald Knuth said “Premature optimization is the root of all evil (or at least most of it) in programming.” What he means by this is: write your software first, figure out where the bottlenecks are, and then improve on those bottlenecks.

05:54 Most programmers do not actually have an intuitive grasp on where the bottlenecks in their software are. Every time I’ve thought I knew where one was, I’ve been wrong. You’re much better off writing your program first, measuring your program, and then figuring out how to improve it.

06:12 This leads to the question, “When do you use concurrency?” As you’ve seen, moving from synchronous to concurrent software has additional complications. There’s more code, you have to decide which type of concurrency you’re going to use, you have to worry about thread-safety and memory sharing. And then, of course, if you didn’t write the program right, you might end up with deadlocks or resource starvation.

06:37 So the first thing you really want to do is ask yourself, “Do you really need it?” There are certain domains where it makes an awful lot of sense. If you’re building a GUI, you probably want the GUI to be responsive, so having it on a different thread than the computation makes sense as a design.

06:53 If you’re doing a lot of data computation, then you want to take advantage of the fancy computer you bought and use all of its cores. I’m not saying don’t do concurrency—just make sure that you aren’t incurring the overhead of creating the more complicated code unless you actually need it.

07:10 Once you’ve concluded that you do need it, you need to decide on a model, and therefore a library to use. The easier question is I/O-bound versus CPU-bound, which leads you to decide between multiprocessing on the one hand, or threading and asyncio on the other.

07:23 If you are doing I/O-bound, then you should favor async I/O over threads if you can. It tends to be a little lighter-weight and a little faster. That being said, the libraries that you’re interacting with may not be compatible with asyncio, so you may be left with threads. Next up, I’ll give you a course summary and point you at some further reading if you want to dig in deeper.

RobyB on Dec. 11, 2020

Wow! What a lesson! I will bookmark it, it’s very interesting, and clarifies a lot of things.

Tony Ngok on Feb. 6, 2024

So, in algorithms classes when we talk about the ‘time complexity’, are we focusing on the CPU bound computation?

Bartosz Zaczyński RP Team on Feb. 6, 2024

@tonyn1999 Time complexity is an abstract measure of how often your algorithm performs the dominant operation. Some algorithms have better time complexity than others when they achieve the same goal in fewer steps.

Consider the following example:

def load(path):
    with open(path) as file:
        for line in file:
            process(line)

Here, the time complexity depends on the number of lines in the file since the most important operation, process(line), is called once for each line. Depending on the nature of this processing, it could represent either an I/O-bound or a CPU-bound task. For example, if process(line) just writes the line to another file, the task is I/O-bound, whereas if it performs some complicated calculations, it’s CPU-bound.
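
For illustration, two hypothetical versions of process() could look like this:

def process_io_bound(line):
    # Dominated by waiting on the disk (I/O-bound).
    with open("copy.txt", mode="a") as output:
        output.write(line)


def process_cpu_bound(line):
    # Dominated by arithmetic (CPU-bound).
    return sum(ord(character) ** 2 for character in line)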

Tony Ngok on Feb. 12, 2024

Given that:

  1. Mathematical operations are CPU-bound
  2. Multiprocessing (parallelism) is for CPU-bound processes, while multithreading (concurrency) is for I/O-bound tasks

Then, why does this link say that numpy.dot() (dot product of two matrices) uses multithreading instead of multiprocessing?

Bartosz Zaczyński RP Team on Feb. 12, 2024

@Tony Ngok The reason why multithreading is used for I/O-bound tasks while multiprocessing is used for CPU-bound tasks is the global interpreter lock (GIL). However, outside of Python, threads can indeed help in both types of situations.

NumPy is primarily implemented in C with bits of glue code that make it possible to call its functions from Python. As such, it isn’t limited by the GIL, which is why NumPy can leverage threads to run code in parallel on multiple CPU cores.

If you’re interested, there are ways to bypass the GIL to achieve thread-based parallelism in Python.

Christopher Trudeau RP Team on Feb. 12, 2024

Hi @Tony,

Bartosz is of course right, the GIL is what makes Python worse at this stuff, and using low-level libraries gives you ways of working around it.

Fundamentally though, a single CPU(*) can’t really do more than one thing at a time. Threading takes advantage of the fact that often a CPU is waiting on something. I can’t speak to NumPy’s decision specifically, but 1) talking to memory means the CPU has to wait, 2) using a co-processor (possibly even on-chip for floating point operations) means the CPU has to wait. Any time the CPU has to wait for something there is a chance that switching threads can mean the CPU can do something else. Switching threads has a cost though, so each application has to figure out if the “waiting for something” is long enough to gain an advantage by switching contexts.

(*) Even this is an oversimplification in modern processors. Most CPUs now have vector operations, meaning you can perform an “add” (or most other math ops) on multiple numbers at a time. NumPy definitely takes advantage of this; in fact, depending on which flags NumPy is compiled with and which processor it is running on, you can see significant differences in its performance. And even that is an oversimplification, as modern CPUs do black-magic things like branch prediction and pipeline loading of instructions – switching contexts means those things might have to be performed again when the code switches back in.
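
If you’re curious which libraries and build options your own NumPy installation was compiled with, one quick check is numpy.show_config():

import numpy as np

# Prints the BLAS/LAPACK libraries and build options this
# NumPy installation was compiled against.
np.show_config()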

If you’re trying to understand how all this works, you’re asking all the right questions. The short answer is: it is complicated and messy. If you’re trying to optimize actual code, the best thing to do is write code without concurrency first, measure it, then figure out if adding concurrency will give you the speed-up you’re looking for.
