Asynchronous Iteration
00:00
In the previous lesson, I spoke about the new type hinting and annotation features of Python 3.10. In this lesson, I’ll show you two new standard library functions, aiter() and anext(), which provide support for asynchronous iteration.
00:15
You may have come across the iter() and next() functions in Python. Used together, these create an iterator and get the next item out of it.
00:23
This is actually the underlying mechanism for the for loop. The for loop creates an iterator on the object being looped over, and each iteration of the loop calls next() on that iterator.
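As a rough sketch of the mechanism just described, the while loop below is approximately what a for loop expands to under the hood (the variable names here are just for illustration):

```python
# What a for loop does behind the scenes: create an iterator with iter(),
# then repeatedly call next() until StopIteration is raised.
items = ["a", "b", "c"]

iterator = iter(items)
collected = []
while True:
    try:
        item = next(iterator)
    except StopIteration:
        break  # the for loop catches this exception and exits the block
    collected.append(item)

print(collected)  # → ['a', 'b', 'c']
```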
00:35
The next() function raises a StopIteration exception when there is nothing left to iterate over. The for loop catches this exception and exits the code block. Python 3.5 introduced two new keywords, async and await, to implement coroutines.
00:52
A coroutine is a way of doing asynchronous code execution, a form of concurrency that is often loosely called parallel execution. Coroutines are an alternative to using the threading library. To write a coroutine, you declare a function using the async keyword.
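Here's a minimal sketch of a coroutine, just to show the moving parts; the greet() function is a made-up example, not from the lesson's code:

```python
import asyncio

# The async keyword turns greet() into an asynchronous function:
# calling it returns a coroutine object rather than running the body.
async def greet(name):
    await asyncio.sleep(0)  # yield control back to the event loop
    return f"Hello, {name}"

# asyncio.run() starts an event loop and drives the coroutine to completion.
result = asyncio.run(greet("world"))
print(result)  # → Hello, world
```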
01:09
An asynchronous function has some restrictions, the key one being that it shouldn’t call blocking synchronous code. Otherwise, it effectively becomes synchronous again. And you’ll never guess what kind of code iter() and next() are. Yep, they’re synchronous.
01:24
That means that up until Python 3.10, using the built-in iterator functions inside your asynchronous code implicitly created a synchronous block. This was limiting. Python 3.10 introduced two new functions, aiter() and anext(), the a meaning asynchronous. Using these, you can now create asynchronous iterators with aiter() and get items from them with anext() inside of your async functions.
01:53
Writing asynchronous code is more complicated than writing synchronous code. The notes below have links to entire courses on this subject. The only way to demonstrate aiter() and anext() is to do so inside of some asynchronous code. I’m going to do that in just a second, but if this isn’t your wheelhouse, feel free to skip to the next lesson.
02:13 The purpose of asynchronous code is to do multiple things at once. There are two kinds of parallelism available on your computer: multi-CPU and I/O-bound. As the name implies, multi-CPU executes different chunks of code on different processors. I/O-bound parallelism is different.
02:31 It still runs on one processor but swaps between different chunks of code as it goes. Accessing the disk or network is very slow compared to most CPU operations, which means synchronous code tends to sit around waiting a lot.
02:48 Coroutines are I/O-bound parallelism. They operate on a single CPU but allow you to run a second code block while the first is waiting on input from disk or the network.
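You can see this effect with a small timing sketch, where asyncio.sleep() stands in for a slow disk or network operation (fake_io() is a made-up name for illustration):

```python
import asyncio
import time

# Simulate a slow I/O operation with asyncio.sleep(). While one
# coroutine is waiting, the event loop runs the other one.
async def fake_io(label, seconds):
    await asyncio.sleep(seconds)
    return label

async def main():
    start = time.perf_counter()
    # gather() runs both coroutines concurrently on a single CPU.
    results = await asyncio.gather(fake_io("a", 0.2), fake_io("b", 0.2))
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results)  # → ['a', 'b']
print(f"took about {elapsed:.2f}s")  # roughly 0.2s, not 0.4s
```

Run sequentially, the two waits would add up to about 0.4 seconds; run concurrently, the total is only about as long as the single longest wait.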
02:59
The example I’m going to show you reads in multiple files from the disk at a time and counts the number of newlines in each file. It uses a third-party library called aiofiles, so if you’re coding along with me, you’ll need to run pip install aiofiles. As always, it is best practice to do this in a virtual environment.
03:21
Here is the asynchronous line-counting code. First off, it needs to import asyncio. This library is used to manage each of the coroutines that I’ll be creating.
03:32
The aiofiles library provides alternate implementations of file operations, like open(), that are asynchronous. Let me just scroll down to the bottom here.
03:44 This code is a bit easier to understand if you start with the execution.
03:49
The run() function of the asyncio library takes an asynchronous function and executes it. This encapsulates the coroutine mechanism and the underlying event loop.
04:00
On line 30, I’m running the all_files() async function, passing it whatever arguments were sent in on the command line.
04:10
What makes all_files() able to be asynchronous is the async keyword attached to the function declaration. This function is responsible for setting up all the coroutines and then waiting for them to complete. The for block starting on line 22 loops through the filenames passed in on the command line, and line 23 creates a coroutine for each of them. The coroutine is also an async function, this one called count_lines().
04:37
count_lines() takes a single filename as a parameter. So, if ten files are passed in on the command line, then ten coroutines are created, one for each file.
04:48 The coroutines do the work of counting newlines and then return. The process of creating the coroutine returns almost immediately. Because it’s asynchronous, it doesn’t wait for the wrapped function to return.
05:02
Line 24 appends the newly created coroutine into a list so that you can track all of the coroutines that are currently running. Now, the await keyword. await indicates that this is a boundary between asynchronous and synchronous code.
05:18
The gather() function takes all the task coroutines that were generated and says to wait here until all of them have finished executing.
05:27
Let’s look at the coroutine that counts the newlines in an individual file. Line 6 declares count_lines() and indicates that it also is an asynchronous function.
05:38
Line 9 is an asynchronous context manager using the open() function from the aiofiles library. This is an asynchronous replacement for Python’s open(). It does the same thing, namely opening a file, but in a fashion that supports asynchronous operations on the result.
05:57 Note that this file is being opened in binary mode. Although the code is looking for newlines, and so really is only meaningful with text files, there is a gotcha here. Python by default opens text files using the operating system’s default encoding. On Linux and macOS, that’s UTF-8.
06:16
On Windows, it varies with the system’s locale settings. If a text file isn’t in the encoding Python expects, trying to read it as text raises a UnicodeDecodeError. Opening the file in binary mode instead avoids this problem. Line 10 uses the new aiter() function to create an asynchronous iterator based on the contents of the newly opened file.
06:39
Iterating on a file opened by the aiofiles.open() function will split it based on lines, meaning it’s looking for the newline character. Inside the while loop, anext() is called on the iterator, getting the next line. With each line, the counter is incremented.
06:58
And then, like with the synchronous iterator, anext() raises an exception when the iterator is done. Instead of being a StopIteration exception, it rather appropriately raises a StopAsyncIteration exception. In this case, nothing needs to be done when the iterator is empty, so the code breaks out of the infinite loop.
07:18
The final action of count_lines() is to print a result. The extra parameters to print() ensure that each coroutine prints on the same line and flushes the print buffer immediately. Okay, let’s run this sucker.
07:41
Here, I’m counting the newlines in all the PDF files in my Downloads/ directory. PDF files are actually binary, but since the UTF-8 gotcha already forced the code to read files in binary mode, that doesn’t cause a problem. And although they are binary, they do have strings inside of them, which contain newlines, so there’s something to count.
08:01 The numbers generally go up as the code executes, because the bigger the file, the longer its coroutine takes to run. To demonstrate that this is all happening asynchronously, let me run the code again.
08:17
If you follow along on the output from the two executions, you’ll note that the numbers show up in a different order. The first case of this is nine numbers in, where 679 and 681 have changed places.
08:31 They’re in a different order because there’s no guarantee of execution order with asynchronous code. This is one of the many things that makes debugging parallel code that much more difficult than debugging synchronous code. Parallel coding always breaks my brain just a little bit. Time for something lightweight and breezy. Ooh, statistics?