Generators are functions that can be paused and resumed on the fly, returning an object that can be iterated over. Unlike lists, they are lazy and thus produce items one at a time and only when asked. So they are much more memory efficient when dealing with large datasets. This article details how to create generator functions and expressions as well as why you would want to use them in the first place.
To create a generator, you define a function as you normally would but use the yield statement instead of return, indicating to the interpreter that this function should be treated as an iterator:
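For example, a simple countdown generator (a minimal sketch; the function name and the "Starting" string are assumptions consistent with the discussion below):

```python
def countdown(num):
    print('Starting')
    while num > 0:
        yield num  # pause here, handing num back to the caller
        num -= 1
```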
The yield statement pauses the function and saves its local state so that it can be resumed right where it left off.
What happens when you call this function?
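Something like this (the countdown sketch is repeated for completeness; the memory address in the output will differ):

```python
def countdown(num):
    print('Starting')
    while num > 0:
        yield num
        num -= 1

val = countdown(5)
# Note: "Starting" is NOT printed here -- the body hasn't run yet.
print(val)  # <generator object countdown at 0x...>
```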
Calling the function does not execute it. We know this because the string "Starting" did not print. Instead, the function returns a generator object, which is used to control execution.
Generator objects execute when next() is called:
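A sketch, using the same countdown generator (repeated here so the snippet stands alone):

```python
def countdown(num):
    print('Starting')
    while num > 0:
        yield num
        num -= 1

val = countdown(5)
print(next(val))  # prints "Starting" first, then 5
```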
When you call next() the first time, execution begins at the start of the function body and continues until the next yield statement, where the value to the right of the statement is returned. Subsequent calls to next() continue from the yield statement to the end of the function, then loop around and continue from the start of the loop until another yield is hit. If yield is never hit (which in our case means we never enter the loop because num <= 0), a StopIteration exception is raised:
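A sketch of repeated next() calls on a small countdown (the try/except simply makes the exhaustion visible):

```python
def countdown(num):
    print('Starting')
    while num > 0:
        yield num
        num -= 1

val = countdown(3)
print(next(val))  # "Starting" is printed, then 3
print(next(val))  # 2
print(next(val))  # 1
try:
    next(val)     # the loop has finished -- no yield is hit
except StopIteration:
    print('Generator exhausted')
```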
Just like list comprehensions, generators can also be written in expression form, except that they return a generator object rather than a list:
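For example (the squares are illustrative; the memory address will differ):

```python
my_list = [num ** 2 for num in range(5)]
print(my_list)  # [0, 1, 4, 9, 16]

my_gen = (num ** 2 for num in range(5))
print(my_gen)   # <generator object <genexpr> at 0x...>
```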
Take note of the parens on either side of the expression on the second line, denoting a generator expression, which, for the most part, does the same thing that a list comprehension does, but does it lazily:
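A sketch of that laziness in action -- values are only computed as they are requested:

```python
my_gen = (num ** 2 for num in range(5))
print(next(my_gen))  # 0
print(next(my_gen))  # 1
print(list(my_gen))  # [4, 9, 16] -- only the values not yet consumed
```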
Be careful not to mix up the syntax of a list comprehension ([]) with that of a generator expression (()), since generator expressions can run slower than list comprehensions (unless you run out of memory, of course):
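A rough way to see the difference with timeit (a sketch, not a rigorous benchmark; absolute numbers vary by machine):

```python
import timeit

# Summing squares of a small range that easily fits in memory.
list_time = timeit.timeit('sum([num ** 2 for num in range(1000)])', number=1000)
gen_time = timeit.timeit('sum(num ** 2 for num in range(1000))', number=1000)

print(f'list comprehension:   {list_time:.3f}s')
print(f'generator expression: {gen_time:.3f}s')
```

On data that fits comfortably in memory, the list comprehension is often the faster of the two; the generator expression wins once the data no longer fits.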
This is particularly easy (even for senior developers) to do in the above example since both output the exact same thing in the end.
NOTE: Keep in mind that generator expressions are drastically faster when the size of your data is larger than the available memory.
Generators are perfect for reading a large number of large files since they yield out data a single chunk at a time irrespective of the size of the input stream. They can also result in cleaner code by decoupling the iteration process into smaller components.
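A non-generator version might look something like this (a sketch; the function name collect_matches is illustrative):

```python
import os

def collect_matches(pattern, dir_path):
    """Build a list of every line in every file under dir_path that contains pattern."""
    matches = []
    for filename in os.listdir(dir_path):
        filepath = os.path.join(dir_path, filename)
        if not os.path.isfile(filepath):
            continue
        with open(filepath) as f:
            for line in f:
                if pattern in line:
                    matches.append(line)  # every match is held in memory at once
    return matches
```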
This function loops through a set of files in the specified directory. It opens each file and then loops through each line to test for the pattern match.
This works fine with a small number of small files. But what if we’re dealing with extremely large files? And what if there are a lot of them? Fortunately, Python’s open() function is efficient: iterating over a file object reads one line at a time rather than loading the entire file into memory. But what if our matches list far exceeds the available memory on our machine?
So, instead of running out of space (large lists) and time (a nearly infinite stream of data) when processing large amounts of data, generators are the ideal tool, as they yield data one item at a time instead of building intermediate lists.
Let’s look at the generator version of the above problem and try to understand, using a processing pipeline, why generators are apt for such use cases.
We divided our whole process into three different components:
- Generating the set of filenames
- Generating all lines from all files
- Filtering out lines on the basis of pattern matching
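A sketch of those three components glued into a pipeline (cat_files and grep_files are named in the surrounding text; generate_filenames, the directory, and the search pattern are illustrative):

```python
import os

def generate_filenames(dir_path):
    """Yield the path of every *.py file under dir_path, one at a time."""
    for root, _, files in os.walk(dir_path):
        for name in files:
            if name.endswith('.py'):
                yield os.path.join(root, name)

def cat_files(filenames):
    """Yield every line from every file, one line at a time."""
    for filename in filenames:
        with open(filename) as f:
            for line in f:
                yield line

def grep_files(lines, pattern):
    """Yield only the lines that contain the pattern."""
    for line in lines:
        if pattern in line:
            yield line

# Glue the pipeline together -- nothing is read until we iterate.
py_files = generate_filenames('.')
py_lines = cat_files(py_files)
matches = grep_files(py_lines, 'generator')

for match in matches:
    print(match, end='')
```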
In the above snippet, we don’t use any extra variables to form the list of lines. Instead, we create a pipeline that feeds its components one item at a time via the iteration process. grep_files takes in a generator of all the lines of the *.py files. Similarly, cat_files takes in a generator of all the filenames in a directory. This is how the whole pipeline is glued together via iteration.
Generators work great for web scraping and crawling recursively:
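A sketch of a lazy crawler. To keep the example self-contained, the fetching logic is injected as a function; in real code it might issue HTTP requests and parse HTML, neither of which is shown here:

```python
def crawl(start_url, fetch_links, max_pages=10):
    """Lazily yield URLs one at a time, breadth-first, starting from start_url.

    fetch_links(url) should return an iterable of the links found on that page.
    """
    seen = set()
    queue = [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        yield url                      # hand one page back before fetching more
        queue.extend(fetch_links(url))

# A stubbed-out link graph standing in for real HTTP fetches:
site = {
    'a': ['b', 'c'],
    'b': ['c'],
    'c': [],
}

for page in crawl('a', lambda url: site.get(url, [])):
    print(page)  # one page at a time, fetched on demand
```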
Here, we simply fetch a single page at a time and then perform some sort of action on the page when execution occurs. What would this look like without a generator? Either the fetching and processing would have to happen within the same function (resulting in highly coupled code that’s hard to test) or we’d have to fetch all the links before processing a single page.
Generators allow us to ask for values as and when we need them, making our applications more memory efficient and perfect for infinite streams of data. They can also be used to refactor out the processing from loops resulting in cleaner, decoupled code. How have you used generators in your own projects?
Want to see more examples? Check out Generator Tricks for Systems Programmers.