[Optional] How to Use Generators and yield in Python

If you want to read more about generators, below is the tutorial that accompanies the video course you just watched.

This text is part of a Real Python tutorial by Kyle Stratis.


Have you ever had to work with a dataset so large that it overwhelmed your machine’s memory? Or maybe you have a complex function that needs to maintain an internal state every time it’s called, but the function is too small to justify creating its own class. In these cases and more, generators and the Python yield statement are here to help.

By the end of this lesson, you’ll know:

  • What generators are and how to use them
  • How to create generator functions and expressions
  • How the Python yield statement works
  • How to use multiple Python yield statements in a generator function
  • How to use advanced generator methods
  • How to build data pipelines with multiple generators

Using Generators

Introduced with PEP 255, generator functions are a special kind of function that return a lazy iterator. These are objects that you can loop over like a list. However, unlike lists, lazy iterators do not store their contents in memory. For an overview of iterators in Python, take a look at Python “for” Loops (Definite Iteration).
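
As a minimal sketch of that idea, the hypothetical count_up_to() below (a name invented for illustration, not part of the tutorial) is a generator function. Calling it does not run the body. It hands back a generator object that produces values only when asked:

```python
import types


def count_up_to(limit):
    """Hypothetical generator function: yields 1, 2, ..., limit one at a time."""
    n = 1
    while n <= limit:
        yield n
        n += 1


gen = count_up_to(3)

# Calling the function only creates a generator object; no values exist yet.
is_generator = isinstance(gen, types.GeneratorType)

first = next(gen)  # runs the body up to the first yield
rest = list(gen)   # consumes whatever values are left

print(is_generator, first, rest)
```

Nothing is computed until next() or a loop asks for a value, which is why such an iterator never needs to hold all of its contents in memory.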

Now that you have a rough idea of what a generator does, you might wonder what they look like in action. Let’s take a look at two examples. In the first, you’ll see how generators work from a bird’s eye view. Then, you’ll zoom in and examine each example more thoroughly.

Example 1: Reading Large Files

A common use case of generators is to work with data streams or large files, like CSV files. These text files separate data into columns by using commas. This format is a common way to share data. Now, what if you want to count the number of rows in a CSV file? The code block below shows one way of counting those rows:

Python
csv_gen = csv_reader("some_csv.txt")
row_count = 0

for row in csv_gen:
    row_count += 1

print(f"Row count is {row_count}")

Looking at this example, you might expect csv_gen to be a list. To populate this list, csv_reader() opens a file and loads its contents into csv_gen. Then, the program iterates over the list and increments row_count for each row.

This is a reasonable explanation, but would this design still work if the file were very large? What if the file is larger than the memory you have available? To answer this question, let’s assume that csv_reader() just opens the file and reads it into a list:

Python
def csv_reader(file_name):
    # Naive version: reads the entire file into memory at once
    with open(file_name) as file:
        result = file.read().split("\n")
    return result

This function opens a given file and uses file.read() along with .split() to add each line as a separate element to a list. If you were to use this version of csv_reader() in the row counting code block you saw further up, then you’d get the following output:

Python
Traceback (most recent call last):
  File "ex1_naive.py", line 22, in <module>
    main()
  File "ex1_naive.py", line 13, in main
    csv_gen = csv_reader("file.txt")
  File "ex1_naive.py", line 6, in csv_reader
    result = file.read().split("\n")
MemoryError

In this case, open() returns a file object, which is a lazy iterator that you can step through line by line. However, file.read().split() loads everything into memory at once, causing the MemoryError.
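
To see the line-by-line behavior on a small scale, here is a rough sketch that writes a throwaway three-line file (its contents and the variable names are made up for illustration) and pulls lines from the object returned by open() one at a time with next():

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a,b\n1,2\n3,4\n")
    path = tmp.name

with open(path) as file:
    first_line = next(file)   # reads only the first line
    second_line = next(file)  # reads only the second line
    remaining = list(file)    # whatever has not been consumed yet

os.remove(path)

print(repr(first_line), repr(second_line), remaining)
```

Each next() call fetches a single line, whereas file.read() would return the whole file as one string before .split() even starts.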

Before that happens, you’ll probably notice your computer slow to a crawl. You might even need to kill the program with a KeyboardInterrupt. So, how can you handle these huge data files? Take a look at a new definition of csv_reader():

Python
def csv_reader(file_name):
    # Generator version: yields one row at a time instead of building a list
    with open(file_name, "r") as file:
        for row in file:
            yield row

In this version, you open the file, iterate through it, and yield a row. This code should produce the following output, with no memory errors:

Shell
Row count is 64186394

What’s happening here? Well, you’ve essentially turned csv_reader() into a generator function. This version opens a file, loops through each line, and yields each row, instead of returning it.
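
If you don’t have a huge file on hand, you can try the same idea on a tiny one. The sketch below assumes a made-up three-line CSV written to a temporary file, and uses a csv_reader() variant that closes the file with a with block:

```python
import os
import tempfile


def csv_reader(file_name):
    # Generator function: hands back one row per iteration step.
    with open(file_name, "r") as file:
        for row in file:
            yield row


# Hypothetical stand-in for a large CSV file, kept tiny for the sketch.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,name\n1,ada\n2,grace\n")
    sample_path = tmp.name

# Count rows while holding only one row in memory at a time.
row_count = sum(1 for _ in csv_reader(sample_path))
print(f"Row count is {row_count}")  # Row count is 3

os.remove(sample_path)
```

The count includes the header row, just as the row counting loop further up would.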
