
Creating Data Pipelines With Generators

This lesson is from the Real Python video course by Christian Mondorf.

00:00 Welcome to the fourth video. This one is about creating data pipelines with generators. This is a very concrete example of a real problem being solved by generators. Let me first tell you a bit about the problem.

00:12 If you work with data in Python, chances are you’ll be working with CSVs, and a CSV looks like this. It’s basically a flat text file with lines of data which are separated by commas, where the first line is a series of column names.

00:26 So, they’re highlighted in this case: permalink, company, numEmps, et cetera. And then below that, you have entries where each data point is separated by a comma. So this is a bit like a text version of a spreadsheet and, depending on the use case, they can get very, very big.
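For illustration, the first few rows of such a file might look something like this. Only the column names permalink, company, and numEmps come from the slide; the other columns and all of the values are made up:

```csv
permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
examplesoft,ExampleSoft,25,web,San Francisco,CA,1-Jun-07,2000000,USD,a
datawidget,DataWidget,8,web,Austin,TX,1-Feb-08,500000,USD,seed
```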

00:44 So even though this is just flat text, it can actually end up taking a lot of memory. That’s where the big in big data comes from.

00:52 If we were to open this in a sort of traditional way which doesn’t use generators, we’d probably do something like the code sample, which you have here in the top half of this slide.

01:02 So we’re writing a CSV reader, which takes in a file name, opens that file, reads in the lines, then splits them up and returns the result.

01:13 The problem with this is that it returns the result in one go, so this means that the entire result needs to be able to fit into your memory in one single operation. Depending on the use case, that might not be an issue, but if you’re working with very, very large datasets, then it can be.
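The eager reader described here might be sketched like this. It’s a reconstruction of the approach on the slide, not the exact code:

```python
def csv_reader(file_name):
    # Read the entire file into memory at once, then split it into lines.
    # The full list of lines must fit in memory before anything is returned.
    file = open(file_name)
    result = file.read().split("\n")
    return result
```

Because `read()` pulls the whole file in before `split()` even runs, the memory footprint scales with the size of the file, which is exactly what causes trouble on very large datasets.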

01:30 So, if the file is very big and you have a lot of data, then you might encounter an error like the one which is shown below in the second half of the slide. This is a MemoryError and it happens when Python tries to work with an object which exceeds the amount of memory which is available to it.

01:45 Even if you don’t actually run out of memory, if you’re working on some kind of virtual instance where you’re paying for the resources that you use, the memory footprint has an impact on how much you’re spending.

01:56 So even if you have memory available, it still puts a strain on your budget, so there’s still a payoff in trying to reduce that memory footprint. So, how can you do this with generators?

02:06 Here’s some code which works off of the TechCrunch data, which I showed you a minute ago. And it tries to answer one question, and the question is how much total Series A fundraising was carried out by those companies in the dataset.

02:20 What makes this code interesting is that it uses generators to avoid having to load the entire dataset into memory at once. So, let’s go through this line by line. On line 1, we’re just reading in the file name and storing it as file_name.

02:36 Line 2 already has a first generator. And you can tell that this is a generator because we’re using parentheses, as opposed to the square brackets we would use if we were working with a list.
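You can see that distinction in a toy example: square brackets build the whole list immediately, while parentheses create a lazy generator object that produces values only when iterated:

```python
squares_list = [n * n for n in range(5)]  # list comprehension: built all at once
squares_gen = (n * n for n in range(5))   # generator expression: nothing computed yet

print(squares_list)       # [0, 1, 4, 9, 16]
print(type(squares_gen))  # <class 'generator'>
print(list(squares_gen))  # consuming the generator yields the same values, one at a time
```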

02:48 So keep in mind that on line 2, lines is a generator object, and what it’s returning is one of the lines, one by one, so one line at a time. Next, on line 3 we have another generator, and this one is splitting up those lines and removing trailing white spaces, so it’s cleaning up those lines. But again, we’re working with a generator, so list_line isn’t a full set of lines.

03:13 It’s instead a generator which will yield lines one by one. On line 4 we’re just saving the column names, so we’re taking the very first line, which had the column names in the CSV, and saving that in cols. Line 5 uses

03:28 a generator, again, to process those lines one by one and turn them into dictionaries. So we’re zipping each line together with the column names, so that the column names are the keys and the data points are the values. Next, between lines 6 and 10, what we’re doing is we’re really answering the question.

03:45 So, we’re going through the data that we’re processing and we’re extracting the amount of money which was raised, that’s line 7, and we’re extracting it as an integer for each company, so for each of those dictionaries, in those cases where the "round" equals "a".

04:01 Then finally, we’re summing it and we’re printing our results. Let’s try running this. And there you go, you have a nicely formatted answer. But the answer isn’t really what I’m interested in, in this case.
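Assembled from this walkthrough, the script might look like the following sketch. The comments map to the line numbers narrated above; the raisedAmt column name and the sample rows are assumptions for illustration, and the real TechCrunch dataset is of course much larger:

```python
# A tiny stand-in for the TechCrunch CSV (made-up rows for illustration).
sample = """permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
digg,Digg,60,web,San Francisco,CA,1-Dec-06,8500000,USD,b
facebook,Facebook,450,web,Palo Alto,CA,1-Sep-04,500000,USD,a
photobucket,Photobucket,60,web,Palo Alto,CA,1-Mar-05,3000000,USD,a
"""
with open("techcrunch_sample.csv", "w") as f:
    f.write(sample)

# Line 1: the file name.
file_name = "techcrunch_sample.csv"

# Line 2: a generator expression (note the parentheses) that yields
# the file one line at a time.
lines = (line for line in open(file_name))

# Line 3: another generator that strips trailing whitespace and splits
# each line on commas -- still only one line at a time.
list_line = (s.rstrip().split(",") for s in lines)

# Line 4: pull the very first yielded item, the header row, into cols.
cols = next(list_line)

# Line 5: zip each data row with the column names into a dictionary,
# so the column names are the keys and the data points are the values.
company_dicts = (dict(zip(cols, data)) for data in list_line)

# Lines 6-10: keep only Series A rounds and extract raisedAmt as an integer.
funding = (
    int(company_dict["raisedAmt"])
    for company_dict in company_dicts
    if company_dict["round"] == "a"
)

# Summing consumes the whole chain of generators, one row at a time.
total_series_a = sum(funding)
print(f"Total series A fundraising: ${total_series_a}")
```

Nothing actually flows through the pipeline until `sum()` starts pulling values; each generator requests one item from the one before it, so only a single row is in flight at any moment.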

04:13 The point which I’m trying to make is that since we’re using a sequence of generators which are sort of daisy-chained together so that each one is feeding into the next one, the impact on our memory is only one line of data at a time and at no point in the script do we have to deal with the entire dataset in one go.
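You can get a feel for that difference with `sys.getsizeof`: a materialized list grows with the number of elements, while a generator object stays a small, constant size no matter how many values it will eventually yield:

```python
import sys

big_list = [i for i in range(1_000_000)]  # a million ints, all held in memory
big_gen = (i for i in range(1_000_000))   # just a generator object; no ints produced yet

print(sys.getsizeof(big_list))  # several million bytes
print(sys.getsizeof(big_gen))   # a few hundred bytes at most
```

Note that `sys.getsizeof` only measures the container itself, but that is exactly the point: the generator never becomes a container for the full dataset.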

04:32 I hope you found this example interesting, and that it’ll inspire your own work with big data. The next video is the conclusion video. In it, I’ll go over all of the main points which we discussed in this tutorial.

04:42 I’ll see you there!
