Creating Data Pipelines With Generators
In this lesson, you’ll learn how to use generator expressions to build a data pipeline. Data pipelines allow you to string together code to process large datasets or streams of data without maxing out your machine’s memory.
For this example, you’ll use a CSV file that is pulled from the TechCrunch Continental USA dataset, which describes funding rounds and dollar amounts for various startups based in the USA. Click the link under Supporting Material to download the dataset included with the sample code for this course.
00:00 Welcome to the fourth video. This one is about creating data pipelines with generators. It’s a very concrete example of a real problem being solved by generators. Let me first tell you a bit about the problem.
00:12 If you work with data in Python, chances are you will be working with CSV files, and a CSV looks like this. It’s basically a flat text file with lines of data which are separated by commas, where the first line is a series of column names, which are highlighted in this case: numEmps, et cetera. Below that, you have entries where each data point is separated by a comma. So this is a bit like a text version of a spreadsheet and, depending on the use case, these files can get very, very big.
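As a quick illustration of that structure, each line can be turned into its individual fields by stripping the trailing newline and splitting on commas. The row below is made up for illustration:

```python
# One made-up data line in the style of the dataset.
line = "Acme,15,web,1000000\n"

# Strip the trailing newline, then split on commas to get the fields.
fields = line.rstrip().split(",")
print(fields)  # ['Acme', '15', 'web', '1000000']
```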
01:13 The problem with reading a file like this in one go is that the entire result needs to fit into your memory in one single operation. Depending on the use case, that might not be a problem, but if you’re working with very, very large datasets, then it can be. So, if the file is very big and you have a lot of data, then you might encounter an error like the one shown in the second half of the slide: a MemoryError, which happens when Python tries to work with an object that exceeds the amount of memory which is available to it.
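To see why generators help here, you can compare the memory footprint of a list, which materializes every element at once, with the equivalent generator expression, which only stores its running state. A minimal sketch using `sys.getsizeof`:

```python
import sys

# A list comprehension builds every element up front,
# so its size grows with the amount of data.
squares_list = [n * n for n in range(1_000_000)]

# The equivalent generator expression only stores its internal state,
# so its size stays tiny no matter how many items it will yield.
squares_gen = (n * n for n in range(1_000_000))

print(sys.getsizeof(squares_list))  # on the order of megabytes
print(sys.getsizeof(squares_gen))   # a few hundred bytes at most
```

Note that `sys.getsizeof` only measures the container itself, but the difference is still striking: the generator’s size is constant while the list’s grows with the data.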
01:45 Even if you don’t actually run out of memory, if you’re working on some kind of virtual instance where you’re paying for the resources that you use, the memory footprint has an impact on how much you’re spending.
02:06 Here’s some code which works on the TechCrunch data I showed you a minute ago. It tries to answer one question, and the question is how much total Series A fundraising was carried out by the companies in the dataset. What makes this code interesting is that it uses generators to avoid having to load the entire dataset into memory at once. So, let’s go through this line by line. On line 1, we’re just reading in the file name and storing it in a variable.
02:36 Line 2 already has the first generator. You can tell that this is a generator because we’re using parentheses, as opposed to the square brackets we would use if we were building a list. So keep in mind that on line 2, lines is a generator object, and what it’s returning is the lines of the file one by one, so one line at a time. Next, on line 3, we have another generator, and this one is splitting up those lines and removing trailing whitespace, so it’s cleaning up those lines. But again, we’re working with a generator, so list_line isn’t a full set of lines. It’s instead a generator which will yield cleaned lines one by one. On line 4, we’re just saving the column names, so we’re taking the very first line, which had the column names in the CSV, and saving that in cols. Line 5 uses
03:28 a generator, again, to process those lines one by one and turn them into dictionaries. So we’re zipping them into dictionaries, where the column names are the keys and the data points are the values. Next, between lines 6 and 10, what we’re doing is really answering the question. We’re going through the data that we’re processing and extracting the amount of money which was raised, that’s line 7, as an integer for each company, so for each of those dictionaries, in those cases where the funding round was a Series A round.
04:01 Then finally, we’re summing it and we’re printing our results. Let’s try running this. And there you go, you have a nicely formatted answer. But the answer isn’t really what I’m interested in, in this case.
04:13 The point which I’m trying to make is that since we’re using a sequence of generators which are sort of daisy-chained together so that each one is feeding into the next one, the impact on our memory is only one line of data at a time and at no point in the script do we have to deal with the entire dataset in one go.
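The on-screen code isn’t reproduced in this transcript, so here is a sketch of such a daisy-chained pipeline. The column names (round, raisedAmt) and the rows are assumptions for illustration, and an in-memory io.StringIO stands in for the CSV file:

```python
import io

# In-memory stand-in for the CSV file; the column names and rows
# below are made up for illustration.
csv_file = io.StringIO(
    "company,round,raisedAmt\n"
    "Acme,a,1000000\n"
    "Globex,b,2000000\n"
    "Initech,a,500000\n"
)

lines = (line for line in csv_file)                       # one raw line at a time
list_line = (line.rstrip().split(",") for line in lines)  # cleaned, split fields
cols = next(list_line)                                    # grab the header row
company_dicts = (dict(zip(cols, data)) for data in list_line)
funding = (
    int(company_dict["raisedAmt"])
    for company_dict in company_dicts
    if company_dict["round"] == "a"                       # keep Series A rounds only
)
total = sum(funding)  # sum() pulls records through the whole chain
print(f"Total series A fundraising: ${total}")
```

Because every stage is a generator, nothing is read until `sum()` starts pulling records through the chain, one line at a time.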
04:32 I hope you found this example interesting, and that it’ll inspire your own work with big data. The next video is the conclusion video. In it, I’ll go over all of the main points which we discussed in this tutorial.