Creating Data Pipelines With Generators
In this lesson, you’ll learn how to use generator expressions to build a data pipeline. Data pipelines allow you to string together code to process large datasets or streams of data without maxing out your machine’s memory.
For this example, you’ll use a CSV file that is pulled from the TechCrunch Continental USA dataset, which describes funding rounds and dollar amounts for various startups based in the USA. Click the link under Supporting Material to download the dataset included with the sample code for this course.
00:00 Welcome to the fourth video. This one is about creating data pipelines with generators. This is a concrete example of a real problem being solved by generators. Let me first tell you a bit about the problem.
00:12 If you work with data in Python, chances are you will be working with CSVs, and a CSV looks like this. It’s basically a flat text file with lines of data whose values are separated by commas, where the first line is a series of column names.
00:26 So, they’re highlighted in this case: permalink, company, numEmps, et cetera. And then below that, you have entries where each data point is separated by a comma. So this is a bit like a text version of a spreadsheet and, depending on the use case, they can get very, very big.
00:44 So even though this is just flat text, it can actually end up taking a lot of memory. That’s where the big in big data comes from.
00:52 If we were to open this in a sort of traditional way which doesn’t use generators, we’d probably do something like the code sample, which you have here in the top half of this slide.
01:02 So we’re writing a CSV reader, which takes in a file name, it opens that file, it reads each line one by one here, and then it splits the lines and it returns the result.
01:13 The problem with this is that it returns the result in one go, so this means that the entire result needs to be able to fit into your memory in one single operation. Depending on the use case, that might not be a problem, but if you’re working with very, very large datasets, then this can be a problem.
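The slide itself is not reproduced in this transcript. A minimal sketch of the kind of eager reader described above might look like the following. The function name `csv_reader`, the plain comma split, and the tiny sample file are assumptions made for illustration, not the code from the slide.

```python
import os
import tempfile


def csv_reader(file_name):
    """Eagerly read a whole CSV file and return every row at once."""
    with open(file_name) as file:
        # file.read() pulls the entire file into memory in one operation.
        lines = file.read().splitlines()
    # The complete list of rows is built before anything is returned.
    return [line.split(",") for line in lines]


# Tiny hypothetical sample file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("permalink,company,numEmps\nacme,Acme,10\n")
    sample_path = tmp.name

rows = csv_reader(sample_path)
os.remove(sample_path)
print(rows)
```

With a two-line file this is harmless. With a very large file, both the raw text and the full list of rows have to fit in memory at the same time.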
01:30 So, if the file is very big and you have a lot of data, then you might encounter an error like the one which is shown below in the second half of the slide. This is a MemoryError and it happens when Python tries to work with an object which exceeds the amount of memory which is available to it.
01:45 Even if you don’t actually run out of memory, if you’re working on some kind of virtual instance where you’re paying for the resources that you use, the memory footprint has an impact on how much you’re spending.
01:56 So even if you have memory available, it still puts a strain on your budget, so there’s still a pay-off on trying to reduce that memory footprint. So, how can you do this with generators?
02:06 Here’s some code which works off of the TechCrunch data, which I showed you a minute ago. And it tries to answer one question, and the question is how much total Series A fundraising was carried out by those companies in the dataset.
02:20 What makes this code interesting is that it uses generators to avoid having to load the entire dataset into memory at once. So, let’s go through this line by line. On line 1, we’re just reading in the file name and storing it as file_name.
02:36 The 2nd line already has a first generator. And you can tell that this is a generator because we’re using parentheses, as opposed to square brackets, which would be what we would use if we were working with a list.
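The difference between parentheses and square brackets can be seen in isolation. The sketch below is not part of the lesson code; the names `numbers`, `squares_list`, and `squares_gen` are made up for illustration.

```python
import sys

numbers = range(100_000)

# Square brackets: a list comprehension, every result is computed and stored now.
squares_list = [n * n for n in numbers]

# Parentheses: a generator expression, nothing is computed until a value is requested.
squares_gen = (n * n for n in numbers)

print(type(squares_list).__name__)  # list
print(type(squares_gen).__name__)   # generator

# The list holds references to all 100,000 results; the generator object stays small.
print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))  # True

# Values are produced one at a time, on demand.
print(next(squares_gen))  # 0
print(next(squares_gen))  # 1
```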
02:48 So keep in mind that on line 2, lines is a generator object, and what it’s returning is one of the lines, one by one, so one line at a time. Next, on line 3 we have another generator, and this one is splitting up those lines and removing trailing whitespace, so it’s cleaning up those lines. But again, we’re working with a generator, so list_line isn’t a full set of lines.
03:13 It’s instead a generator which will yield lines one by one. On line 4 we’re just saving the column names, so we’re taking the very first line, which had the column names in the CSV, and saving that in cols.
03:28 Line 5 uses a generator, again, to process those lines one by one and turn them into dictionaries. So we’re zipping them into dictionaries, where the column names are the keys and the data points are the values, so, the data. Next, between lines 6 and 10, what we’re doing is we’re really answering the question.
03:45 So, we’re going through the data that we’re processing and we’re extracting the amount of money which was raised, that’s line 7, and we’re extracting it as an integer for each company, so for each of those dictionaries, in those cases where the "round" equals "a".
04:01 Then finally, we’re summing it and we’re printing our results. Let’s try running this. And there you go, you have a nicely formatted answer. But the answer isn’t really what I’m interested in, in this case.
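The code on screen is not included in this transcript, so here is a hedged sketch of a pipeline that follows the steps just described. The names `lines`, `list_line`, `cols`, and the "round" equals "a" condition come from the walkthrough. The names `company_dicts`, `funding`, and `total_series_a`, the `raisedAmt` column, and the three sample rows are assumptions. The sketch also reads from an in-memory sample instead of opening `file_name`, so its line numbers will not match the ones mentioned in the video.

```python
import io

# Hypothetical three-row sample standing in for the downloaded CSV file.
sample_csv = io.StringIO(
    "permalink,company,raisedAmt,round\n"
    "acme,Acme,1000000,a\n"
    "globex,Globex,2500000,b\n"
    "initech,Initech,500000,a\n"
)

# Stage 1: yield the raw lines one at a time.
lines = (line for line in sample_csv)

# Stage 2: strip trailing whitespace and split each line on commas.
list_line = (s.rstrip().split(",") for s in lines)

# Consume the first row, which holds the column names.
cols = next(list_line)

# Stage 3: pair the column names with each row's values.
company_dicts = (dict(zip(cols, data)) for data in list_line)

# Stage 4: keep only Series A rounds and convert the amount to an integer.
funding = (
    int(company_dict["raisedAmt"])
    for company_dict in company_dicts
    if company_dict["round"] == "a"
)

# sum() drives the whole chain, pulling one row through all stages at a time.
total_series_a = sum(funding)
print(f"Total series A fundraising: ${total_series_a}")
```

Nothing is read from the source until `next(list_line)` and `sum(funding)` start requesting values. Note that a plain `split(",")` assumes no field contains a quoted comma.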
04:13 The point which I’m trying to make is that since we’re using a sequence of generators which are sort of daisy-chained together so that each one is feeding into the next one, the impact on our memory is only one line of data at a time and at no point in the script do we have to deal with the entire dataset in one go.
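One way to see this one-line-at-a-time behaviour is to record when each line is actually read. The sketch below is an illustration only. It swaps the first generator expression for a small generator function, `read_lines`, so that it can log each read; that function, the `events` list, and the sample data are all made up for this example.

```python
events = []


def read_lines(raw_lines):
    """Yield lines one by one, recording each read in `events`."""
    for line in raw_lines:
        events.append(f"read {line}")
        yield line


raw = ["a,1", "b,2", "c,3"]

# Chain three stages together; building the chain reads nothing.
lines = read_lines(raw)
split_lines = (line.split(",") for line in lines)
values = (int(fields[1]) for fields in split_lines)
events_before = list(events)
print(events_before)  # []

# Asking for one value pulls exactly one line through every stage.
first = next(values)
events_after_first = list(events)
print(first, events_after_first)  # 1 ['read a,1']

# Consuming the rest reads the remaining lines, still one at a time.
total = first + sum(values)
print(total, len(events))  # 6 3
```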
04:32 I hope you found this example interesting, and that it’ll inspire your own work with big data. The next video is the conclusion video. In it, I’ll go over all of the main points which we discussed in this tutorial.
Anonymous on July 8, 2020
I computed the sum for the series A funding using my own code, and keep getting $4380015000 instead of $4376015000 (using the code in the video). Can’t figure out why my sum is larger. I used csv.DictReader, Pandas DataFrame and also Excel.
Thanks!
Jon David on Nov. 11, 2021
Generators of generators of generators…cool example!
presbyte8 on Dec. 14, 2021
Fantastic example. I have not bothered to watch the lesson, just grabbed the source. I have gone through the code line by line with vscode debugger and got the general idea.
Very helpful!
Plan to come back and watch the course itself.
Thank you!
admin1 on Oct. 11, 2022
By the way, in the file, the watermfrontmedia company has 11 columns while the others have 10 columns. I found it while trying to calculate the total funding without the round condition.
nelsonblue24 on June 28, 2020
If I understand correctly:
* Lines 6-10 contain still another generator.
* Line 11 sums all the values generated by that generator.
If I am correct, I think it would help to state those facts explicitly in the video.