00:00 Use chunks to iterate through files. Another way to deal with very large datasets is to split the data into smaller chunks and process one chunk at a time.
If you use
read_csv(), then you can specify the optional parameter
chunksize. It defaults to
None and can take on an integer value that indicates the number of rows in a single chunk. When it’s set to an integer,
read_csv() returns an iterable that you can use in a
for loop to get and process only a fragment of the dataset in each iteration.
In this example, the first iteration of the
for loop returns a
DataFrame with the first eight rows of the dataset only. The second iteration returns another
DataFrame with the next eight rows, and the third and last iteration returns the remaining four rows. In each iteration, you get and process the
DataFrame with a number of rows equal to
chunksize. It’s possible to have fewer rows than the value of
chunksize in the last iteration, as you’ve seen, and you can use this functionality to control the amount of memory required to process data and keep that amount reasonably small.
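The pattern described above can be sketched like this, assuming a hypothetical file named data.csv with 20 rows and a chunksize of 8, matching the iteration counts in the example:

```python
import pandas as pd

# Create a small CSV with 20 rows just for demonstration.
pd.DataFrame({"x": range(20)}).to_csv("data.csv", index=False)

chunk_sizes = []
# With chunksize set, read_csv() returns an iterable of DataFrames
# instead of one large DataFrame.
for chunk in pd.read_csv("data.csv", chunksize=8):
    # Each chunk is a regular DataFrame; process it here.
    chunk_sizes.append(len(chunk))

print(chunk_sizes)  # [8, 8, 4]
```

Only one chunk is held in memory at a time, which is what keeps the memory footprint small regardless of the total file size.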
01:55 Now that you’ve seen techniques for working with big data, let’s review what you’ve learned in this course.