Using Chunks
00:00 Use chunks to iterate through files. Another way to deal with very large datasets is to split the data into smaller chunks and process one chunk at a time.
00:11 If you use read_csv(), read_json(), or read_sql(), then you can specify the optional parameter chunksize.
00:22 chunksize defaults to None and can take an integer value that indicates the number of rows in a single chunk. When it’s set to an integer, read_csv() returns an iterable that you can use in a for loop to get and process only a fragment of the dataset in each iteration.
00:54 In this example, the chunksize is 8.
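A minimal sketch of that pattern, assuming a hypothetical file data.csv with twenty rows standing in for the dataset shown in the video:

import pandas as pd

# Each iteration yields a DataFrame with at most chunksize rows;
# only one chunk is held in memory at a time.
for chunk in pd.read_csv("data.csv", chunksize=8):
    print(chunk.shape)  # (8, ...), (8, ...), then (4, ...) for a 20-row file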
01:15 The first iteration of the for loop returns a DataFrame with only the first eight rows of the dataset. The second iteration returns another DataFrame with the next eight rows, and the third and last iteration returns the remaining four rows. In each iteration, you get and process a DataFrame with a number of rows equal to chunksize.
01:40 It’s possible for the last iteration to have fewer rows than the value of chunksize, as you’ve seen. You can use this functionality to control the amount of memory required to process data and keep that amount reasonably small.
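For instance, here’s a sketch of computing a column mean one chunk at a time, so only chunksize rows are ever loaded at once (the file name and the value column are hypothetical):

import pandas as pd

total = 0.0
row_count = 0
# Accumulate per-chunk results instead of loading the whole file.
for chunk in pd.read_csv("data.csv", chunksize=8):
    total += chunk["value"].sum()
    row_count += len(chunk)

mean_value = total / row_count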
01:55 Now that you’ve seen techniques for working with big data, let’s review what you’ve learned in this course.