Write pandas Objects Directly to Compressed Formats
Since pandas version 0.21.0 you can save your DataFrames in a compressed format to save space.
Have a look at this short and sweet recipe to save a DataFrame in a compressed format using gzip:
abalone.to_json('df.json.gz', orient='records',
                lines=True, compression='gzip')
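To check that nothing is lost along the way, the compressed file can be read straight back with read_json(). The following is a minimal sketch: the small DataFrame is a hypothetical stand-in for the abalone data, and the temporary directory is only there so the example cleans up after itself.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the course's abalone DataFrame.
df = pd.DataFrame({
    "sex": ["M", "F", "I"],
    "rings": [15, 9, 7],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df.json.gz")

    # Write one JSON record per line, gzip-compressed.
    df.to_json(path, orient="records", lines=True, compression="gzip")

    # Read it back using the same orient/lines settings and compression.
    restored = pd.read_json(path, orient="records", lines=True,
                            compression="gzip")

print(restored)
```

The same orient and lines values are used on both sides so that the reader interprets the file the way the writer produced it.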
Watch the video to learn more about it.
Congratulations, you made it to the end of the course! What’s your #1 takeaway or favorite thing you learned? How are you going to put your newfound skills to use? Leave a comment in the discussion section and let us know.
00:00 You’ve made it. In this last video, you’re going to learn how to take Pandas objects and put them directly into compressed formats. Sometimes the DataFrames you’re working with can get very large and it can be a hassle to save them in a non-compressed format.
00:14 Pandas actually added support in version 0.21.0 to compress these objects directly from Pandas. So let’s take the data set from the settings video and see how this works in the terminal. I’m going to copy that, open up a terminal, and start the Python interpreter.
00:34 Now import pandas as pd, paste everything in, and there you go. So let’s say you’re doing some work on this dataset and you’re ready to save it.
00:46 You can take the DataFrame, and then if you’re going to save it as a JSON file, you could do .to_json(), just call it 'df.json.gz',
01:07 set lines=True, and for compression, you can actually put in 'gzip'. Before I run this, let me open up my project viewer.
01:22 And there you go. You can see the compressed version of that file has been saved. Let’s take this a step further to show the significance of this. So import os.path and then take that DataFrame again, and this time save it as an uncompressed JSON file, so just 'df.json'.
01:44 orient='records' again, and set lines=True. And there you go. Now the uncompressed version is saved as well. With os.path you can call the getsize() method, so if we did this on 'df.json' and then divided that by the size of the compressed version,
02:13 you can see that the uncompressed version is almost 10 times larger than the compressed version. When you’re dealing with large data sets, this can make a huge difference, so think about using compression before you save your next Pandas object.
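The size comparison from the video can be sketched as a short script. This is an illustration under stated assumptions, not the exact session from the video: the DataFrame here is made-up repetitive data rather than the abalone set, so the ratio it prints will differ from the roughly 10x figure mentioned above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: many repetitive rows, which tend to compress well.
df = pd.DataFrame({
    "sex": ["M", "F", "I"] * 1000,
    "rings": list(range(3000)),
})

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "df.json")
    zipped = os.path.join(tmp, "df.json.gz")

    # Save the same records twice: once uncompressed, once with gzip.
    df.to_json(plain, orient="records", lines=True)
    df.to_json(zipped, orient="records", lines=True, compression="gzip")

    # Ratio of uncompressed size to compressed size, in bytes.
    ratio = os.path.getsize(plain) / os.path.getsize(zipped)

print(f"Uncompressed file is {ratio:.1f}x the size of the gzip file")
```

How much space you save depends on the data: repetitive text such as JSON keys usually compresses well, while data that is already dense or random shrinks less.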
Joe Tatusko RP Team on April 15, 2019
Glad you enjoyed it! Feel free to reach out if you have any questions :D
senatoduro8 on July 17, 2019
I love the clipboard trick. It’s my favorite so far, and it allows me to copy data from the “supporting material” page and get working with it without having to save it first as a file, because it’s a throwaway file anyway.
Thanks for the tutorial
Joe Tatusko RP Team on July 18, 2019
Yeah! Such a neat little feature that goes mostly unnoticed. Glad it could help speed up your workflow!
Pygator on Nov. 28, 2019
Finally finished. I had forgotten about this course, but more videos from you on core Pandas data structures would be nice. Great tips. Also, you sound like the lead actor in Boyhood; I recently watched that movie.
Pakorn on Dec. 18, 2019
Great tips, Thanks!
Fahim on Aug. 14, 2020
At last completed it. Great content.
feygin on May 7, 2021
One of the most comprehensive and useful courses so far!
andrewcheryl on April 12, 2019
Awesome course - full of really useful tips. Thank you!