Working With CSV Files
00:00
Working with CSV files. You’ve already learned how to read and write CSV files. Now let’s dig a little deeper into the details. When you use .to_csv() to save your DataFrame, you can provide an argument for the parameter path_or_buf to specify the path, name, and extension of the target file. path_or_buf is the first argument .to_csv() will get.
00:26
It can be any string that represents a valid file path that includes the filename and its extension. You’ve already seen this in a previous example. However, if you omit path_or_buf, then .to_csv() won’t create any files.
00:41 Instead, it will return the corresponding string.
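The onscreen code isn’t part of this transcript, so here’s a minimal sketch of the behavior just described. The two-row DataFrame is a hypothetical sample, not the dataset used in the course.

```python
import pandas as pd

# Hypothetical sample data, not the course's dataset.
df = pd.DataFrame(
    {"COUNTRY": ["China", "India"], "POP": [1398.72, 1351.16]},
    index=["CHN", "IND"],
)

# With no path_or_buf argument, .to_csv() writes no file
# and returns the CSV text as a string instead.
csv_text = df.to_csv()
print(type(csv_text))
print(csv_text)
```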
00:55
As you can see onscreen, now you have the string instead of a CSV file. You also have some missing values in your DataFrame object. For example, the continent for Russia and the independence days for several countries are not available. In data science and machine learning, you must handle missing values carefully, and pandas excels here.
01:17
By default, pandas uses the nan value to replace missing values. nan stands for Not a Number, and it’s a particular floating-point value in Python.
01:29
You can get a nan value with any of the functions shown onscreen. The continent that corresponds to Russia in the DataFrame is nan.
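The transcript doesn’t capture what was shown onscreen, so the three expressions below are an assumption about common ways to obtain nan, not a record of the video. The sketch also shows why you’d test for nan with a helper rather than with ==.

```python
import math

import numpy as np

# Three common ways to get a nan value (assumed, not taken from the video).
nan_values = [float("nan"), math.nan, np.nan]

# nan never compares equal to anything, not even to itself,
# so use math.isnan() (or pandas' isna()) to detect it.
all_nan = all(math.isnan(value) for value in nan_values)
equal_to_itself = math.nan == math.nan

print(all_nan)
print(equal_to_itself)
```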
01:45
When you save your DataFrame to a CSV file, empty strings will represent the missing data. You can see this in your file data.csv and in the string created onscreen earlier on.
01:57
If you want to change this behavior, then use the optional parameter na_rep.
02:12
This code produces the file new-data.csv, where the missing values are no longer empty strings. You can see the contents of that file onscreen now. Note that the string '(missing)' in the file corresponds to the nan values from the DataFrame.
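Here’s a rough sketch of how na_rep changes the output. It uses a hypothetical two-row DataFrame and returns strings rather than writing new-data.csv, so you can compare the default and the labeled output side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing continent.
df = pd.DataFrame(
    {"COUNTRY": ["Russia", "India"], "CONT": [np.nan, "Asia"]},
    index=["RUS", "IND"],
)

# Default: missing values become empty fields.
default_text = df.to_csv()

# With na_rep, missing values are written as the given label.
labeled_text = df.to_csv(na_rep="(missing)")

print(default_text)
print(labeled_text)
```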
02:31
When pandas reads files, it considers the empty string and a few others as missing values by default. If you don’t want this behavior, then you can pass keep_default_na=False to the pandas read_csv() function. To specify other labels for missing values, use the parameter na_values.
03:06
Here, you’ve marked the string '(missing)' as a new missing data label, and pandas replaced it with nan when it read the file.
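As a small illustration of both parameters, the sketch below reads hypothetical CSV text from an in-memory buffer instead of a file. The column names mirror the ones in this lesson, but the rows are made up for the example.

```python
import io

import pandas as pd

csv_text = "ID,COUNTRY,CONT\nRUS,Russia,(missing)\nIND,India,Asia\n"

# Without na_values, '(missing)' is read as an ordinary string.
raw = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With na_values, '(missing)' is treated as missing and becomes nan.
marked = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=["(missing)"])

# With keep_default_na=False, an empty field stays an empty string.
empty_text = "ID,COUNTRY,CONT\nRUS,Russia,\n"
kept = pd.read_csv(io.StringIO(empty_text), index_col=0, keep_default_na=False)

print(raw.loc["RUS", "CONT"])
print(marked.loc["RUS", "CONT"])
print(repr(kept.loc["RUS", "CONT"]))
```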
03:15
When you load data from a file, pandas assigns the data types to the values of each column by default. You can check these types with .dtypes, as seen onscreen.
03:28
The columns with strings and dates (COUNTRY, CONT, and IND_DAY) have the data type object. Meanwhile, the numeric columns contain 64-bit floating-point numbers, float64.
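To try the type check yourself, you could run something like the sketch below. The data is a hypothetical sample read from an in-memory buffer, and the exact label reported for string columns may vary between pandas versions.

```python
import io

import pandas as pd

# Hypothetical sample; the empty IND_DAY field is a missing value.
csv_text = (
    "ID,COUNTRY,POP,IND_DAY\n"
    "CHN,China,1398.72,\n"
    "IND,India,1351.16,1947-08-15\n"
)

df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# .dtypes lists the inferred data type of each column.
# Without parse_dates, the dates are still plain strings.
print(df.dtypes)
```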
03:42
You can use the parameter dtype to specify the desired data types and parse_dates to force the use of datetimes.
04:19
Now you have 32-bit floating-point numbers as specified with dtype. These differ slightly from the original 64-bit numbers because of the lower precision.
04:30
The values in the last column are considered as dates and have the data type datetime64. That’s why the NaN values in this column are replaced with NaT.
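The sketch below shows one way dtype and parse_dates might be combined. It reads a hypothetical sample from an in-memory buffer, so the values are illustrative rather than the ones shown in the video.

```python
import io

import pandas as pd

csv_text = (
    "ID,COUNTRY,POP,IND_DAY\n"
    "CHN,China,1398.72,\n"
    "IND,India,1351.16,1947-08-15\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    dtype={"POP": "float32"},   # store POP as 32-bit floats
    parse_dates=["IND_DAY"],    # parse IND_DAY as datetimes
)

print(df.dtypes)

# The missing date is now NaT instead of NaN.
print(df.loc["CHN", "IND_DAY"])

# float32 can't store 1398.72 exactly, so the value shifts slightly.
print(float(df.loc["CHN", "POP"]))
```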
04:46 Now that you have real dates, you can save them in the format you’d like.
05:00 This format string specifies that each date is written as the name of the month, then the day, followed by the full year.
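The format string itself isn’t captured in this transcript. Assuming it was something like '%B %d, %Y' (month name, day, full year), passing it to .to_csv() through the date_format parameter could look like this on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample with one real date and one missing date.
df = pd.DataFrame(
    {"IND_DAY": pd.to_datetime(["1947-08-15", None])},
    index=["IND", "CHN"],
)

# '%B %d, %Y' is an assumed format: month name, day, four-digit year.
text = df.to_csv(date_format="%B %d, %Y")
print(text)
```

Because this format contains a comma, pandas wraps each formatted date in quotes so that it still counts as a single field. The missing date is written as an empty field.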
05:22
There are several other optional parameters that you can use with .to_csv(). sep denotes a value separator, decimal indicates a decimal separator, encoding sets the file encoding, and header specifies whether you want to write column labels in the file.
05:43
Onscreen, you can see how to pass arguments for sep and header.
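Since the onscreen call isn’t reproduced here, this is a minimal sketch of passing sep and header, again with a hypothetical two-row DataFrame and a returned string in place of a file.

```python
import pandas as pd

# Hypothetical sample data, not the course's dataset.
df = pd.DataFrame(
    {"COUNTRY": ["China", "India"], "POP": [1398.72, 1351.16]},
    index=["CHN", "IND"],
)

# sep=';' switches the value separator to a semicolon,
# and header=False leaves out the row of column names.
text = df.to_csv(sep=";", header=False)
print(text)
```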
05:56
The data is now separated with a semicolon, and because header=False, the data is represented without the header row of column names. The pandas read_csv() function has many additional options for managing missing data, working with dates and times, quoting, encoding, handling errors, and much more. For instance, if you have a file with one data column and want to get a Series object instead of a DataFrame, then you can pass squeeze=True to read_csv(). Note that this parameter has been deprecated since pandas 1.4.0, and the deprecation warning suggests calling .squeeze() on the resulting DataFrame instead.
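On pandas versions where the squeeze keyword is deprecated or no longer available, a sketch of the suggested alternative is to read the file normally and then call .squeeze("columns") on the result. The single-column data below is a hypothetical sample.

```python
import io

import pandas as pd

# Hypothetical file content with a single data column.
csv_text = "POP\n1398.72\n1351.16\n"

df = pd.read_csv(io.StringIO(csv_text))

# Squeezing along the columns axis turns a one-column DataFrame into a Series.
series = df.squeeze("columns")

print(type(series))
print(series)
```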
06:30 You’ll learn later on about data compression and decompression, as well as how to skip rows and columns. But next up, you’ll be looking at how pandas can work with the popular JSON file format.
Bartosz Zaczyński RP Team on July 25, 2022
@Brannen Taylor Good catch. The squeeze parameter has been deprecated since pandas version 1.4.0. The warning depends on your pandas version, not on your Python version.
Brannen Taylor on July 23, 2022
I’m using Python 3.9.13. When I use the squeeze parameter, it warns me it’s being deprecated and to append .squeeze instead.