Working With CSV Files
You’ve already learned how to read and write CSV files. Now let’s dig a little deeper into the details. When you use .to_csv() to save your DataFrame, you can provide an argument for the parameter path_or_buf to specify the path, name, and extension of the target file. path_or_buf is the first argument that .to_csv() accepts. It can be any string that represents a valid file path that includes the filename and its extension. You’ve already seen this in a previous example. However, if you omit path_or_buf, then .to_csv() won’t create any files.
00:41 Instead, it will return the corresponding string.
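Here is a minimal sketch of that behavior. The two-row DataFrame is a hypothetical stand-in for the lesson's dataset, and the name csv_text is only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's DataFrame.
df = pd.DataFrame(
    {"COUNTRY": ["China", "India"], "POP": [1398.72, 1351.16]},
    index=["CHN", "IND"],
)

# No path_or_buf argument, so nothing is written to disk.
# The CSV content comes back as a string instead.
csv_text = df.to_csv()
print(csv_text)
```

The first line of the returned string holds the column labels, and the leading empty field is the unnamed index column.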
As you can see onscreen, now you have the string instead of a CSV file. You also have some missing values in your DataFrame object. For example, the continent for Russia and the independence days for several countries are not available. In data science and machine learning, you must handle missing values carefully, and pandas excels here.
By default, pandas uses the nan value to replace missing values. nan stands for Not a Number, and it’s a particular floating-point value in Python.
You can get a nan value with any of the functions shown onscreen. The continent that corresponds to Russia in the DataFrame is nan.
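The transcript doesn't record what was shown onscreen, so the following is only a sketch of three common ways to obtain a nan value, using the standard library and NumPy. The names candidates and all_nan are illustrative.

```python
import math

import numpy as np

# Three common spellings of the same floating-point value.
candidates = [float("nan"), math.nan, np.nan]

# nan never compares equal to anything, including itself,
# so test for it with math.isnan() rather than ==.
all_nan = all(math.isnan(value) for value in candidates)
print(all_nan)
print(float("nan") == float("nan"))
```

Because of that comparison rule, pandas offers .isna() for finding missing values in a DataFrame.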
When you save your DataFrame to a CSV file, empty strings will represent the missing data. You can see this in your file data.csv and in the string created onscreen earlier on.
If you want to change this behavior, then use the optional parameter na_rep.
This code produces the file new-data.csv, where the missing values are no longer empty strings. You can see the contents of that file onscreen now. Note that the string '(missing)' in the file corresponds to the nan values from the DataFrame.
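As a rough illustration, the sketch below compares the default output with the output when a replacement label is passed through na_rep. The small DataFrame is hypothetical, and the result is returned as a string rather than written to new-data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing continent.
df = pd.DataFrame(
    {"COUNTRY": ["Russia", "India"], "CONT": [np.nan, "Asia"]},
    index=["RUS", "IND"],
)

# Default: the missing value becomes an empty field.
default_text = df.to_csv()

# na_rep swaps in a label of your choice for every missing value.
labeled_text = df.to_csv(na_rep="(missing)")

print(default_text)
print(labeled_text)
```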
When pandas reads files, it considers the empty string and a few others as missing values by default. If you don’t want this behavior, then you can pass keep_default_na=False to the pandas read_csv() function. To specify other labels for missing values, use the parameter na_values.
Here, you’ve marked the string '(missing)' as a new missing data label, and pandas replaced it with nan when it read the file.
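The two options can be compared side by side. This is a sketch that reads from an in-memory string instead of a file, and the sample rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content: one custom label and one empty field.
csv_text = "COUNTRY,CONT\nRussia,(missing)\nIndia,Asia\nJapan,\n"

# na_values adds '(missing)' to the labels treated as missing.
# The empty field is still missing because the defaults remain active.
df = pd.read_csv(io.StringIO(csv_text), na_values=["(missing)"])

# keep_default_na=False switches the default labels off,
# so both '(missing)' and the empty field stay ordinary strings.
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(df)
print(raw)
```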
When you load data from a file, pandas assigns the data types to the values of each column by default. You can check these types with .dtypes, as seen onscreen. The columns with strings and dates, including IND_DAY, have the data type object. Meanwhile, the numeric columns contain 64-bit floating-point numbers, float64.
You can use the parameter dtype to specify the desired data types and parse_dates to force the use of datetimes.
Now you have 32-bit floating-point numbers as specified with dtype. These differ slightly from the original 64-bit numbers because of smaller precision. The values in the last column are considered as dates and have the data type datetime64. That’s why the NaN values in this column are replaced with NaT.
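A minimal sketch of both parameters follows. It assumes a small in-memory CSV with a POP column and an IND_DAY column, which is a cut-down, hypothetical version of the lesson's file.

```python
import io

import pandas as pd

# Hypothetical CSV content; the second row has no independence day.
csv_text = (
    ",POP,IND_DAY\n"
    "IND,1351.16,1947-08-15\n"
    "CHN,1398.72,\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    dtype={"POP": "float32"},   # request 32-bit floats for this column
    parse_dates=["IND_DAY"],    # parse this column as datetimes
)

print(df.dtypes)
print(df)
```

The missing date shows up as NaT, the datetime counterpart of nan.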
04:46 Now that you have real dates, you can save them in the format you’d like.
05:00 This date format string specifies that each date is written as the full name of the month, then the day, followed by the four-digit year.
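The transcript doesn't capture the onscreen code, so here is a sketch that assumes the date_format parameter of .to_csv() with the format string '%B %d, %Y', which matches the description of month name, day, and full year. The DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical frame with one real date and one missing date.
df = pd.DataFrame(
    {"IND_DAY": [pd.Timestamp("1947-08-15"), pd.NaT]},
    index=["IND", "CHN"],
)

# %B is the full month name, %d the day, %Y the four-digit year.
csv_text = df.to_csv(date_format="%B %d, %Y")
print(csv_text)
```

Because the formatted date contains a comma, pandas wraps it in double quotes so that it's still read back as a single field.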
There are several other optional parameters that you can use with .to_csv(): sep denotes a value separator, decimal indicates a decimal separator, encoding sets the file encoding, and header specifies whether you want to write column labels in the file.
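A short sketch of three of these parameters, using a hypothetical two-row DataFrame and returning strings rather than writing files:

```python
import pandas as pd

# Hypothetical stand-in for the lesson's DataFrame.
df = pd.DataFrame(
    {"COUNTRY": ["China", "India"], "POP": [1398.72, 1351.16]},
    index=["CHN", "IND"],
)

# Semicolon-separated values, no header row of column labels.
semicolon_text = df.to_csv(sep=";", header=False)

# Same, but with a comma as the decimal separator.
european_text = df.to_csv(sep=";", decimal=",", header=False)

print(semicolon_text)
print(european_text)
```

A semicolon separator and a comma decimal mark are a common pairing, since a comma can't serve both roles without quoting.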
Onscreen, you can see how to pass arguments for sep and header.
The data is now separated with a semicolon, and because header=False, the data is represented without the header row of column names.
The pandas read_csv() function has many additional options for managing missing data, working with dates and times, quoting, encoding, handling errors, and much more. For instance, if you have a file with one data column and want to get a Series object instead of a DataFrame, then you can pass squeeze=True.
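Note that the squeeze parameter of read_csv() was deprecated in pandas 1.4.0 and removed in pandas 2.0. The sketch below shows the replacement suggested by the deprecation warning, which is to call .squeeze("columns") on the result. The CSV content is hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV content with an index column and one data column.
csv_text = ",POP\nCHN,1398.72\nIND,1351.16\n"

df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Collapse the single-column DataFrame into a Series.
series = df.squeeze("columns")
print(type(series))
print(series)
```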
06:30 You’ll learn later on about data compression and decompression, as well as how to skip rows and columns. But next up, you’ll be looking at how pandas can work with the popular JSON file format.
@Brannen Taylor Good catch. The squeeze parameter has been deprecated since pandas version 1.4.0, and the warning doesn’t depend on your Python version.
Brannen Taylor on July 23, 2022
I’m using Python 3.9.13. When I use the squeeze parameter, it warns me that it’s being deprecated and to append .squeeze instead.