Working With CSV Files

Reading and Writing Files With pandas Darren Jones 06:42

Transcript
Discussion (2)

00:00 Working with CSV files. You’ve already learned how to read and write CSV files. Now let’s dig a little deeper into the details. When you use .to_csv() to save your DataFrame, you can provide an argument for the parameter path_or_buf to specify the path, name, and extension of the target file. path_or_buf is the first parameter of .to_csv(), so you can pass it positionally.

00:26 It can be any string that represents a valid file path that includes the filename and its extension. You’ve already seen this in a previous example. However, if you omit path_or_buf, then .to_csv() won’t create any files.

00:41 Instead, it will return the corresponding string.
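A minimal sketch of this behavior, assuming a small stand-in DataFrame (the rows and values here are illustrative and not necessarily the lesson's data):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the lesson's DataFrame; values are placeholders.
df = pd.DataFrame(
    {
        "COUNTRY": ["China", "India", "Russia"],
        "CONT": ["Asia", "Asia", np.nan],
        "IND_DAY": [np.nan, "1947-08-15", "1992-06-12"],
    },
    index=["CHN", "IND", "RUS"],
)

# No path_or_buf: nothing is written to disk and the CSV text is returned.
csv_text = df.to_csv()
print(csv_text)
```

Missing values show up as empty fields between the commas in the returned string.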

00:55 As you can see onscreen, now you have the string instead of a CSV file. You also have some missing values in your DataFrame object. For example, the continent for Russia and the independence days for several countries are not available. In data science and machine learning, you must handle missing values carefully, and pandas excels here.

01:17 By default, pandas uses the nan value to represent missing values. nan stands for Not a Number, and it’s a special floating-point value in Python.

01:29 You can get a nan value with any of the expressions shown onscreen. The continent that corresponds to Russia in the DataFrame is nan.
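The transcript doesn't list what is shown onscreen, so the three expressions below are an assumption: they are common ways to get a nan value from the standard library and NumPy.

```python
import math

import numpy as np

# Three common ways to obtain nan (assumed, not taken from the video).
nan_values = [float("nan"), math.nan, np.nan]

for value in nan_values:
    # nan is a float, and it never compares equal to anything, itself included.
    print(type(value).__name__, value == value, math.isnan(value))
```

Because nan is not equal to itself, check for it with math.isnan() or pd.isna() rather than with ==.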

01:45 When you save your DataFrame to a CSV file, empty strings will represent the missing data. You can see this in your file data.csv and in the string created onscreen earlier.

01:57 If you want to change this behavior, then use the optional parameter na_rep.

02:12 This code produces the file new-data.csv, where the missing values are no longer empty strings. You can see the contents of that file onscreen now. Note that the string '(missing)' in the file corresponds to the nan values from the DataFrame.
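As a rough sketch of na_rep, assuming a small made-up DataFrame and returning the CSV text as a string rather than writing a file:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in data; the continent for Russia is missing.
df = pd.DataFrame(
    {"COUNTRY": ["India", "Russia"], "CONT": ["Asia", np.nan]},
    index=["IND", "RUS"],
)

# na_rep sets the text written in place of every missing value.
csv_text = df.to_csv(na_rep="(missing)")
print(csv_text)
```

Passing a path such as 'new-data.csv' as the first argument would write the same text to a file instead.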

02:31 When pandas reads files, it considers the empty string and a few others as missing values by default. If you don’t want this behavior, then you can pass keep_default_na=False to the pandas read_csv() function. To specify other labels for missing values, use the parameter na_values.

03:06 Here, you’ve marked the string '(missing)' as a new missing data label, and pandas replaced it with nan when it read the file.
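The effect of na_values and keep_default_na can be sketched as follows. The CSV text is made up for illustration and is read from memory with io.StringIO rather than from a file:

```python
import io

import pandas as pd

# Made-up CSV text: one '(missing)' label and one empty COUNTRY field.
csv_text = (
    "ID,COUNTRY,CONT\n"
    "IND,India,Asia\n"
    "RUS,Russia,(missing)\n"
    "XXX,,Europe\n"
)

# Defaults: the empty field becomes nan, '(missing)' stays an ordinary string.
default = pd.read_csv(io.StringIO(csv_text), index_col=0)

# na_values adds '(missing)' to the labels that are treated as missing.
labeled = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=["(missing)"])

# keep_default_na=False turns off the default labels, so the empty field
# is kept as an empty string.
raw = pd.read_csv(io.StringIO(csv_text), index_col=0, keep_default_na=False)

print(default, labeled, raw, sep="\n\n")
```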

03:15 When you load data from a file, pandas assigns the data types to the values of each column by default. You can check these types with .dtypes, as seen onscreen.

03:28 The columns with strings and dates—COUNTRY, CONT, and IND_DAY—have the data type object. Meanwhile, the numeric columns contain 64-bit floating-point numbers, float64.

03:42 You can use the parameter dtype to specify the desired data types and parse_dates to force the use of datetimes.

04:19 Now you have 32-bit floating-point numbers as specified with dtype. These differ slightly from the original 64-bit numbers because of the lower precision.

04:30 The values in the last column are treated as dates and have the data type datetime64. That’s why the NaN values in this column are replaced with NaT.
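A minimal sketch of dtype and parse_dates, assuming the column names mentioned in the lesson and two illustrative rows read from an in-memory string:

```python
import io

import pandas as pd

# Illustrative CSV text using the lesson's column names; values are placeholders.
csv_text = (
    ",COUNTRY,POP,AREA,GDP,CONT,IND_DAY\n"
    "CHN,China,1398.72,9596.96,12234.78,Asia,\n"
    "IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15\n"
)

# Ask for 32-bit floats in the numeric columns and datetimes in IND_DAY.
dtypes = {"POP": "float32", "AREA": "float32", "GDP": "float32"}
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    dtype=dtypes,
    parse_dates=["IND_DAY"],
)

print(df.dtypes)
# Converting back to a Python float exposes the float32 rounding.
print(float(df.loc["CHN", "POP"]))
# The empty IND_DAY field is read as NaT rather than NaN.
print(df.loc["CHN", "IND_DAY"])
```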

04:46 Now that you have real dates, you can save them in the format you’d like.

05:00 This format string specifies how the dates are written: the name of the month, then the day, followed by the full year.
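The transcript doesn't show the format string itself, so the one below is an assumption chosen to match the description (month name, day, four-digit year). The data is made up for illustration:

```python
import pandas as pd

# Illustrative data with a real datetime column; one date is missing (NaT).
df = pd.DataFrame(
    {
        "COUNTRY": ["China", "India"],
        "IND_DAY": [pd.NaT, pd.Timestamp("1947-08-15")],
    },
    index=["CHN", "IND"],
)

# date_format takes strftime-style codes: %B month name, %d day, %Y year.
csv_text = df.to_csv(date_format="%B %d, %Y")
print(csv_text)
```

Because this particular format contains a comma, pandas wraps the formatted date in quotes so that it stays a single CSV field.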

05:22 There are several other optional parameters that you can use with .to_csv(). sep denotes a value separator, decimal indicates a decimal separator, encoding sets the file encoding, and header specifies whether you want to write column labels in the file.

05:43 Onscreen, you can see how to pass arguments for sep and header.
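A call of this kind might look like the following sketch, which uses a small made-up DataFrame and returns the CSV text as a string:

```python
import pandas as pd

# Illustrative stand-in data; values are placeholders.
df = pd.DataFrame(
    {"COUNTRY": ["China", "India"], "POP": [1398.72, 1351.16]},
    index=["CHN", "IND"],
)

# sep changes the field separator; header=False omits the column labels.
csv_text = df.to_csv(sep=";", header=False)
print(csv_text)
```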

05:56 The data is now separated with a semicolon, and because header=False, the data is represented without the header row of column names. The pandas read_csv() function has many additional options for managing missing data, working with dates and times, quoting, encoding, handling errors, and much more. For instance, if you have a file with one data column and want to get a Series object instead of a DataFrame, then you can pass squeeze=True to read_csv().
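As the discussion below points out, the squeeze parameter was deprecated in pandas 1.4.0, and pandas 2.0 removed it. A minimal sketch of the replacement the warning suggests, using made-up single-column data read from memory:

```python
import io

import pandas as pd

# Made-up CSV text with a single data column.
csv_text = "POP\n1398.72\n1351.16\n"

# Read a one-column DataFrame, then squeeze the column axis to get a Series.
ser = pd.read_csv(io.StringIO(csv_text)).squeeze("columns")

print(type(ser).__name__)
print(ser)
```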

06:30 You’ll learn later on about data compression and decompression, as well as how to skip rows and columns. But next up, you’ll be looking at how pandas can work with the popular JSON file format.

Brannen Taylor on July 23, 2022

I’m using Python 3.9.13. When I use the squeeze parameter, it warns me that it’s being deprecated and tells me to append .squeeze instead.

C:\Users\btaylor\AppData\Local\Temp\ipykernel_20504\420634629.py:2: FutureWarning: The squeeze argument has been deprecated and will be removed in a future version. Append .squeeze("columns") to the call to squeeze.

ser = pd.read_csv('single.csv').squeeze('columns')

Bartosz Zaczyński RP Team on July 25, 2022

@Brannen Taylor Good catch. The squeeze parameter was deprecated in pandas version 1.4.0. The warning is unrelated to your Python version.
