Resource linked in this lesson: DuckDB Python API Guide
Ingesting CSV Files
00:00 In the previous lesson, I demonstrated using SQL in a file-based database. In this lesson, I’ll show you how to ingest data from a CSV file.
00:09 CSV is probably the most common format for data, and if it isn’t, then Excel, the other leading contender, can output to CSV. A CSV file uses each line in the file as a row and has commas separating the values in the row into columns.
00:25 Hence, CSV: Comma-Separated Value file. It being everywhere is great. On the downside, it’s a very bare format. It doesn’t contain any data type information, or to be technically correct, it only ever contains text.
00:41
Most CSV readers attempt to detect the content of the text and put it into a logical data type. If a particular column only contains numeric values, then it treats it as an integer or a float for example. DuckDB’s read_csv()
method does this for you.
00:58 It’s not always right though.
01:01
The DuckDB site has a guide dedicated to reading CSV data, and one of the reasons you might need such a guide is the read_csv()
call takes a lot of arguments.
01:11
They’re there to help you control how the file gets turned into your relation object. I’ll briefly outline some of the more common ones. For example, the columns
parameter takes a dictionary and lets you specify what the name of each column is along with its corresponding data type.
01:27
If read_csv()
’s guess isn’t right, then this can go a long way to fixing your problems.
01:34
A common issue when guessing data types is the myriad of formats for dates. The dateformat
argument allows you to specify an expected date format.
01:43 If DuckDB sees a column using this format, it will automatically convert it to a date type. Localization can always be fun. North America uses a period as the separator between the whole and decimal part of a number.
01:58
Not all places do though, and you can provide the decimal_separator
argument. If you want to specify a comma instead. Of course, that leads to a different problem as the C in CSV is for comma.
02:10 Although the term CSV technically means a single kind of file, the term often gets used loosely to include similar formats that separate values with tabs or other special characters like the pipe symbol.
02:22
If your file uses something other than commas, you can tell the reader that with the delim
argument. Some CSV files include a header line, and some jump right to the data.
02:33
If yours has a header, setting header
to True
will tell DuckDB to use that information to populate the column names rather than include it as a row.
02:42 This short list is only swimming peacefully over the surface. Underneath, the webbed feet of arguments are paddling quickly.
02:49 Wow, that was a tortured metaphor. There’s lots of other arguments as well. That’s what I was trying to say. Let’s go play with CSV files.
02:59
I’ll start by fetching some data using the read_csv()
method.
03:07
Like with read_parquet()
, the result is a relation object. This CSV file has the same 47 rows of presidents as the parquet equivalent. Let’s take a quick look at the first few values.
03:20
The limit()
method on a relation object works like the head
command in Unix, in this case showing just the first five rows of data.
03:32
Take a close look at the data types that go with each of these columns. VARCHAR
is database speak for string, which is what you want for the first and last names, but not what you want for the start and end terms.
03:44 Those should be dates. This is where the lack of data type information in CSVs is problematic. Luckily, with just one of those magical arguments I mentioned before,
04:05 you can tell DuckDB what a date looks like in this file, and then it’ll properly detect that term start and term end are dates. Notice that when printing the result, DuckDB shows this in the ISO format regardless of the fact that the CSV used something else.
04:25 What’s the highest honor in waterfowl football? Being inducted into the Hall of Fame. I know, so funny, right? I quacked me up. So far, I’ve been simply letting the REPL print results to the screen.
04:39 In the next lesson, I’ll show you how to get at the results themselves.
Become a Member to join the conversation.