An Introduction to DuckDB

Christopher Trudeau

Starting With DuckDB and Python Christopher Trudeau 07:13

Transcript
Discussion

00:00 In the previous lesson, I gave a quick refresher on SQL. This lesson is your first foray into DuckDB. Quack. I was too lazy to edit the sound effect in that time.

00:12 DuckDB is a little different from your usual database in that it happily works on table-like things in memory, as well as in your more traditional file format.

00:21 This means you can load data into memory similar to how you would with a DataFrame in pandas or Polars, and then perform operations directly upon it. Once you’re happy with it, you can optionally send it down to a file database for future use. The key interface to this mechanism in the Python API is an object called DuckDBPyRelation.

00:42 It’s a bit of an ugly name, but you’ll never have to type it out. You’ll be getting these objects back in return from calls to the API. The methods on DuckDBPyRelation are very similar to those on a DataFrame, meaning you can inspect the data, get information about it, and perform calculations.

01:00 If you’re not familiar with pandas or Polars and haven’t played with DataFrames before, don’t worry about it. I’ll walk you through the operations. DuckDB interacts with a variety of file formats including CSV, JSON, Parquet and much more.

01:14 Some of these can be ingested through read methods in the library while others are available through plugins. Let’s waddle over to the REPL and take a quack at the API.

01:24 Yeah, I know that’s bad. It’s only going to get worse from here. Okay, let’s dive in. I use that expression all the time, but this time it comes with the extra visualization of a duck dipping below the surface.

01:38 As you might guess, the first step is to import the library. The module defines several functions you can use, including one to directly call SQL.

01:54 This SELECT statement is a little different from the ones I showed you in the previous lesson. It is dynamically generating a result without querying any tables.

02:02 You’ll almost never do this in practice. Just consider this a “Hello World!”-like query that doesn’t require any tables to work. You’re unlikely to do much with SQL unless you actually have data, so let’s get some data.

02:16 The data files I’ll be using throughout this course are available in the supporting material dropdown below the video window.

02:30 Parquet is an open-source column-based data format from the Apache project. Parquet includes data type information and supports compression, so it’s become quite popular in the data science world.

02:41 Let’s look at the result. There you go: 47 rows of American presidents. Let me scroll back up here a bit.

02:51 At the top of the output, you can see the names of the columns and their data types. Like with the actor’s example in the previous lesson, you’ve got first and last name information.

03:00 The result here also contains a sequence number. If you’re coming from other databases, consider this a primary key. If you’re coming from pandas, this is an index.

03:10 Either way, it helps you uniquely identify a row. I’ll just duck back down here. They’ll keep coming. I can’t be stopped. Now let’s examine the result object.

03:23 The .shape property is a tuple telling you how much data you have. The first element is the number of rows, while the second is the number of columns.

03:31 If you’ve used DataFrames before, this should be familiar to you. The column information I showed at the top of the table output is also contained in a property.

03:42 The .columns property has the names while,

03:46 the .dtypes property has the data types. Like with a DataFrame, there are a variety of methods available that let you explore the contents of the object.

03:58 count() counts the number of items in a named column. In this case, the same result would’ve happened regardless of what column I chose as there are 47 rows.

04:10 describe() gives an overview of the contents column by column. Note at the bottom here that it says there are seven columns, but five are shown.

04:18 This is because the result is too wide for my screen. Each row in the result contains a different statistical operation for each of the correspondent columns.

04:27 You already saw count(), while mean(), std(), min(), max(), and median() perform those calcs for each column.

04:34 For some operations that doesn’t make sense. You’ll never need the standard deviation of the sequence column and you can’t do a standard deviation on dates, but the info’s there for everything.

04:49 The first() method returns the named column’s value for the first row.

04:56 While the histogram() method gives a histogram of the data counting the number of instances of each item in the name column. Note that the results of each of these operations has been a summary table, very similar to when I printed out the president’s object itself.

05:10 That’s because both situations return a relation object. For now, I’m showing the evaluation response of a relation object. Later, I’ll show you how to access these programmatically, which is one way to get at those two columns that got truncated off of the screen.

05:26 If you don’t feel like typing SQL, DuckDB provides methods that follow the same logic. For example, you can `select().

05:38 Let me scroll back, like with an SQL SELECT, I gave the column names to be selected. The table part is implicit here as the operation is running on our object.

05:49 Also like calling SELECT as SQL, the result of this function is a relation object, in this case, containing the last and first names from our data.

05:58 I’ll scroll back down. There are other query like operations as well. unique() removes duplicates.

06:08 Notice there are 40 rows in that result, so seven presidents have duplicate last names. Some of those are relatives and some of those are because of gaps between terms.

06:19 The results from the describe() call can also be queried on a column directly. max() returns the biggest value, which for strings is the alphabetically last result.

06:35 min() is the smallest, and sum() is the total. Summing the party_id makes no sense whatsoever, but it was an integer and I wanted to show it. Crazy, I know. You might even call me quackers.

06:52 mean() the average of the party_id is just as silly, but like I said, integers for the demonstration win.

07:00 What did the duck buying lipstick tell the cosmetician? Put it on my bill. Next up, I’ll move from in-memory operations to file-based ones. That and more duck puns. It’ll be foul.

Become a Member to join the conversation.