Using DuckDB With Databases

Christopher Trudeau

Starting With DuckDB and Python Christopher Trudeau 05:03

Transcript
Discussion

00:00 In the previous lesson, I started your DuckDB journey by showing you the relation object. In this lesson, you’ll get to do more database-y things and put your SQL knowledge to practice.

00:11 Okay, let’s do some typical databasey-like things. To interact with DuckDB’s file-based mechanism, you need a connection. This is pretty much the same as any database API and like other file-based databases, the connection takes a filename as an argument.

00:27 The resulting connection object then is your API access point for your operations. In the previous lesson, you saw how to get a relation object using the read_parquet() call.

00:38 If you’ve got data like that and you want it in an actual database table, you can do that using the to_table() method. Off to the REPL. I promise I’ll go easy on the duck puns.

00:49 I wouldn’t want you to think I was mallard justed. Like before I start out by importing the duck. This time, I’ll also need a connection. In the supporting material dropdown below the video window, you’ll find a sample code.

01:02 Inside that, there’s a directory called data, which includes the files that I’m going to load. So I don’t confuse files I produce with the files that I’ve loaded.

01:11 I’m putting our output database in a directory called processed. If you’re coding along with me, you’ll need to create that directory before running the following command.

01:26 When you connect using the database argument, DuckDB either loads an existing file or creates it as necessary. All subsequent operations on the connection I’ve just created will be on this newly-created database.

01:40 Since it is new, it’s empty. Let’s get some data.

01:51 This is the same call as in the previous lesson where I’ve loaded the president’s parquet data. I’ve intentionally used a rather long name for the results here to distinguish between the relation object and the things in the database.

02:03 In the real world, you probably wouldn’t bother with this, but I’m hoping to keep it clear when something is in memory and when something is in the table, at least for now.

02:12 DuckDB’s power is the ability to go back and forth between memory and file. This can also be a little confusing. Several times when I was writing the sample code, I failed to write something to the database because I was in memory and I thought I was in a table.

02:25 It’s just something to keep in mind. Confusing or not, it’s powerful. Let’s put some of that SQL to practice.

02:35 Going to select the sequence last_name and first_name columns. Selecting from the relation object. See, that’s cool. Even though I’m in memory, I can still do database-y things.

02:48 And rather than look at all 47 rows, I’m going to restrict it down to a small sequence number.

02:56 And the result is the first two presidents. Now remember, the president’s relation object is an object. It isn’t in the database yet. To put it there, you use the to_table() call.

03:13 This method has created a table named presidents in the database containing all the same stuff from the relation object. When you’re done with a connection, you should close it, especially if you’re writing long-running code. Connections take up memory and you should tidy up after yourself.

03:28 If I were more clever, I’d put a pun about feather dusters in here. To prove that the content got put into the database, let’s connect to it again.

03:44 Same call as before

03:49 and some similar SQL.

03:54 Note the difference. This time I’m selecting from the presidents table, not the relation. Since the relation got converted into a table, you shouldn’t be surprised to see the same data.

04:05 I do have something that might be a surprise though. Close your eyes for a second. No peeking. Get it, like the roast? I’m sure you want to roast me at this point.

04:14 Of course, you can’t really pay attention with your eyes closed, so here it goes.

04:27 And falls down. This call queried the relation object, not the table. Since the relation object was created using the original connection, it can’t be used with this new connection.

04:38 Relation objects are specific to the connection. Don’t close your connection too early if you still need to perform operations in memory.

04:48 So far, I’ve only used the Parquet format. To be productive, seriously, no power in the universe can stop me, you may need to work with other formats. In the next lesson, I’ll show you CSV and why it can occasionally be challenging.

Become a Member to join the conversation.