Locked learning resources

Join us and get access to thousands of tutorials and a community of expert Pythonistas.

Unlock This Lesson

Locked learning resources

This lesson is for members only. Join us and get access to thousands of tutorials and a community of expert Pythonistas.

Unlock This Lesson

Joining Data

00:00 In the previous lesson, I showed you how to programmatically access the data in a relation object. In this lesson, I’ll be querying data across multiple tables with an SQL join.

00:11 A SELECT statement can build a result set by querying across multiple tables. This is known as a JOIN. When you add a JOIN to a SELECT statement, you need to specify which tables are participating in the join, and what condition or conditions to use to relate the rows between the tables.

00:29 This mechanism is often used to express one -to-many relationships in your data. For example, a book can have multiple authors. If you have a book table with an author column, you wouldn’t be able to express this.

00:42 You’d be stuck with however many author columns you had. You could have three author columns, but what if the book has four authors? Also, what if you want to store more information about the author. Instead, the structure you use in this case is a table for books with each book having a unique ID, and a table for authors, who also get their own ID.

01:04 By having separate tables, you can have all sorts of information that are specific to each thing. The ISBN number could go with the book, while the date of birth could go with the author.

01:15 To express the relationship between a book and its authors, you would then need a third table. This table would contain a column for a book’s ID and a column for an author’s ID.

01:26 A book with multiple authors would be expressed by multiple rows in this relationship table, one for each book-author pair. With this kind of table structure in place, you would use a join operation to query which authors are associated with any given book or books.

01:44 Although the three-table structure that I just mentioned is common, it’s a little complex, so I’m going to use a slightly simpler example. On screen here is a simplified version of our presidents data, along with a new table that contains the political parties information.

02:00 You may recall that our president table had a column called party_id. The party table then has a column which uses that same identifier. To query the president and their party name, you need a join.

02:17 The SELECT portion is similar to what you’ve seen before. Just note that the columns in the SELECT can be from any of the tables participating in the join.

02:28 After the JOIN keyword, you provide the other tables participating. In this case, it’s the parties table. The ON keyword dictates how to relate the data between the two tables.

02:41 In our case, the party_id is common between them, so when the party_id on presidents matches the party_id on the party’s table, the data gets joined.

02:51 The resulting data is each president with the name of their party. Joins can get much more complicated than this, having more than two tables as well as different rules for how to match data on the left to data on the right, this is really just a taste, a tiny little breadcrumb tossed in a pond.

03:08 Let’s go see this in practice. Same import, same data,

03:18 and a quick query to remind you what’s in there. You may recall from a previous lesson, I used the limit method to restrict how many rows got shown in the result.

03:27 The reason it’s called LIMIT rather than HEAD, is because LIMIT is an SQL keyword. I’ve used it in my query here to return just the first two results.

03:36 Alright, that’s our presidents data. Now let’s go get the party info.

03:45 Parquet and CSV are so 10 minutes ago. This time the data’s in JSON. Note, I’ve chained the to_table() call here on the end.

03:59 So far, so familiar. Okay, how about a quick sanity check? How do you know what’s in your database? Well, the database itself contains metadata about the database, which you can query.

04:15 The special duckdb_tables table tells us there are two tables in the database. Okay, let’s construct our join.

04:37 So far, this is just like the example I showed you. Let’s add a little orange glaze to our duck.

04:45 Rather than seeing everyone, I’ve restricted the results to just the Whig party. This could also have been done with the party_id on either table if you knew the ID for the Whigs.

05:00 The ORDER BY clause specifies the sorting of the results. It takes an argument as to which column in the results to perform the sorting upon.

05:07 The DESC keyword, pronounced descending, means descending, specifying the sort order being that direction around. And there you go. There were four presidents from the Whig party.

05:22 How funny is that? One of them’s named Millard. I swear to you, I didn’t plan that. You may recall I said it was good practice to close your connection when you’re done with it.

05:32 This is me doing that. To come full circle, let’s see the same idea using in-memory operations.

05:45 Note, I’m not using a connection here. The same read methods are available directly on the module.

05:55 I’ve got both presidents and parties as relation objects. Remember, these aren’t tables. The connection’s closed. Python no longer knows about those tables at all.

06:06 Queries can be performed on relation objects as if they were tables by using the sql() function directly on the DuckDB module.

06:27 Same query, same results, but this time it was done on the memory objects instead. I’ve been fast and loose with the names here, using presidents to mean a table earlier and a relation now.

06:41 You can almost think of them as the same things as you can perform queries on either, but make sure you don’t confuse them. The relation only lives as long as this REPL session, whereas the table can be accessed again by connecting to the database once more.

06:55 Let’s do this one more time in a third way, this time through method calls.

07:05 There’s a little bit of housekeeping necessary before using the API. You need to give DuckDB a name to associate with your relation object for use inside the API calls.

07:20 You can think of this like registering the object with a name. This name doesn’t have to be the same as the name of the variable you’re using to store the relation object, but typically that’s what you do.

07:30 Now, let me construct the same query as before, but this time using chained method calls. The reason for the aliases will be clearer in a second.

07:39 Starting with the president’s relation object,

07:48 and chaining a join() method. This takes the party’s relationship object as an argument and a condition for the joining. The condition is a string. Inside the string, you need to reference the objects, hence the alias.

08:02 DuckDB parses the condition like it does inside an SQL statement, and the alias is what tells it to associate those names in the condition with the relation objects.

08:13 This is kind of messy. If you’re used to ORMs like SQLAlchemy or Django, this feels kind of fragile. Both those ORMs have ways of writing this kind of expression with objects instead.

08:24 But hey, it works. With the tables joined, the select() method specifies the names of the columns for our result, and then the filter() call enacts the WHERE clause.

08:37 Finally, a call to order() to sort the results, and there you go. Third way, same result. The API method’s a little more verbose than writing SQL queries directly, but from your IDE’s perspective, SQL is just a string.

08:52 Using the API like this gives you type hinting and compiler enforcement, making bugs a little less likely. When done right, it also helps deal with SQL injection.

09:02 I won’t go into that here. The short version is if a user is passing you strings that you’re basing your query upon, they could include stuff that mangles your SQL.

09:10 And when I say mangle, I mean vicious things like deleting data or tables. For a little more information on this, Google OWASP, that’s O-W-A-S-P SQL injection attacks for a little bit of homework.

09:25 Why did the mallard want to be an accountant? Because he loved deductions. Tip your waitress. I’ll be here all week. Next up, how to incorporate Python functions into your DuckDB queries.

Become a Member to join the conversation.