Loading video player…

Starting With DuckDB and Python (Overview)

The DuckDB database provides a seamless way to handle large datasets in Python with Online Analytical Processing (OLAP) optimization. You can create databases, verify data imports, and perform efficient data queries using both SQL and DuckDB’s Python API.

By the end of this video course, you’ll understand that:

  • You can create a DuckDB database by reading data from files like Parquet, CSV, or JSON and saving it to a table.
  • You query a DuckDB database using standard SQL syntax within Python by executing queries through a DuckDB connection object.
  • You can also use DuckDB’s Python API, which uses method chaining for an object-oriented approach to database queries.
  • Concurrent access in DuckDB allows multiple reads but restricts concurrent writes to ensure data integrity.
  • DuckDB integrates with pandas and Polars by converting query results into DataFrames using the .df() or .pl() methods.
Download

Course Slides (.pdf)

1.4 MB
Download

Sample Code (.zip)

13.8 KB

00:00 Welcome to Starting with DuckDB in Python. My name is Christopher, and I will be your guide. This course is all about DuckDB, a popular open-source database.

00:10 Along the way, you’ll be learning about how to use DuckDB as an in-memory database through Python, using it as a more traditional file-based database, and just enough SQL to get you going, along with how to incorporate Python functions into your Duck DB calculations.

00:29 DuckDB is a third-party library. As always, best practice is to use a virtual environment to do this, do pip install duckdb, or whatever tool you like to use to do such things.

00:41 The code in this course was tested with DuckDB version 1.20.2, and Python 3.13.3. I’m not using anything 3.13 specific though, so older supported versions of Python should be fine.

00:55 DuckDB, I promised I won’t do that every time, is an open-source database. It’s column-oriented, which means it shares some aspects with DataFrame libraries like pandas and Polars.

01:07 In fact, it has built-in calls to integrate with both those libraries, and although it is a relational database, its focus is a little different than your typical Postgres or SQL Server.

01:18 It isn’t meant to be a shared backend on a server, but a tool for doing data analytics. If you use a DataFrame library and sometimes wish you could just use SQL to solve a problem, then DuckDB—seriously ignore me, I’ll get bored and stop eventually—might be the solution for you.

01:35 The database whose name I will now skip can be used directly like SQLite or through a programming language, and that’s pretty much any language you want, as it integrates with over 20 of them.

01:47 They divide the support into three tiers, and how quickly features and fixes get applied depends on the tier. Python is one of the ten primary support tier languages.

01:59 One result of the data analytics focus is that both in-memory and traditional file-based database operations are supported. Another result of that focus is the integration with a wide variety of data formats.

02:10 In this course, I’ll be using CSV, JSON, and Parquet.

02:15 Before diving into DuckDB, I behaved myself that time, you need to know a bit of SQL. This isn’t a SQL course and the only stuff I’ll be using in practice is the SELECT statement.

02:27 You can probably get along with just reading what’s on the screen, but in case you need a refresher, I’ll give you a quick intro next. If you’re comfortable with basic SQL, feel free to skip to lesson three.

Become a Member to join the conversation.