Episode 142: Orchestrating Large and Small Projects With Apache Airflow
The Real Python Podcast
Jan 27, 2023 54m
Have you worked on a project that needed an orchestration tool? How do you define the workflow of an entire data pipeline or a messaging system with Python? This week on the show, Calvin Hendryx-Parker is back to talk about using Apache Airflow and orchestrating Python projects.
Episode Sponsor: CData Software
Calvin is the co-founder and CTO of Six Feet Up and a Python Web Conference co-organizer. He’s recently been working on a massive project that requires thousands of jobs involving transferring and transforming data. Through his research into orchestration systems, he found Apache Airflow.
Airflow is an open-source tool to define, schedule, and monitor workflows. The platform is pure Python and integrates with a wide variety of services. We discuss how workflows are defined by creating directed acyclic graphs (DAGs).
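To make the DAG idea concrete, here is a minimal sketch that uses only Python's standard-library `graphlib` module. It is not Airflow's API and is not taken from the episode. The task names (`extract`, `transform`, `validate`, `load`) and the helpers `run_order` and `is_acyclic` are hypothetical, chosen only to show how declaring dependencies between tasks gives a valid execution order and how a cycle makes a workflow unschedulable.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline (illustration only): each task maps to the set of
# tasks that must finish before it can start.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}


def run_order(dag):
    """Return one valid execution order for the tasks in `dag`.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dag).static_order())


def is_acyclic(dag):
    """Return True if `dag` has no dependency cycle, False otherwise."""
    try:
        run_order(dag)
    except CycleError:
        return False
    return True


order = run_order(pipeline)
print(order)  # "extract" comes first and "load" comes last
print(is_acyclic({"a": {"b"}, "b": {"a"}}))  # a and b depend on each other
```

An orchestrator such as Airflow adds scheduling, retries, and monitoring on top of this kind of dependency graph. The "acyclic" requirement is what guarantees that a valid run order exists at all.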
Calvin talks about how a recent project outgrew the system and how his team built a clever solution using Python. We also discuss the upcoming Python Web Conference and what virtual attendees can expect.
Course Spotlight: Python Basics: Object-Oriented Programming
In this video course, you’ll get to know OOP, or object-oriented programming. You’ll learn how to create a class, use classes to create new objects, and instantiate classes with attributes.
Topics:
- 00:00:00 – Introduction
- 00:02:24 – Describing the large data pipeline
- 00:04:38 – What format was the data in?
- 00:06:04 – Was the format of the data changed for storage?
- 00:09:34 – Data engineering and describing sources and targets
- 00:11:29 – Apache Airflow orchestration and hitting limitations
- 00:18:12 – Sponsor: CData Software
- 00:18:54 – DAG: Directed acyclic graphs
- 00:22:29 – Streaming data and other tool choices
- 00:25:38 – Overcoming DAG Factory limitations
- 00:31:49 – Another industry example for Airflow
- 00:34:24 – Finding solutions as a consultancy
- 00:35:12 – Is there a minimum-size project for Airflow?
- 00:37:37 – Django under the hood
- 00:38:31 – Video Course Spotlight
- 00:39:58 – The Python Web Conference 2023
- 00:44:24 – Do you have any upcoming conference talks?
- 00:45:53 – How can people follow your work online?
- 00:46:52 – IndyPy talk by Mariatta Wijaya
- 00:48:01 – What are you excited about in the world of Python?
- 00:51:45 – What do you want to learn next?
- 00:53:22 – Thanks and goodbye
Show Links:
- Apache Airflow - Documentation
- Too Big for DAG Factories? — Six Feet Up
- Directed acyclic graph - Wikipedia
- DAGs — Airflow Documentation
- Dynamically generating DAGs in Airflow - Astronomer Documentation
- Data Lakehouse Architecture and AI Company - Databricks
- Episode #10: Python Job Hunting in a Pandemic – The Real Python Podcast
- Episode #124: Exploring Recursion in Python With Al Sweigart – The Real Python Podcast
- The Recursive Book of Recursion
- Episode #61: Scaling Data Science and Machine Learning Infrastructure Like Netflix – The Real Python Podcast
- IndyPy — Indiana Python User Group
- Contributing to Python - Mariatta Wijaya - Python Core Developer - YouTube
- Home Assistant
- Arturia - MicroFreak
- Arturia - Pigments
- CalvinHP (@calvinhp@fosstodon.org) - Fosstodon
- calvinhp - Twitter
- Six Feet Up - Blog
- Python Web Conference 2023