Episode 42: What Is Data Engineering and Researching 10 Million Jupyter Notebooks

The Real Python Podcast

Jan 08, 2021 55m

Are you familiar with the role data engineers play in the modern landscape of data science and Python? Data engineering is a sub-discipline that focuses on the transportation, transformation, and storage of data. This week on the show, David Amos is back, and he’s brought another batch of PyCoders Weekly articles and projects.

Along with the Real Python article on data engineering, we talk about a project where researchers downloaded 10 million Jupyter notebooks from GitHub to gather insights about the current state of data science technology.

We also discuss an article about validating data in Python with the Cerberus package, which led us to a conversation about the coding challenges from Advent of Code.

We also cover several other articles and projects from the Python community, including building your own chess engine, a visual guide to NumPy, a free and open-source alternative to SAP, a library for working with STL files and 3D objects, and the question of whether Python is really a bottleneck.

Topics:

  • 00:00:00 – Introduction
  • 00:01:51 – What Is Data Engineering and Is It Right for You?
  • 00:12:07 – Building My Own Chess Engine
  • 00:17:52 – We Downloaded 10,000,000 Jupyter Notebooks From Github: This Is What We Learned
  • 00:28:12 – Video Course Spotlight
  • 00:29:20 – Is Python Really a Bottleneck?
  • 00:34:01 – Validating Data in Python With Cerberus
  • 00:39:04 – NumPy Illustrated: The Visual Guide to NumPy
  • 00:42:54 – erpnext: Free and Open Source Alternative to SAP
  • 00:48:49 – numpy-stl: Library for Working With STL Files and 3D Objects
  • 00:54:54 – Thanks and goodbye

Show Links:

What Is Data Engineering and Is It Right for You? — In this article, you’ll get an overview of the discipline of data engineering. You’ll learn what is and isn’t part of a data engineer’s job, who data engineers work with, and why data engineers play a crucial role in many industries.

Building My Own Chess Engine — Writing your own chess engine is a great way to explore computational complexity and combinatorial aspects of programming. Not to mention it’s pretty fun! Follow along with this reflection on how one coder created his own chess engine from scratch.

We Downloaded 10,000,000 Jupyter Notebooks From Github: This Is What We Learned — The JetBrains Datalore team downloaded ten million Jupyter Notebooks and analyzed them to determine things like which languages were the most popular, what kinds of content are in notebook cells, and how consistently notebooks can be reproduced. It’s a fascinating look into trends in data science technology!

Is Python Really a Bottleneck? — Python is slow. From one perspective, that is. But what are the true bottlenecks in the data engineering/data processing space, and how does Python compare to other technologies when those factors are considered?

Validating Data in Python With Cerberus — Thanks to an Advent of Code challenge, author Hector Castro was exposed to the Cerberus Python package for data validation. Get a quick introduction to Cerberus and see Hector’s solution to an Advent of Code challenge in this quick-yet-informative read.

NumPy Illustrated: The Visual Guide to NumPy — This illustrated guide to NumPy is a great way to learn NumPy or brush up on the package. Full of great visual aids, this tutorial covers all the basics and more!

Projects:

Additional Links: