Episode 135: Preparing Data to Measure True Machine Learning Model Performance

The Real Python Podcast

Dec 02, 2022 57m

How do you prepare a dataset for machine learning (ML)? How do you go beyond cleaning the data and move toward measuring how the model performs? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, returns to talk about strategies for better ML model performance.

Jodie starts by defining some terms for the conversation. We talk about targets, features, and supervised learning.

We discuss three common ways that data can alter model performance and which Python tools can help spot and avoid them. Jodie shares personal experiences of working through these pitfalls. We also share a healthy collection of resources to explore and learn more.

Topics:

  • 00:00:00 – Introduction
  • 00:01:46 – Recent conference talks
  • 00:03:24 – How to prepare your data for model performance
  • 00:04:24 – Vocabulary: target, features, and supervised learning
  • 00:06:28 – The curse of dimensionality
  • 00:08:57 – Overfitting
  • 00:11:08 – Underfitting
  • 00:12:11 – Splitting the dataset
  • 00:13:39 – K-fold cross-validation
  • 00:18:30 – Data leakage
  • 00:21:36 – Checking for duplicates
  • 00:26:23 – Applying transformations only after splitting data
  • 00:31:16 – Imbalanced data
  • 00:36:36 – Using ML to balance data
  • 00:41:05 – Informing your model of the imbalance
  • 00:42:56 – Video Course Spotlight
  • 00:44:20 – Accuracy used as a measure
  • 00:49:05 – Scikit-learn function classification_report
  • 00:50:43 – JetBrains blog post and conference talk
  • 00:52:18 – How can people follow your work online?
  • 00:54:39 – Upcoming webinars
  • 00:56:20 – Thanks and goodbye
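
To make the "Imbalanced data" and "Accuracy used as a measure" topics concrete, here is a minimal sketch in plain Python. It is not code from the episode: the labels are made up (95 negatives, 5 positives), the helper names `accuracy` and `per_class_metrics` are hypothetical, and the "model" simply predicts the majority class every time. It shows how accuracy can look strong while the minority class is never detected. Per-class precision and recall are the kind of figures that scikit-learn's `classification_report` summarizes.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall) for one label, using 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up, heavily imbalanced labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A stand-in "model" that always predicts the majority class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

acc = accuracy(y_true, y_pred)
precision_pos, recall_pos = per_class_metrics(y_true, y_pred, label=1)

print(f"accuracy: {acc:.2f}")
print(f"minority precision: {precision_pos:.2f}, recall: {recall_pos:.2f}")
```

The accuracy here only reflects the class balance, which is why looking at per-class metrics alongside it gives a truer picture of model performance.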

Show Links: