Data Cleaning With pandas and NumPy (Overview)

Data scientists spend a large share of their time cleaning datasets so that they’re easier to work with. A commonly cited rule of thumb, often framed as an 80/20 rule, holds that the initial steps of obtaining and cleaning data account for about 80% of the time spent on a project.

So, if you’re just stepping into this field or planning to, it’s important to be able to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers.

In this video course, you’ll leverage Python’s pandas and NumPy libraries to clean data.

Along the way, you’ll learn about:

  • Dropping unnecessary columns in a DataFrame
  • Changing the index of a DataFrame
  • Using .str methods to clean columns
  • Renaming columns to a more recognizable set of labels
  • Skipping unnecessary rows in a CSV file
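
As a rough preview, the five techniques listed above might look something like the sketch below. The CSV text, the column names, and the cleaning rules are all made up for illustration and are not taken from the course datasets.

```python
import io

import pandas as pd

# Hypothetical CSV text: one junk line, then a header row with terse labels.
raw_csv = """exported from a made-up system
id,nm,yr,notes
1, london ,1879 [1878],n/a
2, oxford ,1868,n/a
3, york ,1869?,n/a
"""

# Skip the unnecessary first row while reading the CSV.
df = pd.read_csv(io.StringIO(raw_csv), skiprows=1)

# Drop a column that carries no useful information.
df = df.drop(columns=["notes"])

# Rename terse labels to a more recognizable set.
df = df.rename(columns={"id": "Identifier", "nm": "Place", "yr": "Year"})

# Use the unique identifier as the index.
df = df.set_index("Identifier")

# Clean text columns through the .str accessor.
df["Place"] = df["Place"].str.strip().str.title()
df["Year"] = df["Year"].str.extract(r"^(\d{4})", expand=False).astype(int)

print(df)
```

Each step reassigns df rather than modifying it in place, which keeps the steps easy to reorder or comment out while you explore.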

To get the most out of this video course, you should have a basic understanding of the pandas and NumPy libraries, including pandas’ workhorse Series and DataFrame objects, common methods that can be applied to these objects, and NumPy’s NaN values.
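
If NaN values are unfamiliar, the short sketch below may help as a refresher. The Series contents are invented for illustration.

```python
import numpy as np
import pandas as pd

# A made-up Series with one missing value.
s = pd.Series([1.5, np.nan, 3.0])

# NaN never compares equal to itself, so == is not a reliable test for it.
nan_equals_itself = np.nan == np.nan

# isna() is the usual way to find missing values in pandas.
missing_mask = s.isna()
missing_count = int(missing_mask.sum())

# Reductions such as mean() skip NaN by default.
mean_without_nan = s.mean()

print(nan_equals_itself, missing_count, mean_without_nan)
```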



00:00 In this course, you are going to be exploring data cleaning with pandas. Data cleaning is one of the first things you need to do with any dataset. With a library such as pandas, where you have hundreds of functions, methods, and options which you can pass to those functions and methods, it can be a bit overwhelming to get started.

00:20 Sometimes it can be helpful to see someone actually cleaning some data. So in this course, you’ll be looking at cleaning three different datasets. You’ll take them on one by one.

00:31 Each dataset you go through will need more cleaning than the last one. You can take on this course without much pandas experience. You’ll be taken through how to get set up for data cleaning, all the way from setting up your virtual environment to setting up Visual Studio Code with the Jupyter extension.

00:51 This will provide you with a really productive work environment where you can build, piece by piece, a reusable data cleaning script without losing out on the interactivity of something like a Jupyter notebook.

01:05 The first dataset you’ll be looking at is a CSV file of Olympic data. It catalogs the number of medals different countries have won in different Olympic games.

01:14 This data doesn’t need much cleaning, but it’ll be a great example to see how you can set up your reusable data cleaning script and start to explore pandas and your data.

01:27 The second dataset is a list of different towns that have universities in them. However, it’s not a CSV file. It’s just a plain text file, and the format it’s in doesn’t lend itself very well to tabular data—that is, two-dimensional data, like what you’d find in a table. Tabular data is what pandas is good at.

01:47 Taking on these three different datasets one at a time will give you a chance to put into practice what you’ve learned from the previous dataset and be exposed to new and increasingly complex techniques.

02:00 The last and dirtiest dataset that you’ll have to deal with is the books dataset. This includes fields such as the date of publication and the place of publication, which hold very useful information, but in a very inconsistent format.

02:15 Following along and building out the data cleaning script step by step will provide you with some great techniques for how to structure complex and multiple data cleaning operations, all the while slowly building your knowledge of the pandas API.

02:33 In the next lesson, you’ll be setting up your work environment. So clear up some space on your desktop and open up your command-line application to get into it.

dlautz on June 26, 2022

Great course! Thanks so much for putting it together.

Ian RP Team on June 27, 2022

Thanks, dlautz! Very glad you got something from it :)
