Setting Up for Cleaning

Ian Currie

Data Cleaning With pandas and NumPy Ian Currie 07:49

Throughout the course, you’ll want to keep the pandas documentation handy.


00:00 You’ve set up your project, you’ve set up VS Code, and you’ve had an initial look at the data. The last setup step is to write some very basic boilerplate code.

00:11 This code will eventually become your cleanup script. Once you have your boilerplate code in place, you’ll be able to load the data into a DataFrame. You’ll also revise what a DataFrame is. Now that your data is in a DataFrame, you’ll do some initial data exploration and look at some of the pandas methods you can use for it.

00:34 You’ll want to open the olympics.py file, and to make sure you’ve set up everything correctly, write a comment with two percent signs (%%), and you should see Run Cell.

00:44 Write some very simple code to test this out. You can press Control + Enter to run this, or you can click on Run Cell at the top left. This should open another window, which will connect to the virtual environment,

01:04 and it should print, and the output should be visible below the cell that has just been executed. Now you’re ready to start writing your cleanup script. The first thing you want to do is to import pandas.
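The cell check described above could look something like this minimal sketch. The message text is made up for illustration, and any simple statement would do. The `# %%` comment is what marks the start of a cell, and it's what makes Run Cell appear in VS Code.

```python
# %%
# The "# %%" comment above marks the start of a runnable cell.
# The message below is a placeholder chosen for this sketch.
message = "Hello from the interactive window"
print(message)
```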

01:22 You can import pandas as it is, but since you’ll be using it a lot, you’ll usually abbreviate it as pd. To learn more about imports, check out some of the links below.

01:32 Now you’ll want all of this, the reading of the file and the data cleaning, to be within one function. This is going to be your cleanup function.

01:43 So start by defining a function. Right now, you’re just going to write pass as a placeholder, so the function does nothing. Then below, you’ll call that function, read(), and you’ll assign the result to a variable.

01:59 So read() will execute, and then pass the cleaned up data back to olympics. This is the general workflow you’ll want to follow.
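At this stage, the boilerplate might look roughly like the sketch below. The names read and olympics come from the lesson, while the exact layout is an assumption. Because the body is only pass, the function returns None for now.

```python
# %%
import pandas as pd


def read():
    # Placeholder body: the function does nothing yet,
    # so calling it returns None.
    pass


# Call the function and keep whatever it returns.
olympics = read()
print(olympics)
```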

02:07 You’ll work on filling out the read() function, which will clean up your data. You’ll run this cell, and you’ll see the output here. You can also explore that resulting data by running things here,

02:27 without having to write that in your main script. You want to keep this main script on the left for your pure cleaning code, and then all your exploration you can do on the right here as you would in a Jupyter notebook.

02:42 One of the best resources that you can use is the docs. The site at pandas.pydata.org/docs/ is a great source for everything pandas. The pandas API is huge.

02:55 There’s so many methods and properties and ways of doing things that it’ll make your head spin. This is why the documentation is important to have close at hand at all times.

03:05 It’s impossible for any one person to memorize all the API. You’ll always be having to consult this back and forth as you’re developing your pandas script. As you get more experienced, you’ll be able to remember your most common methods, but you’ll never escape the documentation really.

03:23 So one of the first things you can do here is to search for a way to read CSV files. The search bar is great for that. So just type csv in here and press Enter, and it’ll be searching. Down below here, you’ll see some of the results.

03:44 And here looks like the one that you’re looking for: pandas.read_csv. So click that. Here is the method name at the top. Here is the definition with all the possible different options you can pass in.

04:01 As you can see, there are loads. Again, don’t try and remember all of these. It’s impossible. What you’re looking for is just the basic parameters here, so the file path.

04:14 At the bottom, you’ll usually find some examples. Here are some basic examples. So you can just call the method and pass in a string, which represents the path of the CSV file that you’re looking for.

04:31 With that information, you know that you can call pandas, or pd, as has been imported here, and here it immediately suggests read_csv.

04:45 The path here is data-sets/. That’s our folder. And then olympics.csv.

04:54 Now you run this to make sure that it runs well.

05:00 And here you can see that it’s abbreviated the code, and it’s just showing you the first line, and the tick is showing you that that worked. However, you’ve not assigned this.

05:09 You’ve not returned anything here.

05:14 Put in return. Now this will return a DataFrame, and it will assign it to olympics. Clear the interactive window by pressing the button up here on the top left of that tab, Clear All. And then you can run the cell again, which was successful.
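Here’s a minimal sketch of what read() could look like once the return is in place. The path parameter and the tiny stand-in CSV are assumptions added so the example runs anywhere. In the lesson, the path data-sets/olympics.csv is written directly inside the function, and the rows below are made up rather than taken from the real file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read(path="data-sets/olympics.csv"):
    # The path parameter is an addition for this sketch;
    # the lesson hard-codes the path inside the function.
    # Without "return", the DataFrame would be discarded.
    return pd.read_csv(path)


# Write a small stand-in CSV with made-up rows, then load it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("country,gold\nAAA,1\nBBB,2\n")
    olympics = read(sample)

print(type(olympics))
print(olympics.shape)
```

In your own script, you’d call read() with no arguments so that it picks up the real file.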

05:31 And now one of the things you can do is look at what’s happened here. Now you can run this with the play (▷) button here, or you can press Shift + Enter.

05:44 And as you can see, there is data.

05:50 One of the main methods you can use to summarize your data, check that you actually have some, and look at the first part of it is .head(). You can call the .head() method on any DataFrame or series, so here you call it on olympics.

06:09 And here you can see, it just shows you the first five entries. There are also other methods like .tail(), which will show you the last entries.
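As a quick sketch of .head() and .tail(), here’s a small made-up DataFrame standing in for olympics. Both methods also accept an optional number of rows, which defaults to five.

```python
import pandas as pd

# Made-up stand-in data: ten rows numbered 1 through 10.
df = pd.DataFrame({"n": range(1, 11)})

first = df.head()   # first five rows by default
last = df.tail(3)   # last three rows

print(first)
print(last)
```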

06:22 You’ve loaded some data, and it’s been loaded into a DataFrame. What is a DataFrame? Essentially, it’s a way to represent a table. And that’s two-dimensional data, that is, rows and columns, or a list of lists that’s only two levels deep.
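To make the list-of-lists idea concrete, here’s a small sketch that builds a DataFrame from nested lists. The values and column names are made up for illustration.

```python
import pandas as pd

# A list of lists, two levels deep: each inner list is one row.
rows = [["x", 1], ["y", 2], ["z", 3]]
table = pd.DataFrame(rows, columns=["label", "value"])

print(table.shape)  # (3, 2): three rows, two columns
print(table.ndim)   # 2: the data is two-dimensional
```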

06:39 This is pretty much the basis of pandas. pandas is good at working with this kind of data in a table. So if you’re working with data from Excel or any kind of spreadsheet program, pandas is very, very good at handling it. As you’ve seen, it has a very rich API. There are many, many methods that you can use to manipulate your data, using DataFrames, series, and all sorts of other stuff.

07:04 The DataFrame is the main object you’ll be working with. You’ll also be working a lot with series, which are sort of one-dimensional data. That is, it’s like taking a column of your table or a row of your table.
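Here’s a rough sketch of pulling a single column or row out of a DataFrame, using made-up values. Selecting a column by name or a row by position with .iloc gives you a one-dimensional series.

```python
import pandas as pd

# Made-up two-by-two table.
df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])

column = df["a"]   # one column, taken in isolation
row = df.iloc[0]   # one row, selected by position

print(type(column).__name__, column.ndim)
print(type(row).__name__, row.ndim)
```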

07:16 And if you deal with that in isolation, you’ll usually be working with a series. More on that later. So that was setting up for cleaning: you’ve set up your boilerplate code, you’ve loaded data into a DataFrame, you’ve seen what a DataFrame is, and you’ve done some initial data exploration with the .head() method, for instance, or the .tail() method.

07:38 So now with all the setup out of the way, you’re ready to get stuck in to cleaning some data, which you’ll get started on in the next lesson by renaming some headers.
