Exploring the University Towns Dataset

Ian Currie

Data Cleaning With pandas and NumPy Ian Currie 02:22

Transcript

00:00 In this lesson, you’ll get an introductory look at the university towns dataset, getting an idea of what needs to be done here to clean it.

00:10 For this dataset, imagine yourself in a scenario: you are a working data engineer, and one day you’re given a plain text document. So it’s not CSV, it’s not structured, it’s just plain text.

00:23 Your boss says it can’t be loaded directly into a DataFrame. They’ve tried, but it wasn’t successful. You’ve been asked to make a CSV file with the data, and your boss says not to clean it, just divide it into columns. Okay, so the first step is to take a look at the data that you’re dealing with.

00:44 And here it is in all its glory. As you can see, there are just values divided by newlines. Well, that’s promising. At least there are newlines that you can divide things on.

00:56 However, it’s not immediately clear what columns you’re meant to put things in. Take this line by line, starting with Alabama. Okay, so that’s a state, and it has this [edit] in there. And then by highlighting it, I can see that there’s a bunch of other ones that have [edit] next to them: Alaska, Arizona, Arkansas, California, Colorado. These all seem like states. Okay.
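The observation that state lines carry an [edit] marker can be sketched in Python. This is only an illustration under stated assumptions: the sample lines below are hypothetical stand-ins for the text file, and `is_state_line` is a made-up helper name, not something from the lesson.

```python
# Hypothetical sample lines that mimic the layout described above:
# state lines carry "[edit]", the other lines hold a town and a university.
sample_lines = [
    "Alabama[edit]",
    "Auburn (Auburn University)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)",
]


def is_state_line(line: str) -> bool:
    """Return True when the line contains the [edit] marker."""
    return "[edit]" in line


# Collect the state names, dropping the marker only for display purposes.
states = [
    line.replace("[edit]", "").strip()
    for line in sample_lines
    if is_state_line(line)
]
print(states)
```

Checking for the marker with `in` rather than `endswith` is a deliberate choice in this sketch, in case a line has trailing characters after [edit].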

01:23 And then afterwards, you’ve got a town

01:28 and a university usually after them. So yes, almost all of the lines that don’t have [edit] in them have a university. So the columns that you’ll want to make are, say, one column for the state and one column for the town.

01:47 In practice, that means the first rows would be Alabama | Auburn, Alabama | Florence, Alabama | Jacksonville, Alabama | Livingston, and so on until it gets to Alaska, where it would be Alaska | Fairbanks.
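One way to get rows of that shape is to remember the most recent state line and pair it with every town line that follows. The sketch below is a rough illustration, not the course's actual solution: the sample lines, the `pair_states_with_towns` helper, and the `State` and `Town` column names are all assumptions. In keeping with the brief, it only divides the data into columns and leaves the values uncleaned, so the [edit] marker and the university text stay in place.

```python
import csv
import io

# Hypothetical sample lines that mimic the layout described in the lesson.
sample_lines = [
    "Alabama[edit]",
    "Auburn (Auburn University)",
    "Florence (University of North Alabama)",
    "Jacksonville (Jacksonville State University)",
    "Livingston (University of West Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)",
]


def pair_states_with_towns(lines):
    """Pair each town line with the most recent state line seen."""
    rows = []
    current_state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if "[edit]" in line:
            current_state = line  # a new state section starts here
        else:
            rows.append((current_state, line))
    return rows


rows = pair_states_with_towns(sample_lines)

# Write the rows as CSV into an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["State", "Town"])  # assumed column names
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Swapping the `io.StringIO` buffer for a file opened with `newline=""` would produce the CSV file that the boss asked for.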

02:01 This seems like the way that this data will need to be processed.

02:07 Now that you’re somewhat familiar with the university towns dataset and what needs to be done, in the next lesson, you’ll look at processing the data before loading it into a DataFrame, because as it is, this data cannot be loaded directly into a DataFrame.
