Processing Data Before Loading a DataFrame
00:00 In this lesson, you’ll take the university towns dataset as it is, and you’ll process the data so that you can load it into a DataFrame. Now you have a rough idea of how the data’s structured and what kind of columns you want. At the moment, loading it straight into a DataFrame doesn’t really work, or at least it does work, but it just ends up as one column with lots of rows, which isn’t structured at all.
00:23 You want two columns, and you need each state to be copied to every one of its towns. The first thing to aim for is a basic list or dictionary that has this sort of shape.
00:35 You could have a two-dimensional list or a dictionary with two keys representing the columns, each with a list of the values in those columns. All the states have [edit] next to them in square brackets, and this can be used as a marker in a for loop when you’re processing the data.
00:53 Every line without the [edit] is a university town, so you can use this string, "[edit]", as a sort of marker for your for loop to construct this data structure. With this information, you can build a table with states in one column and towns in the other.
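As a rough illustration of the two shapes described above, here is a minimal sketch. The sample rows are made up to mimic the file's format, and the variable names are hypothetical rather than taken from the lesson.

```python
# Option 1: a two-dimensional list, one [state, town] pair per row.
towns_as_rows = [
    ["Alabama[edit]", "Auburn (Auburn University)"],
    ["Alabama[edit]", "Florence (University of North Alabama)"],
    ["Alaska[edit]", "Fairbanks (University of Alaska Fairbanks)"],
]

# Option 2: a dictionary with one key per column,
# each holding a list of that column's values.
towns_as_columns = {
    "state": [row[0] for row in towns_as_rows],
    "town": [row[1] for row in towns_as_rows],
}

print(towns_as_columns["state"])
```

Either shape carries the same information; the list of rows is the one that falls out most naturally from looping over the file line by line.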
01:13 How do we process this data? Let’s set up our boilerplate first.
01:23 Let’s start off with our data structure, which is going to be an empty list, and you’re going to need to open the file with the open() function.
01:36 And then we’ll take this as file.
01:43 We’ll explicitly pass in the mode as read ("r"). Then let’s say print(file.readlines()) to see if that has been successful. Press Shift + Enter or Control + Enter to run this.
02:03 Working … and yes, that looks right. So it’s started a list and one line is the state with the [edit] and then the other line’s … okay, so each line is a different element. So that’s been successful.
02:20 So now you’ll want to start a for loop, which you can do like this directly. That will automatically give you all the lines in the file. You don’t have to call .readlines() on that, although you can if you want.
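The two ways of reading the file mentioned so far can be compared in a small sketch. To keep it self-contained, it first writes a few made-up sample lines to a temporary file; the filename university_towns.txt is an assumption, since the transcript doesn't name the file.

```python
import os
import tempfile

# Illustrative sample in the same format as the lesson's file (not the real data).
sample = (
    "Alabama[edit]\n"
    "Auburn (Auburn University)\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)\n"
)

# Hypothetical filename; the lesson's actual path may differ.
path = os.path.join(tempfile.mkdtemp(), "university_towns.txt")
with open(path, "w") as file:
    file.write(sample)

# .readlines() returns every line as a list element, newline included.
with open(path, "r") as file:
    all_lines = file.readlines()

# Looping over the file object directly yields the same lines one at a time.
with open(path, "r") as file:
    looped_lines = [line for line in file]

print(all_lines)
print(all_lines == looped_lines)
```

Iterating over the file object avoids building the whole list up front, which is why the explicit .readlines() call is optional here.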
02:33 And then to check if there’s the string in it, you can just write if, the target string,
02:43 using the in keyword, line, then we will do something. Otherwise, we will do something else. We’ll want to be setting the state as the sort of local state of the application. Don’t get confused by the naming here.
02:59 You’re saving the name of the state into a variable here so that you can create the sublist to put into towns that will represent the state and university.
03:12 So here we want something to hold that variable. So we’ll just put state, and then, here, state = line. We’ll just put the whole line in there.
03:26 Remember, we’re not being asked to clean it just yet, but all the same, we’re just going to strip this, which means clearing out the whitespace from the start and the end. That’s pretty typical, since the whitespace isn’t going to carry much information that anyone needs, so we’ll allow ourselves this little bit of cleaning.
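For reference, a tiny sketch of what stripping does to a single line. The sample string is made up, with extra leading spaces added to show that both ends are trimmed.

```python
# A line as it might come out of the file, with a trailing newline.
line = "  Alabama[edit]\n"

# .strip() removes whitespace, including the newline, from both ends only.
state = line.strip()

print(repr(line))
print(repr(state))
```

Note that the "[edit]" marker is untouched, because .strip() with no arguments only removes whitespace.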
03:47 We won’t be removing "[edit]" from this just yet. And then what you want to do here is to create the list that will contain the state and the line, because if the line doesn’t have "[edit]", then you’ll want to just put in the state as you have it, and then the line.
04:05 And you’ll want to append this new list to the towns list.
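Putting the steps so far together, the loop might look something like the following sketch. The function name parse_towns is hypothetical, the sample lines are made up, and an io.StringIO object stands in for the opened file so the example runs on its own.

```python
import io


def parse_towns(lines):
    """Pair each town line with the most recent state line."""
    towns = []
    state = None  # holds the current state until the next "[edit]" line
    for line in lines:
        if "[edit]" in line:
            # A state line: remember it, keeping the "[edit]" marker for now.
            state = line.strip()
        else:
            # A town line. Stripping it as well is an assumption of this
            # sketch; the lesson only mentions stripping the state.
            towns.append([state, line.strip()])
    return towns


# Made-up sample in the file's format; StringIO iterates line by line
# just like an open file does.
sample = io.StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)\n"
    "Florence (University of North Alabama)\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)\n"
)

towns = parse_towns(sample)
print(towns)
```

Because state is only reassigned when a new "[edit]" line appears, every town line in between is paired with the same state.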
04:14 So now let’s return the towns, save this, run it with Control + Enter or Shift + Enter. And then let’s have a look at towns.
04:32 Let’s look at it in a text editor because there’s quite a lot here. And as you can see, you’ve got your outer list and then you’ve got your inner lists, and each inner list has two items. You’ve got the state, Alabama, here and all the towns as the second elements.
04:51 Okay, so now what you can do is to see if that works in the DataFrame. So return pd.DataFrame(). We’ll pass it in the list as it is. And we’ll also say the columns are "state" and "town".
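As a minimal sketch of this step, assuming pandas is installed and imported as pd, the list of pairs can be passed to pd.DataFrame() together with the column names. The rows below are made-up sample data, not the real dataset.

```python
import pandas as pd

# Made-up sample rows in the [state, town] shape built by the loop.
towns = [
    ["Alabama[edit]", "Auburn (Auburn University)"],
    ["Alabama[edit]", "Florence (University of North Alabama)"],
    ["Alaska[edit]", "Fairbanks (University of Alaska Fairbanks)"],
]

# Each inner list becomes one row; columns= names the two columns.
df = pd.DataFrame(towns, columns=["state", "town"])

print(df.head())
```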
05:14 Save that. Let’s Control + Enter to run it. Go to the bottom here to see if that was successful. Okay, good. And now let’s look at towns.head() …
05:30 and yes, that’s working. That successfully goes into a DataFrame now. That’s done. What you could do here, once you have that DataFrame, is call .to_csv(). Put in the filename here. Let’s call it "towns.csv".
05:50 And then run this. That seems to have worked, and here it is. As you can see, it’s included the index, but for now, that looks good. Now you can pass this CSV back to your boss, and he’ll be able to load this directly into a DataFrame.
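The effect of the index on the CSV output can be seen in a small sketch, assuming pandas is installed. Here .to_csv() is called without a filename so that it returns the CSV text instead of writing towns.csv, and the sample rows are made up. Passing index=False is an optional variation, not something the lesson does.

```python
import io

import pandas as pd

df = pd.DataFrame(
    [
        ["Alabama[edit]", "Auburn (Auburn University)"],
        ["Alaska[edit]", "Fairbanks (University of Alaska Fairbanks)"],
    ],
    columns=["state", "town"],
)

# With no path, .to_csv() returns the CSV as a string.
csv_with_index = df.to_csv()                # default: index written as first column
csv_without_index = df.to_csv(index=False)  # optional: leave the index out

print(csv_with_index.splitlines()[0])
print(csv_without_index.splitlines()[0])

# Loading the CSV text back into a DataFrame, as a colleague might do.
loaded = pd.read_csv(io.StringIO(csv_without_index))
print(loaded.head())
```

With a filename such as "towns.csv" as the first argument, the same call writes to disk instead of returning a string.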
06:08 So you’ve processed that data for your colleagues, and you’ve loaded it into a DataFrame so that they can use it. Following their instructions, you haven’t cleaned it, but in the next lesson, they’re going to come back to you and ask you to clean it now that it’s in a DataFrame.