Using .assign()
00:00
In this lesson, you’ll be using .assign() to carry out a cleaning operation on your university towns data. The .assign() method is called on a DataFrame, and it’s used to assign new columns and overwrite existing ones.
00:14
It uses keyword arguments named after the columns, so if you have a column called state, then within the .assign() call you can pass a keyword argument called state= and give it a new column. You can also use a name that doesn’t exist yet, and that will create a new column.
00:34
You can also use a function or a lambda function to transform the data. To understand .assign() a bit better, take a look at this very simplified DataFrame. It’s very similar to what was used before, except this time it’s instantiated with a dictionary.
00:50
The keys represent the column names, and the values are lists, which represent the values that end up in the table. So you can call .assign() on the DataFrame directly, and then you can use a keyword argument that is the same as the column name.
01:10
And this will tell .assign() that you want to replace day with whatever is here. So you can simply pass a list, say 2 and 29, and that will return a DataFrame with the day column replaced with those values.
01:30
The list has to be the same length as the column, but there you go. Now this doesn’t modify data itself. It returns a new DataFrame with the new values.
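As a minimal sketch of the calls described so far, the snippet below builds a small stand-in DataFrame from a dictionary and uses .assign() with lists. The column names and values are assumptions made for illustration, not the data shown in the video.

```python
import pandas as pd

# Hypothetical toy DataFrame; names and values are assumptions for this sketch.
data = pd.DataFrame({"month": [3, 4], "day": [7, 21]})

# Overwrite the existing "day" column with a list of the same length.
replaced = data.assign(day=[2, 29])

# A keyword that is not an existing column name creates a new column.
extended = data.assign(year=[2020, 2021])

print(replaced)
print(extended)
print(data)  # the original DataFrame still holds its own "day" values
```

Each .assign() call returns a new DataFrame, so the result has to be stored in a variable if you want to keep it.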
01:41 So that’s just something to be aware of. Another thing you can do is to pass in a lambda function.
01:51
So it’s important to understand what’s going on with this statement, so we’re going to break it down. You’ve got your DataFrame here, data, and you’re calling the .assign() method on it. And to the .assign() method, you’re passing a keyword argument, day, and then you’re passing in the lambda function.
02:09
df here can be called whatever you like, but here it’s called df because it stands for DataFrame, and it’s because the data DataFrame is passed in to the lambda function as its argument. Back to the example … you’ve got the lambda function that’s getting the data DataFrame.
02:29
You’re using the DataFrame with the location indexer to grab all the rows for the day column, and you’re testing each value to see whether it’s over 9.
02:41
So you’re looking at the original DataFrame (the earlier call hasn’t modified it, remember), so this should be False, and this should be True.
02:49 So that should replace these two values, as it has. You can also define a function separately somewhere else, with a typical def statement, and then pass it in here directly.
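To make the breakdown above concrete, here’s a sketch that passes both a lambda and a separately defined function to .assign(). The stand-in DataFrame and the function name day_over_nine are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy DataFrame; names and values are assumptions for this sketch.
data = pd.DataFrame({"month": [3, 4], "day": [7, 21]})

# The lambda receives the DataFrame that .assign() was called on.
# The comparison is element-wise, so it returns a Boolean Series.
flagged = data.assign(day=lambda df: df.loc[:, "day"] > 9)


def day_over_nine(df):
    """Return a Boolean Series marking which days are greater than 9."""
    return df.loc[:, "day"] > 9


# A named function is passed without calling it; .assign() calls it for you.
flagged_again = data.assign(day=day_over_nine)

print(flagged)
print(flagged_again)
```

Both calls should produce the same result, since the named function does exactly what the lambda does.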
03:02
You just have to be sure that it accepts a DataFrame. All right. So back to the data … you’ve got everything here and you’re going to chain on an .assign() call here, and you’re going to be working with state and town, which are the columns, and within them, you’re going to use lambda functions.
03:26
lambda df will, for now, just select the "state" column from the DataFrame, okay,
03:39
and this one "town". So right now, all that’s happening is that it’s assigning the state column to the result of this lambda function. The lambda function is passed the DataFrame (the whole DataFrame, as it is at this stage), and then within the function, it selects the "state" column and returns that to state.
03:58
So it’s basically taking "state" and assigning it back to itself. So it should result in an unchanged DataFrame. We look over here.
04:11 As you can see, it’s the same as it was previously.
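The no-op step above might look something like this sketch. The two sample rows are invented stand-ins for the university towns data, and the variable names are assumptions.

```python
import pandas as pd

# Invented sample rows standing in for the university towns data.
towns = pd.DataFrame(
    {
        "state": ["Alabama[edit]", "Alaska[edit]"],
        "town": [
            "Auburn (Auburn University)[1]",
            "Fairbanks (University of Alaska Fairbanks)[2]",
        ],
    }
)

# Each lambda returns its column untouched, so nothing changes yet.
unchanged = towns.assign(
    state=lambda df: df["state"],
    town=lambda df: df["town"],
)

print(unchanged.equals(towns))
```

Starting from a version that changes nothing makes it easier to add one transformation at a time and check the result after each step.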
04:18
Okay. In here, you’ll be wanting to use a .str method, and within here, the IntelliSense won’t work. You can try it in here, but the IntelliSense isn’t perfect.
04:29 So sometimes the best thing to do is just to go to the documentation. After some searching, .removesuffix() seems to be the best method, as it’s very precise about what you’re trying to do here.
04:41
You’re trying to remove the suffix, and in this case, the suffix is the string "[edit]", square brackets included. It just seems to take a string, a normal string, no regex magic in here, just a normal string.
04:57
So we can put in .removesuffix(), and we’ll pass in the string that we want to remove and run this and see if it works. So now press Up (↑) to get the last command, Control + Enter to run this … and good, it seems that none of them contain [edit] anymore. Perfect.
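Here’s a rough sketch of that step, using invented sample rows rather than the real dataset. It assumes a pandas version that provides Series.str.removesuffix() (1.4 or later).

```python
import pandas as pd

# Invented sample rows standing in for the university towns data.
towns = pd.DataFrame(
    {
        "state": ["Alabama[edit]", "Alaska[edit]"],
        "town": [
            "Auburn (Auburn University)[1]",
            "Fairbanks (University of Alaska Fairbanks)[2]",
        ],
    }
)

# .str.removesuffix() takes a plain string, not a regex, and strips it
# only when it appears at the very end of a value.
cleaned = towns.assign(
    state=lambda df: df["state"].str.removesuffix("[edit]"),
)

print(cleaned["state"].tolist())
```

Because the argument is a literal string, the square brackets don’t need escaping the way they would in a regex.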
05:25
For the town column, you’ll just want to be keeping the town. You don’t want any of this stuff. So just the first part, before the space and the opening bracket, is what you want to include here.
05:39
You’re going to use this method, which is called str.extract(). According to the documentation, it extracts capture groups in the regex pat (pattern) as columns in a DataFrame.
05:50 So basically, it goes through each item in the series and extracts a pattern that you want.
05:59
So back here in the cleanup script, str.extract(), we’ll start with our regex pattern.
06:15
Okay, so what is this doing? The r prefix is just a way to say that this string is a raw string, so that Python treats the backslashes in it literally instead of as escape sequences. It’s usually used for regex. Here, you’re starting your first capture group in the regex with these opening and closing parentheses.
06:32
This is the first capture group, and it’s the only capture group that is defined, so this is the only thing that will be returned, and inside it you’re using any character, and any number of any character. Then there’s going to be a space, and you’re literally matching the first opening bracket, (.
06:48
If you look back at the data here, every town is followed by a space and an opening bracket, so the pattern will be able to extract the town. And then .extract() will return the first capture group as a series.
07:13 And as you can see, you have successfully cleaned up this data. Well done.
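The pattern typed in the video isn’t reproduced in this transcript, so the one below is an assumption that matches the description: one capture group, followed by a space and an escaped opening bracket. The sample rows are invented, the lazy quantifier .*? is a choice made here so that matching stops at the first bracket, and expand=False is passed so that .str.extract() returns a Series rather than a DataFrame.

```python
import pandas as pd

# Invented sample rows standing in for the university towns data.
towns = pd.DataFrame(
    {
        "state": ["Alabama[edit]", "Alaska[edit]"],
        "town": [
            "Auburn (Auburn University)[1]",
            "Fairbanks (University of Alaska Fairbanks)[2]",
        ],
    }
)

# Assumed pattern: capture everything up to the first " (".
# The raw string keeps the backslash in \( intact for the regex engine.
TOWN_PATTERN = r"(.*?) \("

cleaned = towns.assign(
    state=lambda df: df["state"].str.removesuffix("[edit]"),
    town=lambda df: df["town"].str.extract(TOWN_PATTERN, expand=False),
)

print(cleaned)
```

Only the text inside the capture group is kept, so the space and the bracket that follow the town name are matched but not returned.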
07:20 In this lesson, you’ve used the .assign() method to carry out some more complex cleaning operations on your dataset. In the next lesson, you’ll be moving on to the third and final dataset of this course, the books dataset.