In this lesson, you’ll be using .assign() to carry out a cleaning operation on your university towns data. The .assign() method is called on a DataFrame, and it’s used to assign new columns and overwrite existing ones.
It uses keyword arguments that match column names, so if you have a column called state, then within the .assign() call, you can use a keyword argument called state= and pass it a new column. You can also use a name that doesn’t exist yet, and that will create a new column.
You can also use a function or a lambda function to transform the data. To understand .assign() a bit better, take a look at this very simplified DataFrame. It’s very similar to what was used before, except this time it’s instantiated with a dictionary.
The keys represent the column names, and the values are lists, which represent the values that end up in the table. So you can call .assign() on the DataFrame directly, and then you can use a keyword argument that is the same as the column name.
And this will tell .assign() that you want to replace day with whatever you pass here. So you can simply pass a list, say we’ll make the values 29, and that will return a DataFrame with the day column replaced with those values.
The list has to have the same number of values as the DataFrame has rows, but there you go. Now this doesn’t modify data itself. It returns a new DataFrame with the new values.
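To make that concrete, here is a minimal sketch of overwriting a column, creating a new one, and checking that the original is left alone. The column values and the year column are made up for illustration, since the transcript doesn’t show the lesson’s actual data.

```python
import pandas as pd

# Hypothetical sample values: the lesson's real numbers aren't in the transcript.
data = pd.DataFrame({"month": [1, 2], "day": [27, 30]})

# Overwrite an existing column by using its name as the keyword argument.
replaced = data.assign(day=[29, 29])

# Use a name that doesn't exist yet to create a new column (illustrative name).
extended = data.assign(year=[2023, 2024])

print(replaced)
print(extended)
print(data)  # the original DataFrame still holds its original values
```

Because .assign() returns a new DataFrame, you’d need to bind the result to a name, or keep chaining, to hold on to the change.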
01:41 So that’s just something to be aware of. Another thing you can do is to pass in a lambda function.
It’s important to understand what’s going on with this statement, so we’re going to break it down. You’ve got your DataFrame here, data, and you’re calling the .assign() method on it. To the .assign() method, you’re passing a keyword argument, day, and its value is a lambda function. The df here can be called whatever you like, but here it’s called df because it stands for DataFrame, and that’s because the data DataFrame is passed in to the lambda function as its argument. Back to the example … you’ve got the lambda function that’s getting the DataFrame. You’re using the DataFrame with the location indexer to grab all the rows for the day column, and you’re testing each value to see whether it’s over a threshold. You’re looking at the original DataFrame, because the earlier call didn’t modify it, remember, so this should be False, and this should be True.
02:49 So that should replace these two values, as it has. You can also define a function separately somewhere else, with a typical def statement, and then pass it in here directly.
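Here’s a rough sketch of both approaches, the lambda and the separately defined function. The sample values, the THRESHOLD constant, and the function name day_over_threshold are assumptions for illustration and not taken from the lesson.

```python
import pandas as pd

# Hypothetical sample values and threshold, chosen only for illustration.
data = pd.DataFrame({"month": [1, 2], "day": [27, 30]})
THRESHOLD = 28

# .assign() calls the lambda with the DataFrame it was called on (here, data).
flagged = data.assign(day=lambda df: df.loc[:, "day"] > THRESHOLD)


def day_over_threshold(df: pd.DataFrame) -> pd.Series:
    """Return a Boolean Series marking days above the threshold."""
    return df.loc[:, "day"] > THRESHOLD


# A named function works the same way: pass it without calling it.
flagged_again = data.assign(day=day_over_threshold)

print(flagged)
print(flagged_again)
```

Both calls produce the same result, so choosing between a lambda and a named function is mostly a matter of readability.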
You just have to be sure that it accepts a DataFrame. All right. So back to the data … you’ve got everything here and you’re going to chain on an .assign() call here, and you’re going to be working with state and town. Those are the columns, and for each of them, you’re going to use a lambda function. For now, lambda df will just select the "state" column from the DataFrame, okay, and this one "town". So right now, all that’s happening is that it’s assigning the state column to the result of this lambda function. The lambda function is passed the DataFrame, the whole DataFrame as it is at this stage, and then within the function, it selects the "state" column and returns that to state.
So it’s basically taking "state" and assigning it back to itself, and the same goes for "town". So it should result in an unchanged DataFrame. We look over here.
04:11 As you can see, it’s the same as it was previously.
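As a sketch of that placeholder step, the following assigns each column back to itself. The towns DataFrame and its rows are invented stand-ins for the university towns data, which isn’t shown in the transcript.

```python
import pandas as pd

# Hypothetical stand-in rows for the university towns data.
towns = pd.DataFrame(
    {
        "state": ["Alabama[edit]", "Alabama[edit]"],
        "town": ["Auburn (Auburn University)[1]", "Florence (University of North Alabama)"],
    }
)

# Each lambda receives the whole DataFrame and returns one of its columns as is,
# so the result has the same contents as the input.
unchanged = towns.assign(
    state=lambda df: df["state"],
    town=lambda df: df["town"],
)

print(unchanged.equals(towns))
```

Starting from a no-op like this gives you a working skeleton to fill in one column at a time.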
Okay. In here, you’ll be wanting to use a str method, and within here, the IntelliSense won’t work. You can try it in here, but the IntelliSense isn’t perfect.
04:29 So sometimes the best thing to do is just to go to the documentation. After some searching, .removesuffix() seems to be the best method, as it’s very precise about what you’re trying to do here.
You’re trying to remove the suffix, and in this case, the suffix is the string in square brackets. The method just takes a string, a normal string, no regex magic in here.
So we can put in .removesuffix(), and we’ll pass in the string that we want to remove and run this and see if it works. So now press up (↑) to get the last command, Control + Enter to run this … and good, it seems that none of them contain the suffix anymore. Perfect.
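A minimal sketch of that step might look like the following. The "[edit]" suffix and the sample rows are assumptions, since the transcript doesn’t spell out the actual string, and Series.str.removesuffix() needs pandas 1.4 or later.

```python
import pandas as pd

# Hypothetical rows; "[edit]" is an assumed suffix, not confirmed by the transcript.
towns = pd.DataFrame(
    {
        "state": ["Alabama[edit]", "Alaska[edit]"],
        "town": ["Auburn (Auburn University)[1]", "Fairbanks (University of Alaska Fairbanks)[2]"],
    }
)

# removesuffix() takes a plain string, not a regex, and only strips it from the end.
cleaned = towns.assign(state=lambda df: df["state"].str.removesuffix("[edit]"))

print(cleaned["state"])
```

Values that don’t end with the given suffix are left as they are, which makes this safer than a blanket replace.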
For the town column, you’ll just want to keep the town. You don’t want any of this other stuff. So just the first part before any kind of space is what you want to include here.
You’re going to use this method, which is called str.extract(). It extracts capture groups in the regex pat (pattern) here as columns in a DataFrame.
05:50 So basically, it goes through each item in the series and extracts the pattern that you want.
So back here in the cleanup script, with str.extract(), we’ll start with our regex pattern.
Okay, so what is this doing? The r is just a way to say that this string is a raw string, so that Python will interpret the backslashes in it literally. That’s usually used for regex. Here, you’re starting your first capture group in the regex with these opening and closing parentheses.
This is the first capture group, and it’s the only capture group that is defined, so this is the only thing that will be returned, and inside it you’re matching any character, any number of times. Then there’s going to be a space, and you’re matching a literal opening bracket (.
If you look back at the data here, every town is followed by a space and an opening bracket, so the pattern will be able to extract the town. And then .extract() will return the first capture group as a series.
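To illustrate, here’s a small sketch of that extraction. The pattern and the sample rows are assumptions based on the description above, not the lesson’s exact code, and expand=False is passed so that a single capture group comes back as a Series instead of a one-column DataFrame.

```python
import pandas as pd

# Hypothetical rows shaped like "Town (University)[n]".
towns = pd.DataFrame(
    {"town": ["Auburn (Auburn University)[1]", "Ann Arbor (University of Michigan)"]}
)

# Assumed pattern: capture as few characters as possible, up to a space
# followed by a literal opening bracket.
pattern = r"(.*?) \("

# expand=False returns a Series when the pattern has exactly one capture group.
cleaned = towns.assign(town=lambda df: df["town"].str.extract(pattern, expand=False))

print(cleaned["town"])
```

The lazy quantifier stops at the first space and opening bracket, so a town name that itself contains a space is kept whole.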
07:13 And as you can see, you have successfully cleaned up this data. Well done.
07:20 In this lesson, you’ve used the .assign() method to carry out some more complex cleaning operations on your dataset. In the next lesson, you’ll be moving on to the third and final dataset of this course, the books dataset.