Filtering With .where() and .filter()
00:00
Another method that you may be interested in is called .where()
. The .where()
method on a DataFrame—it’s going to replace values in the DataFrame or in your Series or whichever one you’re working with.
00:13
It’s going to replace values where the condition is False
. So, for example, let’s take a look at the 'django-score'
or the 'js-score'
.
00:23
What we’re going to do is we’re going to have a condition of, say, at least 80. And so the .where()
method takes on a cond
(conditional) keyword argument.
00:33
This is going to be the condition that we want to apply. This is going to be one of these Boolean series, so here we’re going to have 'js-score'
of at least, say, 80
.
00:45
That’s going to be our Boolean series. And so in this js-score
, if any of the values is not greater than 80
, it’s going to be replaced by some other value that we want, and that’s passed in by this other
keyword argument.
01:01
We’re going to set all of the JS scores that have a value that’s less than 80
to a 0.0
value. Let me run that here. Here, that’s what happens.
01:12
So whenever you need to change or want to change a value in either a Series or in several columns in a DataFrame to some other type of default value when a certain condition isn’t satisfied, then your best bet is to use the .where()
method.
01:28
If you’re familiar with NumPy, you know that there’s a similar method in the numpy
module called where()
, which pretty much does the same thing.
01:36
The function will replace a value with some given value, provided a condition is False
.
01:43
Now, there happens to be a method on a DataFrame that’s actually called .filter()
, but this method doesn’t filter a DataFrame on its content.
01:52 Instead, it filters it on the labels. So, for example, let’s suppose that we had a really large DataFrame that had columns that we were interested in, but it was too difficult for us to type out all of the columns.
02:08
But we knew, for instance, that all of the columns that we were interested in contained, say, the "score"
string. What we could do is, using the DataFrame, we’re going to call the .filter()
method.
02:21
And the .filter()
method can either be used by taking in a keyword argument called items
… The items
keyword argument would accept the names of the columns that we wanted to filter out. So, for example, we could filter out the "py-score"
columns and the "js-score"
columns, and also the "django-score"
.
02:44
Now, we know how to do this separately if we were to just use the accessor methods, but this is another way to do that. So again, this is just going to pull out the columns that have the score. Now, if you had a lot of columns that were named "score"
, what you could do instead is, instead of passing in the items
keyword argument, you would pass in the like
keyword arguments.
03:08
And what this is going to be doing is you’re going to pass in a string and then it’s going to keep the columns that contain whatever string you’re passing in in the name. So for example, we know that all of the columns that have a score in either the Python score, that JS score, or the Django score have, as a substring, "score"
.
03:29
And so what this is going to do is keep the columns that have "score"
in their title name. In that case, we’re going to get the exact same columns.
03:40
Now by default, the axis where the labels are going to be tested with this is going to be columns for a DataFrame. And so in this case, the axis
keyword is actually set to 1
, and so this would give us that. Now, if we set this to 0
, we’re probably going to get an empty DataFrame because, of course, none of the labels in our DataFrame contain the string "score"
.
04:07
One other last thing to note is that instead of either items
or like
, you can also pass in a regular expression. If you’re familiar with regular expressions, in this case, if we just wrote in "score"
, this would be a substring of the column labels for the "py-score"
, the "django-score"
, and the "js-score"
.
04:27 But if you had a more complicated name for the columns that satisfied a certain pattern, then you can use regular expressions as well.
04:36 Let’s do a recap of the different ways to filter rows and columns from a DataFrame.
04:42 We started off with filtering data using comparison operators, and the basic idea there is that you want to extract rows in the DataFrame where the value in a certain column satisfies a certain condition.
04:55
So in this example, we are focusing on the 'py-score'
column. We want to know which values there have at least a value of 80
. And then we saw that this returns a pandas Series
object of True
and False
values. And whenever we have a True
, those are the rows that are extracted, so the corresponding row where that pandas Series has a value of True
.
05:19 And so that’s a basic way to extract rows under a certain condition on a column—or you can use more than one column if necessary.
05:27
A way to do this is to use logical operators. So for example, if we also wanted to pull out all of those rows that had 'py-score'
of at least 80
and a 'js-score'
of at least 80
, then we use the bitwise AND operator (&
). The bitwise AND operators, they are overwritten in pandas so that they can work in an element-wise way.
05:49
And so in this case, both of these pandas Series
objects are compared element-wise, and whenever both particular elements are True
, the resulting pandas Series at that element has a truth value. And in that case, those are the rows that we’re going to extract from the main DataFrame.
06:07
And then you took a look at this .where()
method, where you pass in a certain condition. And again, this is going to be a pandas Series True
, False
array.
06:17
And whenever we have a False
value in a particular element in that Series
object, we’re going to replace the corresponding row in whatever column that we are specifying with a certain value that we want to set.
06:32
And so instead of filtering out and then setting the values, you can do this directly using the .where()
method.
06:41
And then a powerful method that I think is really useful is this .filter
method, where instead of filtering out data in terms of the rows, you filter out the columns.
06:51
So if you only want to keep columns where the name of the column satisfies a certain condition, then the .filter()
method is the way to go.
06:58
And if you combine this, say, with regular expressions—if you have a really large DataFrame and the columns have this sort of pattern that you want to match and you only want to keep those—then you could use a regular expression or the like
keyword argument, where if the columns that you’re interested in had a certain substring, you can use the like
keyword argument. All right!
07:20 So, that’s a rundown there on the different ways that you can filter either rows or columns from a pandas DataFrame. Up next, we’ll see how you can compute basic statistics on the columns of a DataFrame that contain numerical values.
Become a Member to join the conversation.
rrrnau on Nov. 3, 2022
Hi, I see that .filter() doesn’t allow multiple conditions on “like” nor on “regex”. Actually, they’re mutually exclusive, so it’s only one condition at a time. Is there a way to filter on columns (or rows, for that matter) that allow multiple conditions?
Thanks in advance!