Slicing and Dicing With .loc
Slicing and dicing? Prefer .loc, also known as the location indexer. There are many ways in pandas to get hold of your data and return a slice of your DataFrame, usually in the form of a series, which means that you grab just one column or one row.
There are a few ways that you can do this with pandas. You can use dot notation (.), which is a convenience shortcut. With dot notation on the DataFrame, which here is olympics, you can refer to a column just as if it were an attribute, and this will return the column as a series.
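Here’s a minimal sketch of dot notation. The course’s olympics data isn’t reproduced in this transcript, so the DataFrame below is a hypothetical stand-in with made-up values, and the gold column name is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the course's `olympics` DataFrame (made-up values).
olympics = pd.DataFrame(
    {
        "country": ["Afghanistan (AFG)", "Algeria (ALG)", "Argentina (ARG)"],
        "gold": [0, 5, 21],
    }
)

# Dot notation: the column is read as if it were an attribute.
country = olympics.country

print(type(country).__name__)  # Series
print(country.name)  # country
```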
This isn’t the preferred method for selecting your data, mainly because it’s not totally reliable. The DataFrame, olympics in this case, already has a bunch of attributes because it’s a DataFrame.
00:57 What the dot notation is trying to do is add the column names on top of the attributes it already has. So if you have a column name which is the same name as an attribute, pandas won’t know what you’re referring to and could return one or the other.
01:10 You can’t even count on it to return either always the attribute or always the column. It could return one or the other, depending on the situation. Also, this has the added drawback that you won’t be able to refer to columns that have spaces in their names, because Python attributes can’t have spaces.
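As a rough illustration of both drawbacks, the sketch below uses a hypothetical DataFrame whose column names were picked on purpose to cause trouble. In the pandas versions this was written against, reading a clashing name with dot notation gives you the existing DataFrame attribute rather than the column, but treat that as an observation and not a guarantee.

```python
import pandas as pd

# Hypothetical columns chosen to trigger both problems (made-up values).
olympics = pd.DataFrame(
    {
        "country": ["Afghanistan (AFG)", "Algeria (ALG)"],
        "count": [2, 15],         # same name as the DataFrame.count() method
        "total medals": [2, 15],  # contains a space
    }
)

# Dot notation finds the existing method here, not the column.
clash = olympics.count
print(callable(clash))  # True

# Square brackets with the quoted name reach the column itself.
column = olympics["count"]
print(type(column).__name__)  # Series

# `olympics.total medals` is not valid Python syntax, so a column with a
# space in its name has to be selected by its quoted name.
spaced = olympics["total medals"]
print(spaced.name)  # total medals
```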
Another way to get a column is to use square bracket ([]) notation, like a dictionary. On your DataFrame, olympics, you use square brackets and put the column name inside them within quotation marks (""). That will return the column as a series. Square bracket notation suffers from some of the same problems that dot notation has: pandas has to deduce what you’re trying to refer to and may not always get that right.
Neither is as precise or as reliable as the next method. The .loc method, which is not really a method at all but an attribute, is accessed with dot notation and returns a location indexer, that is, an instance of a location indexer class.
02:11 You use the location indexer with square brackets, and in the square brackets, you can pass in the labels of rows and columns. The location indexer offers various ways to select your data, and you’ll see many of these ways as you progress through the course.
Since you’re using this special and specific attribute, which points to a class instance of the location indexer, pandas immediately knows that you’re referring to the data within the DataFrame and not the DataFrame object attributes.
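A short sketch of passing row and column labels to .loc might look like this. The DataFrame is again a hypothetical stand-in for olympics with made-up values, and the selections shown are only a few of the forms .loc accepts.

```python
import pandas as pd

# Hypothetical stand-in for the course's `olympics` DataFrame (made-up values).
olympics = pd.DataFrame(
    {
        "country": ["Afghanistan (AFG)", "Algeria (ALG)", "Argentina (ARG)"],
        "gold": [0, 5, 21],
    }
)

# .loc is an attribute, so there are no parentheses: labels go in square brackets.
value = olympics.loc[1, "country"]  # one cell: row label 1, column "country"
row = olympics.loc[1]               # one row, returned as a series
column = olympics.loc[:, "gold"]    # all rows of one column, as a series

# Label slices include the end label, so 0:1 selects rows 0 and 1.
subset = olympics.loc[0:1, ["country", "gold"]]

print(value)  # Algeria (ALG)
print(subset.shape)  # (2, 2)
```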
This clarity is particularly obvious when you try to assign values. If you use .loc to select just one value from the DataFrame and assign 1 to it, it will work with no problem. If you try to use square bracket notation directly on the DataFrame, you’ll first need to chain another pair of square brackets onto the first pair to select the row within the column. And while the assignment will usually work, it’s not 100% reliable, and pandas will give you warnings about doing this.
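Assigning 1 to a single value through .loc could be sketched as follows. The DataFrame, the gold column, and the choice of row 1 are all assumptions made for illustration, since the transcript doesn’t say which cell is being changed.

```python
import pandas as pd

# Hypothetical stand-in for the course's `olympics` DataFrame (made-up values).
olympics = pd.DataFrame(
    {
        "country": ["Afghanistan (AFG)", "Algeria (ALG)", "Argentina (ARG)"],
        "gold": [0, 5, 21],
    }
)

# Row label and column label go in one pair of brackets, so the
# assignment targets the DataFrame itself in a single step.
olympics.loc[1, "gold"] = 1

print(olympics.loc[1, "gold"])  # 1
```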
So if you wanted to be very explicit about selecting all the rows, you could slice rows 0:146, and this will give you the same result. But as you know, with slice notation in Python, you can abbreviate it and leave out the start and the stop.
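To see that the explicit slice and the bare colon agree, here’s a sketch on a six-row stand-in, where 0:5 plays the role that 0:146 plays for the full data. It assumes the selection being discussed is a whole column through .loc, which the transcript doesn’t show directly.

```python
import pandas as pd

# Hypothetical six-row stand-in; the real data apparently has labels 0 to 146.
olympics = pd.DataFrame(
    {
        "country": [
            "Afghanistan (AFG)",
            "Algeria (ALG)",
            "Argentina (ARG)",
            "Armenia (ARM)",
            "Australasia (ANZ)",
            "Australia (AUS)",
        ]
    }
)

# A bare colon means "every row".
all_rows = olympics.loc[:, "country"]

# Spelling out the first and last labels does the same thing,
# because .loc slices include the end label.
explicit = olympics.loc[0:5, "country"]

print(all_rows.equals(explicit))  # True
```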
The problem with using the other methods can be seen when trying to assign values to a particular cell in a DataFrame. By cell, I mean the value located at one specific row and column. For example, you could use square bracket notation directly on the DataFrame to select the country column, then select row 5, and try to rename it to just Australia, without the initials in brackets, by assigning the value on the right of the equal sign (=) to the expression on the left. In this case, this does actually work. But as you can see, pandas is not too happy about it.
05:44 Getting into the detail of why this is would be too much for this course, but the main point this warning is trying to communicate is that the assignment operation might get done in the copy and not get done on the actual DataFrame you’re working with. That is to say, it might get lost.
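The contrast can be sketched like this, using a hypothetical six-row stand-in for olympics. The chained version is run on a separate copy with its warnings captured, because which warning appears, and whether the assignment reaches the DataFrame at all, depends on your pandas version and settings. Only the .loc result is relied on here.

```python
import warnings

import pandas as pd

# Hypothetical six-row stand-in for the course's `olympics` DataFrame.
olympics = pd.DataFrame(
    {
        "country": [
            "Afghanistan (AFG)",
            "Algeria (ALG)",
            "Argentina (ARG)",
            "Armenia (ARM)",
            "Australasia (ANZ)",
            "Australia (AUS)",
        ]
    }
)

# Chained indexing: select the column, then the row, then assign.
# The outcome is version dependent, so it is not asserted or relied on.
chained = olympics.copy()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chained["country"][5] = "Australia"
print([w.category.__name__ for w in caught])

# One .loc call addresses the cell directly on the DataFrame.
olympics.loc[5, "country"] = "Australia"
print(olympics.loc[5, "country"])  # Australia
```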
In the last part of this lesson on .loc, you’re going to be exploring using Boolean arrays to filter your data. This is a very powerful technique that’s used extensively when working with pandas.
Define a simple DataFrame here. As you can see, it’s defined as an array of arrays, two-dimensional arrays, with two entries or two sub-arrays representing the columns
year. Here, you’ve got New Year’s Eve, 1999, and New Year’s Day, 2000.
Now, if you just take a look at data, you’ll see that it’s a small DataFrame. How does a Boolean array work here with .loc? It needs to be a Boolean array that is the same length as the axis you’re selecting along, with one True or False per row.
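A hand-written Boolean array passed to .loc might look like the sketch below. The column names are a reconstruction: the transcript only names day and year, so month is an assumption, as is the exact layout of the two rows.

```python
import pandas as pd

# Assumed reconstruction of the lesson's small DataFrame:
# New Year's Eve 1999 and New Year's Day 2000. "month" is a guess.
data = pd.DataFrame(
    [[31, 12, 1999], [1, 1, 2000]],
    columns=["day", "month", "year"],
)

# One Boolean per row: True keeps the row, False drops it.
first_only = data.loc[[True, False]]
print(first_only)

# A Boolean list of the wrong length is rejected.
try:
    data.loc[[True]]
except IndexError as error:
    print(f"Rejected: {error}")
```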
So you often won’t be using it in this way. What you will usually do instead is index the rows again, but pass in a conditional statement. For example, if you were to take the day, you’d again want to select the whole column, so we’ll select the whole column of
"day", and then we’ll check if it’s greater than let’s say
08:54 It’s more explicit. It’s more reliable. It’s more flexible. It’s likely to be faster because it’s optimized for this and doesn’t have to decide what it’s going to do. There are other reasons it’s faster, but you’re not going to get into that in this course.
09:08 And it’s also easier to debug, because the error messages it gives you if you run into errors are more accurate. Again, since dot and square bracket notation are quite overloaded (they can do a whole bunch of things, some not even related to indexing), their error messages can be a bit more opaque.