Slicing and Dicing With .loc[]

Data Cleaning With pandas and NumPy Ian Currie 09:38

00:00 In this lesson, you’ll use the location indexer to explore the Olympic data you’ve just cleaned. It’s the most flexible and most reliable way to explore your data.

00:10 Slicing and dicing? Prefer .loc[], also known as LOC or the location indexer. There are many ways in pandas to get hold of your data and return a slice of your DataFrame, usually in the form of a series, meaning that you just grab one column or one row.

00:28 There are a few ways that you can do this with pandas. You can use the dot notation (.), which is a convenience method. With dot notation on the DataFrame, which here is olympics, you can refer to a column just as if it were an attribute, and this will return the column as a series.

00:46 This isn’t the preferred method for selecting your data, mainly because it’s not totally reliable. The DataFrame, olympics in this case, already has a bunch of attributes because it’s a DataFrame.

00:57 What the dot notation is trying to do is add the column names on top of the attributes it already has. So if you have a column name which is the same name as an attribute, pandas won’t know what you’re referring to and could return one or the other.

01:10 You can’t even count on it to return either always the attribute or always the column. It could return one or the other, depending on the situation. Also, this has the added drawback that you won’t be able to refer to columns that have spaces in their names, because Python attributes can’t have spaces.
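To make those two drawbacks concrete, here’s a minimal sketch. The DataFrame and its column names are made up for illustration and aren’t the Olympic data from the lesson. One column is deliberately named size, which is also the name of an existing DataFrame attribute, and another has a space in its name.

```python
import pandas as pd

# Hypothetical data, not the lesson's dataset.
# "size" collides with the built-in DataFrame.size attribute,
# and "gold medals" contains a space.
df = pd.DataFrame({
    "country": ["Norway", "Japan"],
    "size": [5, 125],
    "gold medals": [56, 142],
})

country_col = df.country             # no clash here, so you get the column
size_attr = df.size                  # here the attribute wins: the number of cells
size_col = df.loc[:, "size"]         # .loc[] always means the column
gold_col = df.loc[:, "gold medals"]  # no dot-notation spelling exists for this name

print(size_attr)
print(size_col)
```

In this particular case, the dot notation hands back the attribute rather than the column, and the column with a space can only be reached with a string label.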

01:27 Another way to get a column is to use square bracket ([]) notation like a dictionary. So on your DataFrame, olympics, use square brackets here, and put your column name inside them within quotation marks (""). That will return the column as a series. The square bracket notation suffers from some of the same problems the dot notation has. pandas has to deduce what you’re trying to refer to and may not always get that right.

01:55 Neither is as precise or as reliable as the next method. The .loc[] method, which is not really a method at all, but an attribute, is accessed with dot notation, and this returns a location indexer. This is a class instance of a location indexer.

02:11 You use the location indexer with square brackets, and in the square brackets, you can pass in the labels of rows and columns. The location indexer offers various ways to select your data, and you’ll see many of these ways as you progress through the course.
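As a rough sketch of passing row and column labels, here’s a small made-up DataFrame. The medal counts and the three-letter index labels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data with string row labels, so it's clear that
# .loc[] works with labels rather than positions.
medals = pd.DataFrame(
    {"gold": [10, 20, 30], "silver": [1, 2, 3]},
    index=["AUS", "NOR", "JPN"],
)

one_value = medals.loc["NOR", "gold"]           # row label, then column label
one_row = medals.loc["NOR"]                     # a whole row, returned as a series
subset = medals.loc[["AUS", "JPN"], ["gold"]]   # lists of labels return a DataFrame

print(one_value)
print(subset)
```

The row label always comes first and the column label second, whether you pass single labels or lists of labels.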

02:26 Since you’re using this special and specific attribute, which points to a class instance of the location indexer, pandas immediately knows that you’re referring to the data within the DataFrame and not the DataFrame object attributes.

02:39 This clarity is particularly obvious when trying to assign values. If you use .loc[] to select just one value from the DataFrame and assign that to 1, it will work no problem, whereas if you try to use square bracket notation directly on the DataFrame, you’ll first need to chain another pair of square brackets onto the first pair to select the row within the column. And while assigning it will usually work, it’s not 100% reliable, and pandas will give you warnings about doing this.

03:09 You’ll get to experiment with that later in this lesson. So just to sum up, use .loc[] (location indexer) for the most error-free experience going forward.

03:20 Looking at the cleaned-up headers of our Olympic data, you can run this with Control + Enter, and then you get the head, so we have something to work from.

03:34 So let’s say you wanted just the country column here. One way to do this is to directly reference it as an attribute.

03:47 And this will give you this series. You can also do this as if it were a dictionary.

03:59 And you can also do this with the location indexer.
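The code typed on screen isn’t part of the transcript, so here’s a hedged reconstruction of the three approaches. The small olympics DataFrame is a stand-in with invented values, not the real dataset.

```python
import pandas as pd

# Stand-in for the cleaned Olympic data; the values are invented.
olympics = pd.DataFrame({
    "country": ["Afghanistan (AFG)", "Algeria (ALG)", "Argentina (ARG)"],
    "gold": [0, 5, 18],
})

by_attribute = olympics.country        # dot notation
by_brackets = olympics["country"]      # dictionary-style square brackets
by_loc = olympics.loc[:, "country"]    # location indexer: all rows, one column

# All three return the same series in this simple case.
print(by_attribute.equals(by_brackets) and by_brackets.equals(by_loc))
```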

04:07 What is this colon (:) doing here? The location indexer takes two arguments: first the row, and then the column. Each of these arguments can be a slice. The colon is part of the slice syntax.

04:22 So if you wanted to be very explicit about selecting all the rows, you could slice rows 0:146, and this will give you the same result. But as you know, with the slicing object in Python, you can just abbreviate it and not put in anything.

04:39 If you just put in the colon, it implicitly means that you should select everything, which is why this works.

04:47 If you just wanted a slice, say the middle, 100:110, you can see that you get those ones here. In general, you always want to use the location indexer for this type of thing.
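Here’s a small sketch of the row slices described above, using a made-up ten-row DataFrame in place of the Olympic data. One detail worth knowing is that a .loc[] slice works on labels and includes the end label, unlike a regular Python list slice.

```python
import pandas as pd

# Hypothetical ten-row stand-in with the default integer row labels 0 to 9.
olympics = pd.DataFrame({"country": [f"Country {i}" for i in range(10)]})

everything = olympics.loc[:, "country"]    # a bare colon selects every row
explicit = olympics.loc[0:9, "country"]    # same rows, spelled out; label 9 is included
middle = olympics.loc[3:5, "country"]      # rows labeled 3, 4, and 5

print(everything.equals(explicit))
print(middle)
```

The same rule explains why a slice like 0:146 can cover a whole DataFrame whose last row label is 146.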

05:02 The problem with using the other methods can be seen when trying to assign values to a particular cell in a DataFrame. By cell, I mean the value located at one specific row and column. For example, you can use square bracket notation directly on the DataFrame to select the country column, then select row 5, and try to rename that value to just Australia, without the initials in brackets, by assigning the new value to that expression with the equal sign (=). In this case, this does actually work. But as you can see, pandas is not too happy about it.

05:44 Getting into the detail of why this is would be too much for this course, but the main point this warning is trying to communicate is that the assignment operation might get done in the copy and not get done on the actual DataFrame you’re working with. That is to say, it might get lost.
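For comparison, here’s a minimal sketch of the same assignment done through .loc[]. The six country strings are invented stand-ins for the real data. The chained version is left as a comment because whether it warns or takes effect depends on your pandas version.

```python
import pandas as pd

# Invented stand-in values; only row 5 matters for this example.
olympics = pd.DataFrame({
    "country": [
        "Afghanistan (AFG)", "Algeria (ALG)", "Argentina (ARG)",
        "Armenia (ARM)", "Australasia (ANZ)", "Australia (AUS)",
    ],
})

# One indexing step: row label 5, column label "country".
# pandas knows this targets the original DataFrame, not a copy.
olympics.loc[5, "country"] = "Australia"

# The chained form from the lesson, for reference only (not run here,
# since its behavior is version dependent):
# olympics["country"][5] = "Australia"

print(olympics.loc[5, "country"])
```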

06:02 In the last part of this lesson on .loc[], you’re going to be exploring using Boolean arrays to filter your data. This is a very powerful technique that’s used extensively when working with pandas.

06:14 Define a simple DataFrame here. As you can see, it’s defined as an array of arrays, a two-dimensional array, with two sub-arrays for the rows, each holding values for the columns day, month, and year. Here, you’ve got New Year’s Eve, 1999, and New Year’s Day, 2000.

06:36 Now, if you just take a look at data, you’ll see that it’s a small DataFrame. How does a Boolean array work here with LOC or .loc[]? It needs to be a Boolean array that is the same length as what you are going over.

06:53 So say you just wanted the first row. You could pass in True and False. And that will just return the first row, as you can see here.

07:05 But if you pass in True and True, then you’ll get both of them. Likewise, if you pass in False in the first one, you’ll just get the second row.

07:15 The same is true of the columns. So if you pass in True, False, True (since there are three columns, you need three values),

07:25 you’ll just get the day and the year here.
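The exact code on screen isn’t in the transcript, so treat this as a sketch under assumptions: the name data and the columns day, month, and year come from the lesson, while the row order below is a guess.

```python
import pandas as pd

# Assumed values and row order; the transcript only names the two dates.
data = pd.DataFrame(
    [[1, 1, 2000], [31, 12, 1999]],
    columns=["day", "month", "year"],
)

first_row = data.loc[[True, False]]    # one Boolean per row: keep only the first
both_rows = data.loc[[True, True]]     # keep both rows
second_row = data.loc[[False, True]]   # keep only the second

# One Boolean per column: keep day and year, drop month.
day_and_year = data.loc[:, [True, False, True]]

print(first_row)
print(day_and_year)
```

Each Boolean list has to be as long as the axis it filters: two values for the rows and three for the columns.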

07:29 So you often won’t be using it in this way. What you’ll usually do is select rows again, but pass in a conditional statement instead. So for example, if you were to take the day, again you’d want to select the whole column here, so we’ll select the whole column of "day", and then we’ll check if it’s greater than, let’s say, 5.

07:58 So what’s happening here? You’re selecting the whole of the column "day", which is two rows, and you’re checking if each of them is over 5.

08:08 So maybe look at that in isolation first, and that’s going to give you a series of False, True, because only the second value in this DataFrame is over 5.

08:22 So if you just take this whole thing and you pass it into

08:29 a data.loc[] expression, then it’s like passing this same array, so you will only get the second value here.
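Put together, the filter might look like the sketch below. The values and row order of data are assumptions, chosen so that the mask comes out as False, True like the one described above.

```python
import pandas as pd

# Assumed values and row order, picked to reproduce a False, True mask.
data = pd.DataFrame(
    [[1, 1, 2000], [31, 12, 1999]],
    columns=["day", "month", "year"],
)

# Step 1: compare the whole "day" column against 5.
# The result is a Boolean series with one value per row.
mask = data.loc[:, "day"] > 5

# Step 2: pass that series to .loc[] to keep only the rows marked True.
filtered = data.loc[mask]

print(mask)
print(filtered)
```

You can also write this in one line as data.loc[data.loc[:, "day"] > 5], which is the form you’ll often see when filtering.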

08:39 So this comes in very handy for filtering your data later, when you’re doing more exploration.

08:48 Why should you use .loc[]? It can do everything that the dot notation (.) or the square bracket notation ([]) can do.

08:54 It’s more explicit. It’s more reliable. It’s more flexible. It’s likely to be faster because it’s optimized for this job and doesn’t have to decide what it’s going to do. There are other reasons it’s faster, but you’re not going to get into those in this course.

09:08 And it’s also easier to debug because the error messages it gives you if you run into errors are more accurate. Again, dot and square bracket notation are quite overloaded. They can do a whole bunch of things, some not even related to indexing, so their error messages can be a bit more opaque.

09:27 Now that you’ve done some slicing and dicing with the location indexer, in the next lesson, you’ll start a new section and a new dataset, the university towns dataset.
