Filtering and Assigning in a DataFrame

00:00 Alright. This next bit is that un-Python thing that I warned you about. The .loc thing, I’m not comfortable calling it an attribute for some reason, is a way of accessing rows, columns, or slices of a DataFrame.

00:14 It supports a variety of access mechanisms, the most common of which is just a row index name.
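As a quick sketch of that most common case, here is .loc used with a row index name. The DataFrame contents and the star names used as index labels are made up for illustration; only the radius_sol and color column names come from the lesson.

```python
import pandas as pd

# Hypothetical sample data: the index labels and values are placeholders.
df = pd.DataFrame(
    {"radius_sol": [0.5, 9.0, 1.2], "color": ["red", "blue", "yellow"]},
    index=["star_a", "star_b", "star_c"],
)

# A single row label returns that row as a Series.
row = df.loc["star_b"]

# A row label plus a column label returns a single cell.
value = df.loc["star_b", "color"]

print(row)
print(value)
```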

00:21 In this case though, I’m using it to do conditional accessing.

00:25 Before explaining exactly what this line is doing, a brief tangent. Remember from the histogram where Matplotlib has the idea of an edge color and a face color? Well, that’s the line around a shape and what it’s filled with.

00:36 The default when you set a face color is to use that same color for the line. For our stars, though, that can get a little ugly. Some of the bigger stars in our chart cluster together and it just ends up being a same-colored blob.

00:49 So to distinguish them from each other, I want to have a black outline around the individual stars. The problem with the black outline is that for the smaller stars, you then can’t really make out the fill color.

01:00 So what I want to do is pick a size, and for stars below a certain size, I am going to have the same edge and face color. And for stars above that boundary, I want the edge and face to be different.

01:12 Okay, so that’s what this .loc is for. This line is doing a query for all the values in the radius_sol column, which are bigger than or equal to seven.
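To see what the query part does on its own, here is a minimal sketch with made-up data. The comparison produces a Series of True and False values, one per row, and .loc keeps only the rows marked True. The name mask is just a label picked for this example.

```python
import pandas as pd

# Hypothetical sample data: the values are placeholders.
df = pd.DataFrame(
    {
        "radius_sol": [0.5, 9.0, 7.0, 1.2],
        "color": ["red", "blue", "white", "yellow"],
    }
)

# Comparing a column to a number gives a boolean Series, one entry per row.
mask = df.radius_sol >= 7
print(mask)

# Passing that boolean Series to .loc selects only the rows where it is True.
big_stars = df.loc[mask]
print(big_stars)
```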

01:23 And for each of those values, I’m setting the edge color (that’s the second argument in .loc) to “black” (that’s the assignment value).

01:32 This reads weird to me as you’ve got a single assignment, but the thing on the left is a set of rows. So the assignment is happening across those rows. When you pass two arguments to .loc, the first is a row specifier and the second is a column.

01:45 So in this case, the row specifier is a conditional, while the column specifier is just the “edge_color” column. As there isn’t yet an “edge_color” column, pandas creates one.
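Here is a sketch of that kind of line, run against made-up data. The column names and the cutoff of seven come from the lesson; the sample values are placeholders, and the exact line on screen may differ from this.

```python
import pandas as pd

# Hypothetical sample data: the values are placeholders.
df = pd.DataFrame(
    {
        "radius_sol": [0.5, 9.0, 7.0, 1.2],
        "color": ["red", "blue", "white", "yellow"],
    }
)

# Row specifier: a conditional. Column specifier: "edge_color".
# The column does not exist yet, so pandas creates it.
df.loc[df.radius_sol >= 7, "edge_color"] = "black"

print(df)
```

Rows that did not match the conditional are left with a missing value in the new column, which is why a second assignment is needed to fill in the smaller stars.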

01:56 Got it? Alright. Let’s see if you did. Time for a pop quiz. Take a second and think through what this call is doing.

02:07 This is similar to the previous line. The row specifier to .loc is a conditional, but this time for rows where the radius_sol column is less than seven.

02:16 The column specifier to .loc is “edge_color”, which now exists because of the previous call, and the assignment to these rows is the corresponding value in the “color” column.

02:27 This works because all the rows in a DataFrame have an index value. When you pull out a column, it keeps those same index values.

02:36 So when you assign the df.color column to the subset of rows, pandas is smart enough to only do this for the subset of index values specified in the conditional.
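Putting both assignments together, a sketch on made-up data looks something like this. The sample values are placeholders; the point is that the whole df.color column sits on the right-hand side, and pandas lines it up by index so that only the selected rows receive a value.

```python
import pandas as pd

# Hypothetical sample data: the values are placeholders.
df = pd.DataFrame(
    {
        "radius_sol": [0.5, 9.0, 7.0, 1.2],
        "color": ["red", "blue", "white", "yellow"],
    }
)

# Bigger stars get a black outline.
df.loc[df.radius_sol >= 7, "edge_color"] = "black"

# Smaller stars get an outline matching their fill color. The right-hand
# side is the full column, but only the rows picked by the conditional
# are written to, matched up by index value.
df.loc[df.radius_sol < 7, "edge_color"] = df.color

print(df["edge_color"].tolist())  # ['red', 'black', 'black', 'yellow']
```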

02:47 pandas is really, really powerful. But if you’re used to plain old procedural programming, this can take a little getting used to.

02:55 There’s a fair amount of magic going on behind the scenes here. The good news is once you get used to it, you can do a lot with very little code. The bad news is until you’re used to it, it isn’t quite like reading Python.

03:06 Some of the operations aren’t self-evident. Alright, our data is in place. Now it’s time to graph it.
