Filtering and Assigning in a DataFrame
00:00 Alright. This next bit is that un-Python thing that I warned you about. The .loc thing, I’m not comfortable calling it an attribute for some reason, is a way of accessing rows, columns, or splits in a DataFrame.
00:14 It supports a variety of access mechanisms, the most common of which is just a row index name.
00:21 In this case though, I’m using it to do conditional accessing.
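For comparison, here is a minimal sketch of the more common case, looking up a row by its index name. The DataFrame name stars, the row labels, and the values are all made up for illustration and are not from the course.

```python
import pandas as pd

# Hypothetical data: labels and values are invented for this sketch.
stars = pd.DataFrame(
    {"radius_sol": [1.0, 9.5, 0.4]},
    index=["star_a", "star_b", "star_c"],
)

# A single row label returns that row as a Series.
row = stars.loc["star_a"]

# A row label plus a column name returns a single value.
value = stars.loc["star_b", "radius_sol"]
```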
00:25 Before explaining exactly what this line is doing, a brief tangent. Remember from the histogram where Matplotlib has the idea of an edge color and a face color? Well, that’s the line around a shape and what it’s filled with.
00:36 The default when you set a face color is for the edge to use that same color. For our stars, though, that can get a little ugly. Some of the bigger stars in our chart cluster together and it just ends up being a same-colored blob.
00:49 So to distinguish them from each other, I want to have a black outline around the individual stars. The problem with the black outline is that for the smaller stars, you then can’t really make out the fill color.
01:00 So what I want to do is pick a size, and for stars below a certain size, I am going to have the same edge and face color. And for stars above that boundary, I want the edge and face to be different.
01:12 Okay, so that’s what this .loc is for. This line is doing a query for all the values in the radius_sol column that are bigger than or equal to seven.
01:23 And for each of those values, I’m setting the edge color, that’s the second argument in .loc, to “black”, that’s the assignment value.
01:32 This reads weird to me as you’ve got a single assignment, but the thing on the left is a set of rows. So the assignment is happening across those rows. When you pass two arguments to .loc, the first is a row specifier and the second is a column specifier.
01:45 So in this case, the row specifier is a conditional, while the column specifier is just the “edge_color” column. As there isn’t yet an “edge_color” column, pandas creates one.
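The line on screen isn’t captured in the transcript, so what follows is only a sketch of what the call described here might look like. It assumes a DataFrame named df with radius_sol and color columns. The sample values are invented.

```python
import pandas as pd

# Hypothetical sample data; the course's real dataset isn't shown here.
df = pd.DataFrame({
    "radius_sol": [0.5, 7.0, 12.3, 3.1],
    "color": ["red", "blue", "white", "yellow"],
})

# Row specifier: a Boolean condition on the radius_sol column.
big_stars = df.radius_sol >= 7

# Column specifier: "edge_color", which doesn't exist yet, so pandas
# creates it. Rows that don't match the condition are left as missing.
df.loc[big_stars, "edge_color"] = "black"
```

After this runs, only the rows with a radius of seven or more have an edge color. The others hold a missing value until something else fills them in.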
01:56 Got it? Alright. Let’s see if you did. Time for a pop quiz. Take a second and think through what this call is doing.
02:07 This is similar to the previous line. The row specifier to .loc is a conditional, but this time for rows where the radius_sol column is less than seven.
02:16 The column specifier to .loc is “edge_color”, which now exists because of the previous call, and the assignment to these rows is the corresponding value in the “color” column.
02:27 The reason this works is because all the rows in a DataFrame have an index value. When you do something with a column, it also keeps that same index value.
02:36 So when you assign the df.color column to the subset of rows, pandas is smart enough to only do this for the subset of index values specified in the conditional.
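To see that index matching in action, here is a minimal sketch of both assignments together. The exact lines from the video aren’t in the transcript, so the code is an assumption based on the description, and the sample values are invented.

```python
import pandas as pd

# Hypothetical sample data; the course's real dataset isn't shown here.
df = pd.DataFrame({
    "radius_sol": [0.5, 7.0, 12.3, 3.1],
    "color": ["red", "blue", "white", "yellow"],
})

# Big stars get a black outline.
df.loc[df.radius_sol >= 7, "edge_color"] = "black"

# Small stars get an outline matching their fill. The right-hand side is
# the whole color column, but pandas lines it up by index and only writes
# to the rows selected by the condition.
df.loc[df.radius_sol < 7, "edge_color"] = df.color
```

With this sample data, the rows at index 0 and 3 pick up “red” and “yellow” from their own color values, while the rows at index 1 and 2 keep “black”.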
02:47 pandas is really, really powerful. But if you’re used to plain old procedural programming, this can take a little getting used to.
02:55 There’s a fair amount of magic going on behind the scenes here. The good news is once you get used to it, you can do a lot with very little code. The bad news is until you’re used to it, it isn’t quite like reading Python.
03:06 Some of the operations aren’t self-evident. Alright, our data is in place. Now it’s time to graph it.