Mark Your DataFrames With Keys
00:00
In this lesson, you’ll look at the optional keys argument to pd.concat(), which helps you deal with situations where concatenation leaves you with multiple labels of the same name, whether that happens along the column axis or the row axis.
00:18 Now, before I apply this, let’s take another look at the function signature,
00:24
and you can see in here, there’s the keys argument that you’ll use. There’s even an explanatory docstring for the keys argument, which says that it can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same or overlapping on the passed axis number.
00:45
Let’s look at that in practice by redoing this exact concatenation, but this time you’ll add the keys argument, so you’ll say keys=.
01:00
The first one that you’re passing in is the fruits DataFrame, so let’s just call it fruits. And the second one is the veggies DataFrame.
01:08 So you’re just giving the names of the DataFrames. And now when you execute this call, you get a better visual understanding of what the different columns relate to.
01:19
So as you can see, it correctly shows you that this part belongs to the fruits DataFrame and this part (well, actually not this part down here, but this part) is the veggies DataFrame.
01:33
However, because the third row is missing in the veggies DataFrame, the missing values for that row still appear under the columns of the veggies DataFrame.
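The call described above might look roughly like the following sketch. The contents of fruits and veggies are hypothetical stand-ins, because the lesson’s actual tables don’t appear in the transcript. They’re only shaped to match the description: fruits has three rows and veggies has two.

```python
import pandas as pd

# Hypothetical sample data; the lesson's real tables are not shown in the transcript.
fruits = pd.DataFrame(
    {"name": ["apple", "banana", "cherry"], "color": ["red", "yellow", "red"]}
)
veggies = pd.DataFrame(
    {"name": ["carrot", "leek"], "color": ["orange", "green"], "taste": ["sweet", "mild"]}
)

# Concatenate side by side and label each source DataFrame with a key.
wide = pd.concat([fruits, veggies], axis="columns", keys=["fruits", "veggies"])

# The column labels are now (key, original column name) pairs.
print(wide.columns.tolist())

# Selecting a key returns only the columns that came from that DataFrame.
# The third row holds missing values here because veggies has only two rows.
print(wide["veggies"])
```

With these stand-in tables, the duplicate name and color columns stay distinguishable because each one sits under its own key.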
01:43
Now, turn this around and use the same call, but instead of axis="columns", use the default, which is rows. I’ll put it in explicitly, but you could also skip this argument entirely.
01:56
Then you get the same concatenation that you got at the beginning. And again, you can see down here that this part relates to your veggies DataFrame. There are no NaN values in this part, but there are some up there, because fruits has one fewer column.
02:12
This is the fruits DataFrame, and then here are the NaN values that it needs to fill in to create a full table. So this is how you can use the keys argument to pd.concat() to give a better idea of where the data came from and get rid of the ambiguity of having, for example, multiple 0 indices or multiple columns of the same name.
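To make the row-wise case concrete, here’s a minimal sketch that compares the concatenation with and without keys. The sample data is again hypothetical, shaped so that fruits has one fewer column than veggies. The sketch uses axis="index", which is the documented name for the row axis.

```python
import pandas as pd

# Hypothetical sample data; the lesson's real tables are not shown in the transcript.
fruits = pd.DataFrame(
    {"name": ["apple", "banana", "cherry"], "color": ["red", "yellow", "red"]}
)
veggies = pd.DataFrame(
    {"name": ["carrot", "leek"], "color": ["orange", "green"], "taste": ["sweet", "mild"]}
)

# Without keys, the row labels 0 and 1 each appear twice.
plain = pd.concat([fruits, veggies], axis="index")
print(plain.index.is_unique)

# With keys, every row label is paired with the name of its source DataFrame.
labeled = pd.concat([fruits, veggies], axis="index", keys=["fruits", "veggies"])
print(labeled.index.is_unique)

# The fruits rows get NaN in the taste column, which only veggies provides.
print(labeled)
```

Because axis="index" is the default, leaving the argument out should give the same result.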
02:39
In this lesson, you’ve used the optional keys argument to pd.concat() to mark your DataFrames and avoid collisions that come from duplicate label values after concatenation.
02:50
And you did that by calling pd.concat() and passing an iterable to the keys parameter. Often, you’d just put in the names of the DataFrames that you’re concatenating.
03:01 And this constructs a multi-index DataFrame that you can then also use to access specific parts of the DataFrame despite duplicate labels.
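As a small preview of that kind of access, the sketch below selects rows by key from a row-wise concatenation. The data is hypothetical sample data, and the .loc calls are just one way of reading from the resulting index, not necessarily the approach the course takes.

```python
import pandas as pd

# Hypothetical sample data; the lesson's real tables are not shown in the transcript.
fruits = pd.DataFrame(
    {"name": ["apple", "banana", "cherry"], "color": ["red", "yellow", "red"]}
)
veggies = pd.DataFrame(
    {"name": ["carrot", "leek"], "color": ["orange", "green"], "taste": ["sweet", "mild"]}
)

combined = pd.concat([fruits, veggies], keys=["fruits", "veggies"])

# Select every row that came from the veggies DataFrame.
veggie_rows = combined.loc["veggies"]
print(veggie_rows)

# Select one specific row: label 0 from the fruits DataFrame.
first_fruit = combined.loc[("fruits", 0), :]
print(first_fruit)
```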
03:12
In the next lesson, you’ll learn how to access specific data items in such a multi-index DataFrame. That means you’ll take a quick break from the different arguments to pd.concat(), stick with the keys one, and figure out what the advantages of using it are and which errors you’d run into if you didn’t.