Join DataFrames Using Index Columns
00:00
In the previous lesson, you learned how you can specify the columns that you want to join your two DataFrames on using the on
parameter for pd.merge()
.
00:10 I also mentioned that you can join DataFrames on their index columns as well. Let’s take a look at the function signature to see how you can do that.
00:20
Over here in my Jupyter notebooks, I’m going to bring up the docstring and then scroll down to the additional parameters that you have to pd.merge()
.
00:31
Now we’ve looked at the how
parameter and a couple of the options that you have in there, and then you’ve also looked at on
,
00:39
and now there’s a couple of other options. You have left_on
, which allows you to specify which columns or combination of columns to use in the left DataFrame that you want to combine with the right DataFrame.
00:51
So you can mix and match here. And then you also have the options of left_index
and right_index
, which default to False
, both of them, but by setting these ones to True
, you can ask pandas to use the index columns instead of some of the named columns for performing the join.
01:14
I will give it a try in here by using both index columns at first. So in the fruits
DataFrame, it looks like this, and you can see it has an index column that starts at 0
and numbers the rows up to 3
, and also the veggies
DataFrame has an index column that starts at 0
and goes up to 2
. If you join these too on the index column, and this is what you will get.
01:43
merge(fruits, veggies)
and then I will pass left_index
to True
to say that I want to use the index column to find the intersections in the left DataFrame and then also in the right one.
01:59 Now again, you can pause and think about what’s going to happen when you execute this line of code. I’m going to run it, and you can see that it combines all the data in the matching rows. Here, the matching rows are based on data that might not have meaning. In this case, it probably doesn’t, but depending on what your DataFrame is, this could be some sort of ID, for example.
02:24
Now here you have 0
is the apple column in the fruits
DataFrame. And zero is the tomato column in the veggies
DataFrame.
02:32
So pandas combines these two and puts the information of the fruits
DataFrame in here, prepending the repetitive column names with _x
and then next to it in the same row, it puts the information that comes from the veggies
DataFrame, and same with the other matches on index 1
, and index 2
and index 3
in the fruits
DataFrame doesn’t have a corresponding match in the veggies
DataFrame. So, because you’re performing the default inner join, it just gets dropped and isn’t part of the resulting DataFrame. This might make sense, like I said, if you have, for example, product IDs as your indices, it might make sense. In this case, it doesn’t really, but let’s look at the example where we assume that the index in the veggies
DataFrame is maybe actually an amount. Now for this, I will slightly edit the data we’ve been working with.
03:28
So I’ll just paste in here a slightly changed DataFrame for the fruits. So you see that I’m adding an amount
column and giving a couple of amounts to each of those items in there.
03:43
So now fruits
has three columns and the third one holds an amount. It still has the index column, right? And the veggies DataFrame looks the same as it did before. Now, we could assume that maybe in this dataset here, it means that there’s zero tomatoes, one potato, and two carrots. You know, that’s probably not the greatest index column because there might be repetitive values, but let’s give it a try and merge these two DataFrames on the amount
column for the fruits
DataFrame and on the index of the veggies
DataFrame, just to show you how you can mix and match all of these different ways of combining the data. Again, you can pause the video and try this yourself before you see the solution.
04:27
I’m going to do pd.merge()
, pass in fruits, veggies
. Now I want to join the left DataFrame on the "amount"
column, and on the right DataFrame, I want to use the index.
04:41
So I’m going to say right index=True
and left_on
, I’m specifying the "amount"
column.
04:48
And if I execute this, you get a combined DataFrame that might or might not make sense, but it found the row. In this case, the index row 0
in the veggies
DataFrame was matched with index row 2
of the fruits
DataFrame, because the amount here was zero, and that’s the column that you specified would be used to perform the merge.
05:13
So in here you see that the tomato here with the amount 0
got matched with the tomato over here from the veggies
DataFrame, and you can also pick apart the other ones. Let’s do one more. apple
here has an amount of 1
, and here you specified in the fruits
DataFrame that the amount
column would be the one to consider for the merge.
05:35
And here it found a matching column in the potato
that has the index 1
, which is why the apple and the potato got combined into a new row in this new DataFrame. And again, pandas keeps track of where does the data come from and adds this automatic _x
for the left DataFrame and automatic _y
for potential duplicate column names from the right DataFrame.
06:01
In this way, you can mix and match. You can define left_on
and right_on
, you can define right_index
and left_index
, and pick and choose which columns or even indices in your DataFrames contain the information that you want to perform the merge on.
06:19 So this is very flexible and gives you a lot of opportunities. And depending on your data, some of them might make sense, and others won’t make sense, of course.
06:30
In the next lesson, you will learn how you can modify _x
and _y
, that default appended strings that pandas creates, and make it a little more descriptive of which DataFrame did it actually come from, or what was the name of the DataFrame, for example.
Become a Member to join the conversation.