Join DataFrames Using Index Columns
In the previous lesson, you learned how you can specify the columns that you want to join your two DataFrames on using the
on parameter for
00:10 I also mentioned that you can join DataFrames on their index columns as well. Let’s take a look at the function signature to see how you can do that.
Over here in my Jupyter notebooks, I’m going to bring up the docstring and then scroll down to the additional parameters that you have to
Now we’ve looked at the
how parameter and a couple of the options that you have in there, and then you’ve also looked at
and now there’s a couple of other options. You have
left_on, which allows you to specify which columns or combination of columns to use in the left DataFrame that you want to combine with the right DataFrame.
So you can mix and match here. And then you also have the options of
right_index, which default to
False, both of them, but by setting these ones to
True, you can ask pandas to use the index columns instead of some of the named columns for performing the join.
I will give it a try in here by using both index columns at first. So in the
fruits DataFrame, it looks like this, and you can see it has an index column that starts at
0 and numbers the rows up to
3, and also the
veggies DataFrame has an index column that starts at
0 and goes up to
2. If you join these too on the index column, and this is what you will get.
merge(fruits, veggies) and then I will pass
True to say that I want to use the index column to find the intersections in the left DataFrame and then also in the right one.
01:59 Now again, you can pause and think about what’s going to happen when you execute this line of code. I’m going to run it, and you can see that it combines all the data in the matching rows. Here, the matching rows are based on data that might not have meaning. In this case, it probably doesn’t, but depending on what your DataFrame is, this could be some sort of ID, for example.
Now here you have
0 is the apple column in the
fruits DataFrame. And zero is the tomato column in the
So pandas combines these two and puts the information of the
fruits DataFrame in here, prepending the repetitive column names with
_x and then next to it in the same row, it puts the information that comes from the
veggies DataFrame, and same with the other matches on index
1, and index
2 and index
3 in the
fruits DataFrame doesn’t have a corresponding match in the
veggies DataFrame. So, because you’re performing the default inner join, it just gets dropped and isn’t part of the resulting DataFrame. This might make sense, like I said, if you have, for example, product IDs as your indices, it might make sense. In this case, it doesn’t really, but let’s look at the example where we assume that the index in the
veggies DataFrame is maybe actually an amount. Now for this, I will slightly edit the data we’ve been working with.
So I’ll just paste in here a slightly changed DataFrame for the fruits. So you see that I’m adding an
amount column and giving a couple of amounts to each of those items in there.
fruits has three columns and the third one holds an amount. It still has the index column, right? And the veggies DataFrame looks the same as it did before. Now, we could assume that maybe in this dataset here, it means that there’s zero tomatoes, one potato, and two carrots. You know, that’s probably not the greatest index column because there might be repetitive values, but let’s give it a try and merge these two DataFrames on the
amount column for the
fruits DataFrame and on the index of the
veggies DataFrame, just to show you how you can mix and match all of these different ways of combining the data. Again, you can pause the video and try this yourself before you see the solution.
I’m going to do
pd.merge(), pass in
fruits, veggies. Now I want to join the left DataFrame on the
"amount" column, and on the right DataFrame, I want to use the index.
So I’m going to say
right index=True and
left_on, I’m specifying the
And if I execute this, you get a combined DataFrame that might or might not make sense, but it found the row. In this case, the index row
0 in the
veggies DataFrame was matched with index row
2 of the
fruits DataFrame, because the amount here was zero, and that’s the column that you specified would be used to perform the merge.
So in here you see that the tomato here with the amount
0 got matched with the tomato over here from the
veggies DataFrame, and you can also pick apart the other ones. Let’s do one more.
apple here has an amount of
1, and here you specified in the
fruits DataFrame that the
amount column would be the one to consider for the merge.
And here it found a matching column in the
potato that has the index
1, which is why the apple and the potato got combined into a new row in this new DataFrame. And again, pandas keeps track of where does the data come from and adds this automatic
_x for the left DataFrame and automatic
_y for potential duplicate column names from the right DataFrame.
In this way, you can mix and match. You can define
right_on, you can define
left_index, and pick and choose which columns or even indices in your DataFrames contain the information that you want to perform the merge on.
06:19 So this is very flexible and gives you a lot of opportunities. And depending on your data, some of them might make sense, and others won’t make sense, of course.
In the next lesson, you will learn how you can modify
_y, that default appended strings that pandas creates, and make it a little more descriptive of which DataFrame did it actually come from, or what was the name of the DataFrame, for example.
Become a Member to join the conversation.