Join DataFrames Using Index Columns

Combining Data in pandas With concat() and merge() Martin Breuss 06:48

00:00 In the previous lesson, you learned how you can specify the columns that you want to join your two DataFrames on using the on parameter for pd.merge().

00:10 I also mentioned that you can join DataFrames on their index columns as well. Let’s take a look at the function signature to see how you can do that.

00:20 Over here in my Jupyter notebooks, I’m going to bring up the docstring and then scroll down to the additional parameters that you have to pd.merge().

00:31 Now we’ve looked at the how parameter and a couple of the options that you have in there, and then you’ve also looked at on,

00:39 and now there’s a couple of other options. You have left_on, which allows you to specify which columns or combination of columns to use in the left DataFrame that you want to combine with the right DataFrame.

00:51 So you can mix and match here. And then you also have the options of left_index and right_index, which default to False, both of them, but by setting these ones to True, you can ask pandas to use the index columns instead of some of the named columns for performing the join.

01:14 I will give it a try in here by using both index columns at first. So in the fruits DataFrame, it looks like this, and you can see it has an index column that starts at 0 and numbers the rows up to 3, and also the veggies DataFrame has an index column that starts at 0 and goes up to 2. If you join these too on the index column, and this is what you will get.

01:43 merge(fruits, veggies) and then I will pass left_index to True to say that I want to use the index column to find the intersections in the left DataFrame and then also in the right one.

01:59 Now again, you can pause and think about what’s going to happen when you execute this line of code. I’m going to run it, and you can see that it combines all the data in the matching rows. Here, the matching rows are based on data that might not have meaning. In this case, it probably doesn’t, but depending on what your DataFrame is, this could be some sort of ID, for example.

02:24 Now here you have 0 is the apple column in the fruits DataFrame. And zero is the tomato column in the veggies DataFrame.

03:28 So I’ll just paste in here a slightly changed DataFrame for the fruits. So you see that I’m adding an amount column and giving a couple of amounts to each of those items in there.

04:27 I’m going to do pd.merge(), pass in fruits, veggies. Now I want to join the left DataFrame on the "amount" column, and on the right DataFrame, I want to use the index.

04:41 So I’m going to say right index=True and left_on, I’m specifying the "amount" column.

04:48 And if I execute this, you get a combined DataFrame that might or might not make sense, but it found the row. In this case, the index row 0 in the veggies DataFrame was matched with index row 2 of the fruits DataFrame, because the amount here was zero, and that’s the column that you specified would be used to perform the merge.

05:13 So in here you see that the tomato here with the amount 0 got matched with the tomato over here from the veggies DataFrame, and you can also pick apart the other ones. Let’s do one more. apple here has an amount of 1, and here you specified in the fruits DataFrame that the amount column would be the one to consider for the merge.

05:35 And here it found a matching column in the potato that has the index 1, which is why the apple and the potato got combined into a new row in this new DataFrame. And again, pandas keeps track of where does the data come from and adds this automatic _x for the left DataFrame and automatic _y for potential duplicate column names from the right DataFrame.

06:01 In this way, you can mix and match. You can define left_on and right_on, you can define right_index and left_index, and pick and choose which columns or even indices in your DataFrames contain the information that you want to perform the merge on.

06:19 So this is very flexible and gives you a lot of opportunities. And depending on your data, some of them might make sense, and others won’t make sense, of course.

06:30 In the next lesson, you will learn how you can modify _x and _y, that default appended strings that pandas creates, and make it a little more descriptive of which DataFrame did it actually come from, or what was the name of the DataFrame, for example.

Become a Member to join the conversation.