Locked learning resources

Join us and get access to thousands of tutorials and a community of expert Pythonistas.

Unlock This Lesson

Locked learning resources

This lesson is for members only. Join us and get access to thousands of tutorials and a community of expert Pythonistas.

Unlock This Lesson

Join DataFrames Using Index Columns

00:00 In the previous lesson, you learned how you can specify the columns that you want to join your two DataFrames on using the on parameter for pd.merge().

00:10 I also mentioned that you can join DataFrames on their index columns as well. Let’s take a look at the function signature to see how you can do that.

00:20 Over here in my Jupyter notebooks, I’m going to bring up the docstring and then scroll down to the additional parameters that you have to pd.merge().

00:31 Now we’ve looked at the how parameter and a couple of the options that you have in there, and then you’ve also looked at on,

00:39 and now there’s a couple of other options. You have left_on, which allows you to specify which columns or combination of columns to use in the left DataFrame that you want to combine with the right DataFrame.

00:51 So you can mix and match here. And then you also have the options of left_index and right_index, which default to False, both of them, but by setting these ones to True, you can ask pandas to use the index columns instead of some of the named columns for performing the join.

01:14 I will give it a try in here by using both index columns at first. So in the fruits DataFrame, it looks like this, and you can see it has an index column that starts at 0 and numbers the rows up to 3, and also the veggies DataFrame has an index column that starts at 0 and goes up to 2. If you join these too on the index column, and this is what you will get.

01:43 merge(fruits, veggies) and then I will pass left_index to True to say that I want to use the index column to find the intersections in the left DataFrame and then also in the right one.

01:59 Now again, you can pause and think about what’s going to happen when you execute this line of code. I’m going to run it, and you can see that it combines all the data in the matching rows. Here, the matching rows are based on data that might not have meaning. In this case, it probably doesn’t, but depending on what your DataFrame is, this could be some sort of ID, for example.

02:24 Now here you have 0 is the apple column in the fruits DataFrame. And zero is the tomato column in the veggies DataFrame.

02:32 So pandas combines these two and puts the information of the fruits DataFrame in here, prepending the repetitive column names with _x and then next to it in the same row, it puts the information that comes from the veggies DataFrame, and same with the other matches on index 1, and index 2 and index 3 in the fruits DataFrame doesn’t have a corresponding match in the veggies DataFrame. So, because you’re performing the default inner join, it just gets dropped and isn’t part of the resulting DataFrame. This might make sense, like I said, if you have, for example, product IDs as your indices, it might make sense. In this case, it doesn’t really, but let’s look at the example where we assume that the index in the veggies DataFrame is maybe actually an amount. Now for this, I will slightly edit the data we’ve been working with.

03:28 So I’ll just paste in here a slightly changed DataFrame for the fruits. So you see that I’m adding an amount column and giving a couple of amounts to each of those items in there.

03:43 So now fruits has three columns and the third one holds an amount. It still has the index column, right? And the veggies DataFrame looks the same as it did before. Now, we could assume that maybe in this dataset here, it means that there’s zero tomatoes, one potato, and two carrots. You know, that’s probably not the greatest index column because there might be repetitive values, but let’s give it a try and merge these two DataFrames on the amount column for the fruits DataFrame and on the index of the veggies DataFrame, just to show you how you can mix and match all of these different ways of combining the data. Again, you can pause the video and try this yourself before you see the solution.

04:27 I’m going to do pd.merge(), pass in fruits, veggies. Now I want to join the left DataFrame on the "amount" column, and on the right DataFrame, I want to use the index.

04:41 So I’m going to say right index=True and left_on, I’m specifying the "amount" column.

04:48 And if I execute this, you get a combined DataFrame that might or might not make sense, but it found the row. In this case, the index row 0 in the veggies DataFrame was matched with index row 2 of the fruits DataFrame, because the amount here was zero, and that’s the column that you specified would be used to perform the merge.

05:13 So in here you see that the tomato here with the amount 0 got matched with the tomato over here from the veggies DataFrame, and you can also pick apart the other ones. Let’s do one more. apple here has an amount of 1, and here you specified in the fruits DataFrame that the amount column would be the one to consider for the merge.

05:35 And here it found a matching column in the potato that has the index 1, which is why the apple and the potato got combined into a new row in this new DataFrame. And again, pandas keeps track of where does the data come from and adds this automatic _x for the left DataFrame and automatic _y for potential duplicate column names from the right DataFrame.

06:01 In this way, you can mix and match. You can define left_on and right_on, you can define right_index and left_index, and pick and choose which columns or even indices in your DataFrames contain the information that you want to perform the merge on.

06:19 So this is very flexible and gives you a lot of opportunities. And depending on your data, some of them might make sense, and others won’t make sense, of course.

06:30 In the next lesson, you will learn how you can modify _x and _y, that default appended strings that pandas creates, and make it a little more descriptive of which DataFrame did it actually come from, or what was the name of the DataFrame, for example.

Become a Member to join the conversation.