Specify the Join Columns Explicitly
In previous lessons, you’ve heard me talk about how pandas automatically figures out the intersection between two DataFrames to pick which columns to use to perform the join operation on. And in our case, between the
fruits DataFrame, those were the
00:17 You can also define specifically which column you want to use to join the DataFrames on. And you learn about how to do this in the current lesson.
Over here in the Jupyter notebook, I’m first going to reset the fruits DataFrame to drop the
taste column that I added for an example in the previous lesson, because currently
fruits looks different than it did originally.
00:43 And I’m just going to go back to the previous version by saying `fruits.drop(“taste”) on the columns axis.
And we will need to reassign this to the
fruits DataFrame so that now we’re back to the start with the familiar
Now what you did previously—you used
pd.merge(fruits, veggies) with the default of allowing pandas to figure out what is the intersection between those two DataFrames, and those are the columns
image, and then pandas performed the merge.
And in this case, an inner merge, dropping all the data that didn’t have a match. Now I can define a different specific column that I want to match on. So to say
pd.merge(fruits, veggies) and then say
on, for example, just the
Then pandas is going to compare the
veggies and the
fruits DataFrame and figure out where is a match in the
name column, and then keep that around.
You might already think, oh, well it’s still going to be just
tomato and you’re right with this, but since you’re not matching on the
image column, something different is going to happen.
pandas is not comparing whether they’re the same. And so it needs to do something with the second image data. And what it does here by default is it keeps both of those around. The only match and combination happened on
name, so here you do not have a duplicate value, and the name only appears once.
But both in the X column, the left column, as well as in the right column, you had an
image column, and they both contain some data on that matched row. pandas wants to keep both of those around, so now in your result, you get two images of the tomato with this label added automatically added to the column names that allows you to distinguish where did they come from originally?
So if you take one step back from here and try to figure out what actually happened up in
pd.merge(), you’ll see that the
on parameter, like I mentioned earlier, matched on the
name as well as the
03:04 This is how you get the result where neither of those are duplicated because pandas looked at both of those columns and found matching data in both of them, so there was no need to duplicate it.
Now you might already know what happens when you do the same thing, but define
"image". Give it a try.
And as you can see, now it keeps the information of the
name column of both the left as well as the right DataFrame. So you have duplicate values for the name of the tomato.
Now you might wonder what happens if you try to merge on a different column, let’s say the
color column. Merge, and you know, you can always give it a try and see what happens,
on="color". Pause the video and think about what you expect to happen. Is it going to be an error or an empty DataFrame or something different?
And pandas gives you a
KeyError that tells you that
color, the name of the column that you passed, does not exist in both of the DataFrames.
If you go back into the function signature of
pd.merge(), you will see what that is the case. There’s the docstring:
on column or index level names to join on these, must be found in both DataFrames. So as you can see,
color isn’t found in both of the DataFrames, which is why pandas throws a
KeyError if you try to pass a column name that doesn’t exist in both of them.
Okay, so this is how you can specify what columns you want to perform the merge on. And as you can see, you have this appendix to the name of the column so that you can keep track of where did the information originally come from, even if it’s a duplicate column name between the two DataFrames that you’ve chosen not to merge on. In this lesson, you’ve learned that pandas by default finds the intersection between two DataFrames—in our case, that’s the
name and the
image column—and then merges in respect to both of those or as many as there are.
05:17 And you can also define a single or a composite list of column names and perform the merge on a single column. If you do that and there’s duplicate column names, then the data gets duplicated as well, and pandas puts some indicator to figure out where did the data come from initially.
05:38 And you also saw that if you use a column name that is not present in both DataFames, then you will run into an error when you try to perform the merge.
Now, if you think about the function signature here and the docstring, it also mentioned that you can join on indices. So in next lesson, you will take a look at how you can join two DataFrames using
pd.merge() on an index column.
Become a Member to join the conversation.