Specify the Join Columns Explicitly
00:00
In previous lessons, you’ve heard me talk about how pandas automatically figures out the intersection between two DataFrames to pick which columns to use to perform the join operation on. And in our case, between the veggies
and fruits
DataFrame, those were the name
and image
columns.
00:17 You can also define specifically which column you want to use to join the DataFrames on. And you learn about how to do this in the current lesson.
00:29
Over here in the Jupyter notebook, I’m first going to reset the fruits DataFrame to drop the taste
column that I added for an example in the previous lesson, because currently fruits
looks different than it did originally.
00:43 And I’m just going to go back to the previous version by saying `fruits.drop(“taste”) on the columns axis.
00:56
And we will need to reassign this to the fruits
DataFrame so that now we’re back to the start with the familiar fruits
DataFrame.
01:07
Now what you did previously—you used pd.merge(fruits, veggies)
with the default of allowing pandas to figure out what is the intersection between those two DataFrames, and those are the columns name
and image
, and then pandas performed the merge.
01:25
And in this case, an inner merge, dropping all the data that didn’t have a match. Now I can define a different specific column that I want to match on. So to say pd.merge(fruits, veggies)
and then say on
, for example, just the "name"
column.
01:45
Then pandas is going to compare the veggies
and the fruits
DataFrame and figure out where is a match in the name
column, and then keep that around.
01:55
You might already think, oh, well it’s still going to be just tomato
and you’re right with this, but since you’re not matching on the image
column, something different is going to happen.
02:04
pandas is not comparing whether they’re the same. And so it needs to do something with the second image data. And what it does here by default is it keeps both of those around. The only match and combination happened on name
, so here you do not have a duplicate value, and the name only appears once.
02:22
But both in the X column, the left column, as well as in the right column, you had an image
column, and they both contain some data on that matched row. pandas wants to keep both of those around, so now in your result, you get two images of the tomato with this label added automatically added to the column names that allows you to distinguish where did they come from originally?
02:48
So if you take one step back from here and try to figure out what actually happened up in pd.merge()
, you’ll see that the on
parameter, like I mentioned earlier, matched on the name
as well as the image
column.
03:04 This is how you get the result where neither of those are duplicated because pandas looked at both of those columns and found matching data in both of them, so there was no need to duplicate it.
03:16
Now you might already know what happens when you do the same thing, but define "image"
. Give it a try.
03:25
And as you can see, now it keeps the information of the name
column of both the left as well as the right DataFrame. So you have duplicate values for the name of the tomato.
03:38
Now you might wonder what happens if you try to merge on a different column, let’s say the color
column. Merge, and you know, you can always give it a try and see what happens,
03:50
on="color"
. Pause the video and think about what you expect to happen. Is it going to be an error or an empty DataFrame or something different?
04:02
And pandas gives you a KeyError
that tells you that color
, the name of the column that you passed, does not exist in both of the DataFrames.
04:12
If you go back into the function signature of pd.merge()
, you will see what that is the case. There’s the docstring:
04:23
on column or index level names to join on these, must be found in both DataFrames. So as you can see, color
isn’t found in both of the DataFrames, which is why pandas throws a KeyError
if you try to pass a column name that doesn’t exist in both of them.
04:41
Okay, so this is how you can specify what columns you want to perform the merge on. And as you can see, you have this appendix to the name of the column so that you can keep track of where did the information originally come from, even if it’s a duplicate column name between the two DataFrames that you’ve chosen not to merge on. In this lesson, you’ve learned that pandas by default finds the intersection between two DataFrames—in our case, that’s the name
and the image
column—and then merges in respect to both of those or as many as there are.
05:17 And you can also define a single or a composite list of column names and perform the merge on a single column. If you do that and there’s duplicate column names, then the data gets duplicated as well, and pandas puts some indicator to figure out where did the data come from initially.
05:38 And you also saw that if you use a column name that is not present in both DataFames, then you will run into an error when you try to perform the merge.
05:51
Now, if you think about the function signature here and the docstring, it also mentioned that you can join on indices. So in next lesson, you will take a look at how you can join two DataFrames using pd.merge()
on an index column.
Become a Member to join the conversation.