Specify the Join Columns Explicitly
In previous lessons, you’ve heard me talk about how pandas automatically figures out the intersection between two DataFrames to pick which columns to use to perform the join operation on. And in our case, between the
fruits DataFrame, those were the
Over here in the Jupyter notebook, I’m first going to reset the fruits DataFrame to drop the
taste column that I added for an example in the previous lesson, because currently
fruits looks different than it did originally.
Now what you did previously—you used
pd.merge(fruits, veggies) with the default of allowing pandas to figure out what is the intersection between those two DataFrames, and those are the columns
image, and then pandas performed the merge.
And in this case, an inner merge, dropping all the data that didn’t have a match. Now I can define a different specific column that I want to match on. So to say
pd.merge(fruits, veggies) and then say
on, for example, just the
pandas is not comparing whether they’re the same. And so it needs to do something with the second image data. And what it does here by default is it keeps both of those around. The only match and combination happened on
name, so here you do not have a duplicate value, and the name only appears once.
But both in the X column, the left column, as well as in the right column, you had an
image column, and they both contain some data on that matched row. pandas wants to keep both of those around, so now in your result, you get two images of the tomato with this label added automatically added to the column names that allows you to distinguish where did they come from originally?
So if you take one step back from here and try to figure out what actually happened up in
pd.merge(), you’ll see that the
on parameter, like I mentioned earlier, matched on the
name as well as the
03:04 This is how you get the result where neither of those are duplicated because pandas looked at both of those columns and found matching data in both of them, so there was no need to duplicate it.
on column or index level names to join on these, must be found in both DataFrames. So as you can see,
color isn’t found in both of the DataFrames, which is why pandas throws a
KeyError if you try to pass a column name that doesn’t exist in both of them.
Okay, so this is how you can specify what columns you want to perform the merge on. And as you can see, you have this appendix to the name of the column so that you can keep track of where did the information originally come from, even if it’s a duplicate column name between the two DataFrames that you’ve chosen not to merge on. In this lesson, you’ve learned that pandas by default finds the intersection between two DataFrames—in our case, that’s the
name and the
image column—and then merges in respect to both of those or as many as there are.
05:17 And you can also define a single or a composite list of column names and perform the merge on a single column. If you do that and there’s duplicate column names, then the data gets duplicated as well, and pandas puts some indicator to figure out where did the data come from initially.
Now, if you think about the function signature here and the docstring, it also mentioned that you can join on indices. So in next lesson, you will take a look at how you can join two DataFrames using
pd.merge() on an index column.
Become a Member to join the conversation.