Perform an Inner Join Using merge()
You’ve used some join logic before when you were working with
concat() in the previous section, where you used first, the default outer join for the second axis—so the one that you weren’t concatenating on—and then later towards the end of that section, you also changed that to use an inner join.
And you will start off with an inner join, so similar to what you did on the second axis with
pd.concat(), but now performing it on both of the DataFrames that you’re working with. Later in this section, you’ll learn more about other types of join you can perform using
pd.merge() and also about optional arguments that you can pass to this function.
A quick call to
pd.merge() would look something like this. You open up brackets, and then as the first argument, you pass the first DataFrame that you want to merge. So let’s say
merge(fruits … And then the second argument is going to be the DataFrame.
So that’s the most fundamental call of
pd.merge(). You can see opposed to
pd.concat(), you’re not passing an iterable in here, but you’re really passing as the first argument, the first DataFrame, and as the second argument, the second DataFrame.
02:47 and also the vegetables DataFrame has a tomato with an image of a tomato, and then also its color mentioned in there. Everything else that is inside of these two DataFrames is different from each other. There’s no other apple in the vegetables DataFrame, there’s no carrot in the fruits DataFrame, et cetera.
And it doesn’t just pick the name and the image of that tomato, but it finds which rows can be matched together, and it has some pretty good heuristics on figuring out which columns it should look at. So in this case, it’s going to look at both of these columns,
image, and figure out that there’s one match that matches on both of these columns and then output a combination of these two, where nothing has been duplicated, but you just see the information that is present in both and anything extra in any of the two DataFrames.
So your output here is going to be a one-times-three DataFrame. So it has three columns:
color. And the only thing that’s going to be in there is the tomato, the beautiful image of the tomato, and its color, because this is where pandas was able to match:
image, and then this was the extra information that was present in one of the two DataFrames, but not in the other. So let’s look at this when running it with pandas.
04:50 Coming back here, when I execute the cell, I can see that the output is indeed the name tomato, the image of the tomato, and then the color. This is the only thing that a default merge is going to return.
If you remember further up, when you were talking about
concat(), you were able to use
join="inner" to define that it should, on the second axis, it should throw away anything that doesn’t have a match.
And you can think about this inner join here to do something similar. pandas figures out which columns to base the join on. In that case, there will be
image. It can be one column, or it can be more than one column. And it throws away anything that doesn’t have a row corresponding to these two matches, and this case it’s only one row. So you get as a return the row with all the information based on this index match.
To exemplify this a little more with the function signature, you used implicitly the
how parameter with the default argument of
"inner" to perform an inner join of those two DataFrame objects. In the next lesson, you’ll learn how to change the
how parameter and pass a different argument to perform an outer join.
Become a Member to join the conversation.