Perform an Inner Join Using merge()
In this section of the course, you’ll learn how to use
pd.merge() to combine data in pandas. With
pd.merge(), you’re creating join operations, which are different from plain concatenation.
You’ve used some join logic before when you were working with
concat() in the previous section, where you used first, the default outer join for the second axis—so the one that you weren’t concatenating on—and then later towards the end of that section, you also changed that to use an inner join.
And then you saw that the results were quite different between those two. In this section, you’ll explore more about this join functionality, which is the basis of
And you will start off with an inner join, so similar to what you did on the second axis with
pd.concat(), but now performing it on both of the DataFrames that you’re working with. Later in this section, you’ll learn more about other types of join you can perform using
pd.merge() and also about optional arguments that you can pass to this function.
A quick call to
pd.merge() would look something like this. You open up brackets, and then as the first argument, you pass the first DataFrame that you want to merge. So let’s say
merge(fruits … And then the second argument is going to be the DataFrame.
So that’s the most fundamental call of
pd.merge(). You can see opposed to
pd.concat(), you’re not passing an iterable in here, but you’re really passing as the first argument, the first DataFrame, and as the second argument, the second DataFrame.
01:35 And you will see later that you can think of them as the left DataFrame, which is the first one you input, and the right DataFrame as the second one that you input.
01:45 Now you can pause this lesson for a second and think about what’s your expectation? What’s going to happen when you merge these two DataFrames, as opposed to concatenating them?
01:55 And then before you’re actually going to go ahead and run this together with me, we’re going take a moment and look at this again with a little scribble on the whiteboard.
fruits DataFrame—two columns, four rows—and here’s the
veggies DataFrame—three columns and three rows. Now you might remember that they both share two columns.
You have a
name column and an
image column, and both of those exist also in the other DataFrame—
image—and then you also have a
color column in the vegetable DataFrame.
02:38 And you might remember that somewhere here in the fruits DataFrame, you have a tomato with a nice image of a tomato,
02:47 and also the vegetables DataFrame has a tomato with an image of a tomato, and then also its color mentioned in there. Everything else that is inside of these two DataFrames is different from each other. There’s no other apple in the vegetables DataFrame, there’s no carrot in the fruits DataFrame, et cetera.
03:08 So none of these have a match in the other DataFrame.
So when you perform a default join using
03:24 then pandas picks out the information that is shared across both of the DataFrames. In that case, that will only be the tomato here.
And it doesn’t just pick the name and the image of that tomato, but it finds which rows can be matched together, and it has some pretty good heuristics on figuring out which columns it should look at. So in this case, it’s going to look at both of these columns,
image, and figure out that there’s one match that matches on both of these columns and then output a combination of these two, where nothing has been duplicated, but you just see the information that is present in both and anything extra in any of the two DataFrames.
So your output here is going to be a one-times-three DataFrame. So it has three columns:
color. And the only thing that’s going to be in there is the tomato, the beautiful image of the tomato, and its color, because this is where pandas was able to match:
image, and then this was the extra information that was present in one of the two DataFrames, but not in the other. So let’s look at this when running it with pandas.
04:50 Coming back here, when I execute the cell, I can see that the output is indeed the name tomato, the image of the tomato, and then the color. This is the only thing that a default merge is going to return.
05:05 Let’s take a look at what this default merge does.
05:11 Like I mentioned earlier, it takes the left DataFrame, the right DataFrame, and then how it’s supposed to join these two, and it’s going to take the default of an inner join.
If you remember further up, when you were talking about
concat(), you were able to use
join="inner" to define that it should, on the second axis, it should throw away anything that doesn’t have a match.
And you can think about this inner join here to do something similar. pandas figures out which columns to base the join on. In that case, there will be
image. It can be one column, or it can be more than one column. And it throws away anything that doesn’t have a row corresponding to these two matches, and this case it’s only one row. So you get as a return the row with all the information based on this index match.
In this lesson, you got started working with
pd.merge(), which is a pandas function that allows you to merge DataFrame objects using a database-style join.
pd.merge() with the default parameters, which creates an inner join, and that works by just passing two DataFrames directly as arguments to the
So the difference to
pd.concat() before is that you don’t put them inside of an iterable, but they’re actually the first two arguments,
df2, which was your fruits and veggies DataFrame.
To exemplify this a little more with the function signature, you used implicitly the
how parameter with the default argument of
"inner" to perform an inner join of those two DataFrame objects. In the next lesson, you’ll learn how to change the
how parameter and pass a different argument to perform an outer join.
Become a Member to join the conversation.