Perform an Inner Join Using merge()

Combining Data in pandas With concat() and merge() Martin Breuss 07:01

00:00 In this section of the course, you’ll learn how to use pd.merge() to combine data in pandas. With pd.merge(), you’re creating join operations, which are different from plain concatenation.

00:12 You’ve used some join logic before when you were working with concat() in the previous section, where you used first, the default outer join for the second axis—so the one that you weren’t concatenating on—and then later towards the end of that section, you also changed that to use an inner join.

00:29 And then you saw that the results were quite different between those two. In this section, you’ll explore more about this join functionality, which is the basis of pd.merge().

00:41 And you will start off with an inner join, so similar to what you did on the second axis with pd.concat(), but now performing it on both of the DataFrames that you’re working with. Later in this section, you’ll learn more about other types of join you can perform using pd.merge() and also about optional arguments that you can pass to this function.

01:04 A quick call to pd.merge() would look something like this. You open up brackets, and then as the first argument, you pass the first DataFrame that you want to merge. So let’s say merge(fruits … And then the second argument is going to be the DataFrame.

01:19 So that’s the most fundamental call of pd.merge(). You can see opposed to pd.concat(), you’re not passing an iterable in here, but you’re really passing as the first argument, the first DataFrame, and as the second argument, the second DataFrame.

01:35 And you will see later that you can think of them as the left DataFrame, which is the first one you input, and the right DataFrame as the second one that you input.

01:45 Now you can pause this lesson for a second and think about what’s your expectation? What’s going to happen when you merge these two DataFrames, as opposed to concatenating them?

01:55 And then before you’re actually going to go ahead and run this together with me, we’re going take a moment and look at this again with a little scribble on the whiteboard.

02:07 Here’s the fruits DataFrame—two columns, four rows—and here’s the veggies DataFrame—three columns and three rows. Now you might remember that they both share two columns.

02:25 You have a name column and an image column, and both of those exist also in the other DataFrame—image image—and then you also have a color column in the vegetable DataFrame.

02:38 And you might remember that somewhere here in the fruits DataFrame, you have a tomato with a nice image of a tomato,

02:47 and also the vegetables DataFrame has a tomato with an image of a tomato, and then also its color mentioned in there. Everything else that is inside of these two DataFrames is different from each other. There’s no other apple in the vegetables DataFrame, there’s no carrot in the fruits DataFrame, et cetera.

03:08 So none of these have a match in the other DataFrame.

03:15 Oops.

03:18 So when you perform a default join using pd.merge()

03:24 then pandas picks out the information that is shared across both of the DataFrames. In that case, that will only be the tomato here.

04:50 Coming back here, when I execute the cell, I can see that the output is indeed the name tomato, the image of the tomato, and then the color. This is the only thing that a default merge is going to return.

05:05 Let’s take a look at what this default merge does.

05:11 Like I mentioned earlier, it takes the left DataFrame, the right DataFrame, and then how it’s supposed to join these two, and it’s going to take the default of an inner join.

05:21 If you remember further up, when you were talking about concat(), you were able to use join="inner" to define that it should, on the second axis, it should throw away anything that doesn’t have a match.

06:04 In this lesson, you got started working with pd.merge(), which is a pandas function that allows you to merge DataFrame objects using a database-style join.

06:15 You used pd.merge() with the default parameters, which creates an inner join, and that works by just passing two DataFrames directly as arguments to the pd.merge() function.

06:27 So the difference to pd.concat() before is that you don’t put them inside of an iterable, but they’re actually the first two arguments, df1 and df2, which was your fruits and veggies DataFrame.

06:40 To exemplify this a little more with the function signature, you used implicitly the how parameter with the default argument of "inner" to perform an inner join of those two DataFrame objects. In the next lesson, you’ll learn how to change the how parameter and pass a different argument to perform an outer join.

Become a Member to join the conversation.