Join us and get access to thousands of tutorials and a community of expert Pythonistas.

Unlock This Lesson

This lesson is for members only. Join us and get access to thousands of tutorials and a community of expert Pythonistas.

Unlock This Lesson

Perform an Inner Join Using merge()

00:00 In this section of the course, you’ll learn how to use pd.merge() to combine data in pandas. With pd.merge(), you’re creating join operations, which are different from plain concatenation.

00:12 You’ve used some join logic before when you were working with concat() in the previous section, where you used first, the default outer join for the second axis—so the one that you weren’t concatenating on—and then later towards the end of that section, you also changed that to use an inner join.

00:29 And then you saw that the results were quite different between those two. In this section, you’ll explore more about this join functionality, which is the basis of pd.merge().

00:41 And you will start off with an inner join, so similar to what you did on the second axis with pd.concat(), but now performing it on both of the DataFrames that you’re working with. Later in this section, you’ll learn more about other types of join you can perform using pd.merge() and also about optional arguments that you can pass to this function.

01:04 A quick call to pd.merge() would look something like this. You open up brackets, and then as the first argument, you pass the first DataFrame that you want to merge. So let’s say merge(fruits And then the second argument is going to be the DataFrame.

01:19 So that’s the most fundamental call of pd.merge(). You can see opposed to pd.concat(), you’re not passing an iterable in here, but you’re really passing as the first argument, the first DataFrame, and as the second argument, the second DataFrame.

01:35 And you will see later that you can think of them as the left DataFrame, which is the first one you input, and the right DataFrame as the second one that you input.

01:45 Now you can pause this lesson for a second and think about what’s your expectation? What’s going to happen when you merge these two DataFrames, as opposed to concatenating them?

01:55 And then before you’re actually going to go ahead and run this together with me, we’re going take a moment and look at this again with a little scribble on the whiteboard.

02:07 Here’s the fruits DataFrame—two columns, four rows—and here’s the veggies DataFrame—three columns and three rows. Now you might remember that they both share two columns.

02:25 You have a name column and an image column, and both of those exist also in the other DataFrame—image imageand then you also have a color column in the vegetable DataFrame.

02:38 And you might remember that somewhere here in the fruits DataFrame, you have a tomato with a nice image of a tomato,

02:47 and also the vegetables DataFrame has a tomato with an image of a tomato, and then also its color mentioned in there. Everything else that is inside of these two DataFrames is different from each other. There’s no other apple in the vegetables DataFrame, there’s no carrot in the fruits DataFrame, et cetera.

03:08 So none of these have a match in the other DataFrame.

03:15 Oops.

03:18 So when you perform a default join using pd.merge()

03:24 then pandas picks out the information that is shared across both of the DataFrames. In that case, that will only be the tomato here.

03:37 And it doesn’t just pick the name and the image of that tomato, but it finds which rows can be matched together, and it has some pretty good heuristics on figuring out which columns it should look at. So in this case, it’s going to look at both of these columns, name and image, and figure out that there’s one match that matches on both of these columns and then output a combination of these two, where nothing has been duplicated, but you just see the information that is present in both and anything extra in any of the two DataFrames.

04:13 So your output here is going to be a one-times-three DataFrame. So it has three columns: name, image, and color. And the only thing that’s going to be in there is the tomato, the beautiful image of the tomato, and its color, because this is where pandas was able to match: name and image, and then this was the extra information that was present in one of the two DataFrames, but not in the other. So let’s look at this when running it with pandas.

04:50 Coming back here, when I execute the cell, I can see that the output is indeed the name tomato, the image of the tomato, and then the color. This is the only thing that a default merge is going to return.

05:05 Let’s take a look at what this default merge does.

05:11 Like I mentioned earlier, it takes the left DataFrame, the right DataFrame, and then how it’s supposed to join these two, and it’s going to take the default of an inner join.

05:21 If you remember further up, when you were talking about concat(), you were able to use join="inner" to define that it should, on the second axis, it should throw away anything that doesn’t have a match.

05:33 And you can think about this inner join here to do something similar. pandas figures out which columns to base the join on. In that case, there will be name and image. It can be one column, or it can be more than one column. And it throws away anything that doesn’t have a row corresponding to these two matches, and this case it’s only one row. So you get as a return the row with all the information based on this index match.

06:04 In this lesson, you got started working with pd.merge(), which is a pandas function that allows you to merge DataFrame objects using a database-style join.

06:15 You used pd.merge() with the default parameters, which creates an inner join, and that works by just passing two DataFrames directly as arguments to the pd.merge() function.

06:27 So the difference to pd.concat() before is that you don’t put them inside of an iterable, but they’re actually the first two arguments, df1 and df2, which was your fruits and veggies DataFrame.

06:40 To exemplify this a little more with the function signature, you used implicitly the how parameter with the default argument of "inner" to perform an inner join of those two DataFrame objects. In the next lesson, you’ll learn how to change the how parameter and pass a different argument to perform an outer join.

Become a Member to join the conversation.