Locked learning resources

Join us and get access to thousands of tutorials and a community of expert Pythonistas.

Unlock This Lesson

Locked learning resources

This lesson is for members only. Join us and get access to thousands of tutorials and a community of expert Pythonistas.

Unlock This Lesson

Perform an Outer Join Using merge()

Martin Breuss

Combining Data in pandas With concat() and merge() Martin Breuss 06:19

Transcript
Discussion

00:00 In the previous lesson, you got started with combining two DataFrames using pd.merge(), and you did it using the default parameter of how="inner".

00:11 In this lesson, you’ll learn how to perform an outer join using the same function, but passing a different argument to how.

00:21 So over here in a Jupyter notebook, I’m going to type in pd.merge(), pass in again the left DataFrame fruits, the right DataFrame veggies. And now, instead of using the default "inner" parameter, I’m going to pass "outer" here.

00:37 Now, if you run this, on first sight, this might look familiar. You might think, oh, this is a result I’ve seen before, it does just what concat() does, and it just sticks the two of them together and fills remaining spaces with NaN values. But that’s not entirely true.

00:55 Something is different here, and you can pull up the other operation just to see the difference between the two of them. So if I run pd.concat() and pass in fruits and veggies and just perform the default outer, the first thing you might see is that there’s a relabeling of the indices, but also if you look at the colors inside of the little images, you might notice that there’s two tomatoes in here.

01:24 So if you just concatenate the DataFrames, they get stuck below each other, in this case on the row axis, and everything else gets filled up with NaN values like you’ve seen in a previous lesson, but what happens up here is slightly different.

01:38 So let’s hop back over into a whiteboard session and figure out what happened here. Starting off with the fruits DataFrame, which is four times two—four rows, and two columns—and then also the veggies DataFrame, can three times three,

02:03 you may remember there’s this match on the tomato, right? That exists in the name and image column of both of these DataFrames. Now we have the color additionally here.

02:16 The tomato exists in both, which is what makes the match possible,

02:22 which means that the join operates on this one row. This is where the match happens. This is how pandas knows that it can combine these two DataFrames. Now, the difference is what happens with the rest, because now you pass "outer" to create an outer join instead of an inner join. pandas decides to keep the rest of the information after performing your join. So same as with the inner join before, these two rows, they get combined.

02:51 They don’t get duplicated, because you’re not concatenating the DataFrames, but you’re merging them. You’re bunching them together, so to say. So your output is going to not discard any of the data that hasn’t been matched, so all of this data will stay around.

03:08 But the data where the match happened is going to be combined. So you will have one row containing the tomato that now has additionally also the color. And you will have all the rest of the data persisted. And to make a proper table shape, the rest of the values that don’t have a corresponding value in the color column, because it doesn’t exist over here, is going to get filled with NaN values. Let me hop to the output, and then you can highlight this.

03:41 You still have here name and image—pear, tomato, lemon—of the original fruits DataFrame. And this gets a little tricky to draw in here, but you will see it works out.

03:55 We have potato and carrot, their images as well as their colors, from the vegetable DataFrame, and then also this data around the tomato that has been merged.

04:09 So as you might notice, because of the merge, the tomato of both the fruits and the vegetable DataFrames is now in one row. And you have the additional cells for the colors column that don’t have a representing piece of data inside of the fruits DataFrame.

04:29 These get filled with NaN values again by pandas.

04:33 Now you can see the difference to pd.concat(), where the DataFrames simply got stuck together, and you have a repeating value for tomato that one time comes from within the fruits DataFrame, is the tomato, and one time from the vegetables DataFrame.

04:56 So here you can see the difference between performing a database join and performing a concatenation. The join really combines data and merges it, so you don’t end up with duplicates if you look at the values that are matched between the DataFrames. And in the concatenation, you definitely can end up with duplicates because pandas is not looking at a column to figure out how to match the DataFrames together.

05:25 So as a quick recap, if you perform an inner join, a row that has matching columns gets combined, and everything that doesn’t have a match gets thrown away. That’s when you use the inner join.

05:38 And when you use an outer join, the matched columns still get combined—you can see that in the example of the tomato—but all the other data from the DataFrames that doesn’t have a match doesn’t get thrown away, but instead remains in the output DataFrame.

05:55 And inner and outer join are maybe the most common ones that you might want to use. In the next lesson, you’ll take a look at other arguments that you can pass to the how parameter aside from "inner" and "outer".

06:09 We’ll take a look at the docstring so you can see them as well and explore what the different types of joins do your data.

Become a Member to join the conversation.