Perform an Outer Join Using merge()
So over here in a Jupyter notebook, I’m going to type in
pd.merge(), pass in again the left DataFrame
fruits, the right DataFrame
veggies. And now, instead of using the default
"inner" parameter, I’m going to pass
Now, if you run this, on first sight, this might look familiar. You might think, oh, this is a result I’ve seen before, it does just what
concat() does, and it just sticks the two of them together and fills remaining spaces with
NaN values. But that’s not entirely true.
Something is different here, and you can pull up the other operation just to see the difference between the two of them. So if I run
pd.concat() and pass in
veggies and just perform the default outer, the first thing you might see is that there’s a relabeling of the indices, but also if you look at the colors inside of the little images, you might notice that there’s two tomatoes in here.
So if you just concatenate the DataFrames, they get stuck below each other, in this case on the row axis, and everything else gets filled up with
NaN values like you’ve seen in a previous lesson, but what happens up here is slightly different.
So let’s hop back over into a whiteboard session and figure out what happened here. Starting off with the
fruits DataFrame, which is four times two—four rows, and two columns—and then also the
veggies DataFrame, can three times three,
which means that the join operates on this one row. This is where the match happens. This is how pandas knows that it can combine these two DataFrames. Now, the difference is what happens with the rest, because now you pass
"outer" to create an outer join instead of an inner join. pandas decides to keep the rest of the information after performing your join. So same as with the inner join before, these two rows, they get combined.
02:51 They don’t get duplicated, because you’re not concatenating the DataFrames, but you’re merging them. You’re bunching them together, so to say. So your output is going to not discard any of the data that hasn’t been matched, so all of this data will stay around.
But the data where the match happened is going to be combined. So you will have one row containing the tomato that now has additionally also the color. And you will have all the rest of the data persisted. And to make a proper table shape, the rest of the values that don’t have a corresponding value in the color column, because it doesn’t exist over here, is going to get filled with
NaN values. Let me hop to the output, and then you can highlight this.
04:09 So as you might notice, because of the merge, the tomato of both the fruits and the vegetable DataFrames is now in one row. And you have the additional cells for the colors column that don’t have a representing piece of data inside of the fruits DataFrame.
Now you can see the difference to
pd.concat(), where the DataFrames simply got stuck together, and you have a repeating value for tomato that one time comes from within the fruits DataFrame, is the tomato, and one time from the vegetables DataFrame.
04:56 So here you can see the difference between performing a database join and performing a concatenation. The join really combines data and merges it, so you don’t end up with duplicates if you look at the values that are matched between the DataFrames. And in the concatenation, you definitely can end up with duplicates because pandas is not looking at a column to figure out how to match the DataFrames together.
05:38 And when you use an outer join, the matched columns still get combined—you can see that in the example of the tomato—but all the other data from the DataFrames that doesn’t have a match doesn’t get thrown away, but instead remains in the output DataFrame.
And inner and outer join are maybe the most common ones that you might want to use. In the next lesson, you’ll take a look at other arguments that you can pass to the
how parameter aside from
Become a Member to join the conversation.