Understand the Differences Between Left and Right Joins
In the previous lesson, you’ve seen what happens when you use an outer join on two DataFrames when using the
pd.merge() function. In this lesson, you’ll take a look at using
pd.merge() with yet another argument to the
how parameter, and you’ll use first left, and then try out a right join.
These are somewhat equivalent to SQL left outer joins and right outer joins. So as you might expect, they do something similar to what you’ve seen with
Now let’s take a look in here when I type
pd.merge() passing in, again, left the
fruits DataFrame and right the
veggies DataFrame, and now pass to
how just the string
"left". pandas is going to focus on the left DataFrame, so
fruits in this case, and it’s going to take the key indices from that DataFrame to perform the merge.
This means that the result is somewhat reminiscent of the
fruits DataFrame that you started out with, but it was able to find a match on the tomato again, and therefore added the value of
red, so the only one that had a value for the color column was the tomato.
And so it added the value there and filled the rest with
NaN values. So basically what you get as an output here
is first part of this join, which relates to the keys from your original
fruits DataFrame. These were taken as the keys, and you can see them represented in here as well. Again, pandas performed a join operation, so it merged the two DataFrames.
It didn’t do any concatenation, but it performed a join. It found that inside of the vegetables DataFrame, there was a match on the
tomato row, both the name and the image, and therefore it added the
color row and a value for that one column while filling the ones that didn’t have a match with
NaN values. With this in mind, you might already have a good idea what happens when you do a right join.
02:19 So let’s try that out next.
02:24 You can pause the video for a second and try to figure out what’s going to happen.
02:31 So here’s your result for this join, and against what you might have expected, it’s not the remaining columns down here—it’s still potato and carrot, but something different happened.
02:46 What happened here is that pandas took into account the indices of the right DataFrame—in your case, that’s the vegetables DataFrame—
02:58 and figured out where do you have a match with the other DataFrames and then created a merge operation on the rows that have a match and keep the data based on the indices of the right DataFrame that don’t have a match in the left DataFrame as well.
So what you are getting here looks very similar to the initial `veggies1 DataFrame. This is your
veggies DataFrame, and this is your output that you get when you perform a right outer join with
pd.merge() and passing
03:32 And the reason, in this case, that you essentially just get a copy of that same DataFrame, but it isn’t quite a copy—as you can see, the order is different of the columns, but the information in there is the same—and that’s simply because there is no additional data in the fruits DataFrame if you start off by considering the indices of the vegetables DataFrame.
So if you focus on
image columns of the vegetable DataFrame, you will again get one match that has tomato and the image also in the fruit DataFrame, which means that you will combine both of those.
04:10 But because your focus is on the indices of the veggies DataFrame on the right one, in this case, it means that you’re thinking about this key, this key, and this key.
They’re all composite keys of
image. So the first one is
tomato and the image of the tomato, which gets a match.
The second one is
potato and the image of the potato, which doesn’t get a match, and then
carrot and the image of the carrot, which also doesn’t get a match. But since you’re performing an outer join, the information doesn’t get thrown away, but it’s kept around, but there’s no additional information to add from the fruits DataFrame,
04:53 which is why in this specific case, you essentially end up with a copy of the right DataFrame.
05:00 Keep in mind though that if your vegetables DataFrame looked differently, or if your fruits DataFrame looked differently, this results could be quite different. For example, if you added a column to the fruits DataFrame—so let’s say,
05:30 I guess the lemon should probably not be sweet. Let me fix that. So we’ve got a sour tomato and a sour lemon and a sweet apple and a sweet pear. So there’s some additional information now inside of that fruits DataFrame that wasn’t present before.
And now if I merge these two DataFrames using
pd.merge(fruits, veggies), again using
how="right"—so focusing on the keys of the vegetable DataFrame, you will see that the output is quite different than before, because now you get the
taste column included because there was on one of the match rows, which in this case is only the row that contains information about the tomato, there was also additional information about its taste inside of the fruits DataFrame, and that had to be included. Now, again, you’re only focusing on the indices of the right DataFrame, which is the vegetables DataFrame in this case.
So the other two rows that are going to stay over are about the potato and the carrot. And since they don’t have information about the taste, these again get filled with
So this result is similar to what you’ve seen before when you saw the
color column get added to the fruit DataFrame When you ran left outer join—let me scroll up there—
when you ran this left outer join here, and you got the
NaN values for the
color column that got added from the vegetable DataFrame. So in short, the difference between a left and the right join is that with the left join, pandas uses only the keys from the left frame, and with the right join, it uses only the keys from the right frame. This doesn’t change which key columns are used for the join, but you can define that as well,
07:28 and you’ll learn how to specify the join columns explicitly in the next lesson.
Become a Member to join the conversation.