Customize Column Suffixes and Perform a Cross Join
00:00
In the previous lesson, you’ve learned how you can use left_on
, right_on
, left_index
, and right_index
to define which columns or index columns you want to pick to perform your merge operations on in either the left or the right DataFrame.
00:17
In those results, you saw often there’s the _x
and _y
that comes up if you have duplicate column names, and it might be a little hard to keep track of what is this _x
and _y
actually about?
00:30
So the pd.merge()
function actually includes also an option to define these suffixes, and it’s aptly named suffixes
.
00:40
I could go in here and take the command from the previous lesson. And instead of keeping it _x
and _y
, I can add another parameter called suffixes
, which takes an iterable, and here I can define what I want to put on the left one, which I’ll say is gonna be _f
for fruits
DataFrame.
01:05
And then I will put _v
for the vegetables DataFrame. Now, if I run the same command, you’ll see that these suffixes here have changed. And now it’s maybe a little easier to read where did the name
column come from, so we know this is from the fruits DataFrame, and this name
and image
come from the vegetables DataFrame.
01:28
And this is helpful for situations where you’re producing really large DataFrames. For example, there is one option in the how
parameter that we haven’t explored yet.
01:39
So if you go back here and say pd.merge()
… let’s look at the function signature again. In the docstring on how
, you might have noticed that you played around with left
, right
, outer
, and inner
, but there’s yet another one called the cross
join.
01:54
And it creates the Cartesian product from both DataFrames, and it preserves the order of the left keys. So this will produce a big DataFrame if you call it even just on these two small DataFrames that you’re working with here. So if I say pd.merge(fruits, veggies)
and then say how="cross"
,
02:16
then it produces the Cartesian product, which is all possible combinations that it could get, and produces a row for each of those. Now, this is pretty big, and for a big DataFrame like that, it can often help to just be specific and, for example, add this suffixes like you did earlier, and just say, this comes from the fruits
DataFrame and the right DataFrame was the veggies
DataFrame.
02:43 And that makes it even a bit bigger. But now you know, clearly and descriptively, where does the data actually come from, and it makes it a bit easier to read. Most of the time that you want to join DataFrames, you’re probably not going to use this Cartesian product, but you want to find these matches and then avoid duplication, which is also why the default is an inner join and just finds all possible intersections of the columns for you.
03:11
And this is how you can get such a different result when you run pd.merge()
with the default arguments, where it really just finds this one column, amidst this combination, versus if you use a Cartesian product with a cross join.
03:29
In this lesson, you explored yet another possible argument to the how
parameter, which is the string "cross"
, which produces a Cartesian product of both DataFrames.
03:39
And you’ve also used the suffixes
argument to define what suffixes should be put on the names of duplicate columns in a resulting DataFrame after a merge.
03:53
In the next lesson of this course, you’ll stick with the knowledge that you’ve gained so far, but you’ll explore a little more with these different joins that you can create using pd.merge()
, just to get a better feeling for what happens and also figure out what are some possible gotchas that you might run into.
Become a Member to join the conversation.