Customize Column Suffixes and Perform a Cross Join
In the previous lesson, you’ve learned how you can use
right_index to define which columns or index columns you want to pick to perform your merge operations on in either the left or the right DataFrame.
In those results, you saw often there’s the
_y that comes up if you have duplicate column names, and it might be a little hard to keep track of what is this
_y actually about?
pd.merge() function actually includes also an option to define these suffixes, and it’s aptly named
I could go in here and take the command from the previous lesson. And instead of keeping it
_y, I can add another parameter called
suffixes, which takes an iterable, and here I can define what I want to put on the left one, which I’ll say is gonna be
And then I will put
_v for the vegetables DataFrame. Now, if I run the same command, you’ll see that these suffixes here have changed. And now it’s maybe a little easier to read where did the
name column come from, so we know this is from the fruits DataFrame, and this
image come from the vegetables DataFrame.
And this is helpful for situations where you’re producing really large DataFrames. For example, there is one option in the
how parameter that we haven’t explored yet.
So if you go back here and say
pd.merge() … let’s look at the function signature again. In the docstring on
how, you might have noticed that you played around with
inner, but there’s yet another one called the
And it creates the Cartesian product from both DataFrames, and it preserves the order of the left keys. So this will produce a big DataFrame if you call it even just on these two small DataFrames that you’re working with here. So if I say
pd.merge(fruits, veggies) and then say
then it produces the Cartesian product, which is all possible combinations that it could get, and produces a row for each of those. Now, this is pretty big, and for a big DataFrame like that, it can often help to just be specific and, for example, add this suffixes like you did earlier, and just say, this comes from the
fruits DataFrame and the right DataFrame was the
02:43 And that makes it even a bit bigger. But now you know, clearly and descriptively, where does the data actually come from, and it makes it a bit easier to read. Most of the time that you want to join DataFrames, you’re probably not going to use this Cartesian product, but you want to find these matches and then avoid duplication, which is also why the default is an inner join and just finds all possible intersections of the columns for you.
And this is how you can get such a different result when you run
pd.merge() with the default arguments, where it really just finds this one column, amidst this combination, versus if you use a Cartesian product with a cross join.
In this lesson, you explored yet another possible argument to the
how parameter, which is the string
"cross", which produces a Cartesian product of both DataFrames.
And you’ve also used the
suffixes argument to define what suffixes should be put on the names of duplicate columns in a resulting DataFrame after a merge.
In the next lesson of this course, you’ll stick with the knowledge that you’ve gained so far, but you’ll explore a little more with these different joins that you can create using
pd.merge(), just to get a better feeling for what happens and also figure out what are some possible gotchas that you might run into.
Become a Member to join the conversation.