Revisit Combining Data Using merge()
In this lesson, you’ll do a quick section recap of Combining Data Using pd.merge(). You started off this section by using pd.merge() in the default way, where you just pass in a left DataFrame and a right DataFrame, and this implicitly used the default argument of how="inner" to perform an inner join on the two DataFrames.
Now I’m going to restructure this a bit because you’ll keep adding a bunch of other keyword arguments. So let’s restructure it like this: you have pd.merge() with the left and right DataFrames, and then how with the default "inner". In the next video, you also checked out how you can change this to perform an outer join instead, and what effect that has on the resulting DataFrame.
00:44 And then you explored a couple of other ways of joining two DataFrames by performing a left join and a right join, as well as later in the course, a cross join.
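As a rough sketch of the join types recapped so far, here's how the how argument could be tried out. The two DataFrames below are hypothetical stand-ins, not the ones from the course, and how="cross" assumes pandas 1.2 or later.

```python
import pandas as pd

# Hypothetical stand-in data, not the DataFrames used in the course.
left = pd.DataFrame(
    {"name": ["apple", "banana", "cherry"], "image": ["a.png", "b.png", "c.png"]}
)
right = pd.DataFrame(
    {"name": ["banana", "cherry", "date"], "image": ["b.png", "c.png", "d.png"]}
)

# Default call: equivalent to how="inner", keeps only rows present in both.
inner = pd.merge(left, right)

# Outer join: keeps rows from both sides.
outer = pd.merge(left, right, how="outer")

# Left and right joins: keep every row of one side.
left_join = pd.merge(left, right, how="left")
right_join = pd.merge(left, right, how="right")

# Cross join: every left row paired with every right row (pandas >= 1.2).
cross = pd.merge(left, right, how="cross")

for label, frame in [
    ("inner", inner),
    ("outer", outer),
    ("left", left_join),
    ("right", right_join),
    ("cross", cross),
]:
    print(label, len(frame))
```

Comparing the row counts is a quick way to see which rows each join type keeps.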
Next, you took a look at how you can specify which columns you want to perform the join operation on. By default, this has a value of None, which means that pandas is going to figure out which columns represent the intersection of the two DataFrames and use those for the join. In the case of these two DataFrames, that was an iterable of "name" and "image". Then you also tried to pass them separately, just "name" and just "image", and explored what the difference in the outcome is, which is duplicated columns, because you would have matches on both of these columns.
01:29 So if you only specify one, then the other one is going to be duplicated.
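To make that duplication concrete, here's a minimal sketch using the on parameter. The data is hypothetical; only the "name" and "image" column names come from the lesson.

```python
import pandas as pd

# Hypothetical stand-in data with the two shared columns from the lesson.
left = pd.DataFrame(
    {"name": ["apple", "banana", "cherry"], "image": ["a.png", "b.png", "c.png"]}
)
right = pd.DataFrame(
    {"name": ["banana", "cherry", "date"], "image": ["b.png", "c.png", "d.png"]}
)

# on=None (the default): pandas joins on all columns the two frames share.
default = pd.merge(left, right)

# Passing both shared columns explicitly gives the same result.
both = pd.merge(left, right, on=["name", "image"])
print(default.equals(both))

# Joining on "name" only: "image" exists on both sides, so it appears twice.
by_name = pd.merge(left, right, on="name")
print(list(by_name.columns))
```

With only "name" as the join key, the result carries one "image" column per input DataFrame, each with a suffix attached.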
Then you also took a look at what happens if you pass a column name that only exists in one of the two DataFrames, which leads to a KeyError. And in the next lesson, you took a look at additional keyword parameters that allow you to very flexibly define which columns you want to use for the join operations.
01:50 So you can specify which column or columns to use in the left DataFrame, in the right one, and then you can also choose to use index columns or a combination of index columns as well as named columns.
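Here's a small sketch of both points: the KeyError and the per-side column parameters. The transcript doesn't name those parameters, so left_on and right_on are taken from the pandas API, and the "fruit_name" and "price" columns are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the key column is named differently on each side.
left = pd.DataFrame(
    {"name": ["apple", "banana", "cherry"], "image": ["a.png", "b.png", "c.png"]}
)
right = pd.DataFrame(
    {"fruit_name": ["banana", "cherry", "date"], "price": [0.5, 3.0, 4.0]}
)

# "fruit_name" only exists in the right DataFrame, so on= raises a KeyError.
raised_key_error = False
try:
    pd.merge(left, right, on="fruit_name")
except KeyError as error:
    raised_key_error = True
    print(f"KeyError: {error}")

# left_on / right_on name the join column separately for each side.
matched = pd.merge(left, right, left_on="name", right_on="fruit_name")
print(matched)
```

Both key columns are kept in the result, since pandas has no way of knowing they're meant to be the same thing.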
And you tried that out by using both index columns by setting these values to True. And then you also tried a combination of one index column and a new column you added to the left DataFrame, which was "amount", and that produced a result.
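A sketch of those two index-based merges could look like this. The parameter names left_index and right_index come from the pandas API rather than from the transcript, and the data, including the "amount" values, is invented for illustration.

```python
import pandas as pd

# Hypothetical data; "amount" stands in for the column added in the lesson.
left = pd.DataFrame({"name": ["apple", "banana", "cherry"], "amount": [2, 0, 5]})
right = pd.DataFrame(
    {"name": ["banana", "cherry", "date"], "image": ["b.png", "c.png", "d.png"]}
)

# Join on the index of both DataFrames (here the default 0, 1, 2).
by_index = pd.merge(left, right, left_index=True, right_index=True)
print(by_index)

# Mix a named column on the left with the index on the right:
# rows match where left["amount"] equals a right index label.
mixed = pd.merge(left, right, left_on="amount", right_index=True)
print(mixed)
```

In this made-up data, the amount value 5 has no matching index label on the right, so that row drops out of the default inner join.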
Whether it was useful or not is a different question. Finally, you took a look at the suffixes keyword argument that has a default of a tuple that contains two strings, which are "_x" and "_y", which is what pandas adds to the duplicate column names of the resulting DataFrame.
And you explored how you can change that by, for example, passing in another iterable with two strings that contain the names of the DataFrames that you were merging. Now, pd.merge() has more than just these keyword arguments.
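Going back to suffixes for a moment, here's a minimal sketch of the default and a custom value. The data is hypothetical, and "_left" and "_right" are just example suffixes, not necessarily the ones used in the course.

```python
import pandas as pd

# Hypothetical stand-in data with an overlapping "image" column.
left = pd.DataFrame({"name": ["banana", "cherry"], "image": ["b1.png", "c1.png"]})
right = pd.DataFrame({"name": ["banana", "cherry"], "image": ["b2.png", "c2.png"]})

# Default suffixes: ("_x", "_y") are appended to overlapping column names.
default_suffixes = pd.merge(left, right, on="name")
print(list(default_suffixes.columns))

# Custom suffixes that say which DataFrame each column came from.
custom_suffixes = pd.merge(left, right, on="name", suffixes=("_left", "_right"))
print(list(custom_suffixes.columns))
```

Suffixes only apply to overlapping columns that aren't join keys, so "name" stays unchanged here.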
So I would encourage you to take a look at the docstring of pd.merge() and also explore the other keyword arguments that you can pass in there.
03:00 Read the documentation and figure out what they do with some practical examples.
And that’s it for the section recap on pd.merge(). In the next and final lesson of this course, we’ll do another quick overview, a summary of the whole course, where I’ll also show you a couple of resources that you can go to to learn more about combining data using pandas.