Practice Your Intuition With Some Peculiarities
In the previous lesson, you’ve learned how you can use
suffixes parameter to make the column names in the output of your DataFrame a little more understandable. And in this lesson, you will play around a bit with joins using
pd.merge() and run into a couple of maybe unexpected results of your join operations so you can see what’s possible and maybe what you can look out for.
00:23 So let’s get started in the Jupyter notebook.
For these small play-around examples, I’m going to change the
fruits DataFrame a little bit and add another row to it. So currently,
fruits DataFrame looks like this, that you have an apple, a pear, tomato, and lemon.
And I want to add an orange here as well. And for this, I’m going to use
pd.concat() because all I want to do is actually concatenate a row at the bottom of this DataFrame.
The best way to do this is say
fruits = pd.concat(). And then I want to pass in an iterable that consists first of the current
fruits DataFrame, and then I also want to add a new DataFrame that just has a single row that contains the orange. Let me put this in here.
And then I also want to ignore the index to make sure that it counts from
0 up to
4 and doesn’t start at
0 again. So I can pass the
ignore_index attribute and set it to
If I run this and check fruits again, you can see that you have a new row at the bottom of the
fruits DataFrame that now has ten oranges in it.
Now why did I add this orange? Just to show you how these combinations can work and how a piece of data is just a piece of data, and pandas doesn’t really have an understanding of what it means. Think back to your
You also have something in here that’s called an orange, but here it’s about the color of the carrot. Now you could perform a merge on these two DataFrames when specifying that you want to use the
name column of the
fruits DataFrame and the
color column on the
veggies DataFrame. So give it a try.
You can say
pd.merge(fruits, veggies) and then pause for a second and think about how you would make that happen.
You might remember that you have to pass the
left_on parameter, where you can define the columns that you want to use in the left DataFrame.
In this case, this would be
name. And then you can pass also
right_on. And in this case, you want to use the color.
And when you execute that, you get a new DataFrame with a single row that has a combination of the row that matched here in the
fruits DataFrame, which has the index
4 and has an orange as the name, and then also the color in the
And you get the appropriate
image_x so that you know where they came from.
amount only existed once, so it doesn’t get a suffix.
And then you have the corresponding items from the
veggies DataFrame. Now this doesn’t necessarily make sense, although it might, depending on what you want to do with your data.
I just want to show you that these merge operations are very versatile, and you can do a lot of stuff with it. Now, what happens when you have these two DataFrames, and you use column names that exist on both of them, but don’t have a match? For example, if you define that you want to perform the merge on
fruits, veggies, left_on="image" and
right_on="name", so you’re explicitly specifying which columns you want to use from the two DataFrames, they exist in these DataFrames, and you perform this merge.
You can see that there is no match. There is no row in these two DataFrames where there is the same piece of data in the
image column of the left DataFrame and the
name column of the right DataFrame.
So your output is not going to be an error. It’s just going to be an empty DataFrame. So this is something to keep in mind. And for a final gotcha, I want you to perform another merge operation, where you put together
veggies not with a default inner join, but you want to do it
how="outer". And then pay attention to what happens in the
amount column. So if I run this, you can see the additional fields get filled with
NaN values, as you already have seen in previous lessons, but something also changed in the
04:44 You can now see that there’s a comma space to the numeric values. So you might wonder what happened here. Let’s assign this to a new DataFrame that we can inspect it a little more.
So I’m going to say
pd.merge(), and then
merged.dtypes gives you the information that the amount is now a
float64 data type, while for the
it is an
int64 data type.
You might wonder, why did this change in data type happen? And the reason simply is that there were a couple of
NaN values, and
NaN values in pandas are represented as instances of floats. And because pandas needs to keep a data type for the whole column, it changes the integer values that it can change into floating-type numbers.
And you’ll see that this does not happen when you run
pd.merge() with an inner join,
because in this case, you don’t have any
NaN values, which means that the data types remain the same. And this is like an automatic conversion that pandas has to perform when there’s any
NaN values in a numeric column.
06:07 And this sums up a couple of gotchas. What you’ve seen in this lesson is that you can perform merge operations on pieces of data that look the same to the computer, even though they’re not the same for you as the viewer.
So it’s important that you know your dataset, and you understand why and how you want to join specific tables and what you want to join them on. You’ve seen that if you perform a join using
pd.merge() and there are no matching pieces of data, then you will end up with an empty DataFrame.
And you’ve also seen that if you perform a merge that produces
NaN values in a numeric column, this column gets automatically cast to float64, because this is how pandas internally represents
And with this, you are at the end of this course on combining data using pandas
concat() and pandas
merge(). In the next lesson, you’ll see a quick recap of all the content that you’ve covered and all the things that you’ve learned.
Become a Member to join the conversation.