Practice Your Intuition With Some Peculiarities
00:00
In the previous lesson, you’ve learned how you can use suffixes
parameter to make the column names in the output of your DataFrame a little more understandable. And in this lesson, you will play around a bit with joins using pd.merge()
and run into a couple of maybe unexpected results of your join operations so you can see what’s possible and maybe what you can look out for.
00:23 So let’s get started in the Jupyter notebook.
00:28
For these small play-around examples, I’m going to change the fruits
DataFrame a little bit and add another row to it. So currently, fruits
DataFrame looks like this, that you have an apple, a pear, tomato, and lemon.
00:41
And I want to add an orange here as well. And for this, I’m going to use pd.concat()
because all I want to do is actually concatenate a row at the bottom of this DataFrame.
00:52
The best way to do this is say fruits = pd.concat()
. And then I want to pass in an iterable that consists first of the current fruits
DataFrame, and then I also want to add a new DataFrame that just has a single row that contains the orange. Let me put this in here.
01:16
And then I also want to ignore the index to make sure that it counts from 0
up to 4
and doesn’t start at 0
again. So I can pass the ignore_index
attribute and set it to True
.
01:30
If I run this and check fruits again, you can see that you have a new row at the bottom of the fruits
DataFrame that now has ten oranges in it.
01:41
Now why did I add this orange? Just to show you how these combinations can work and how a piece of data is just a piece of data, and pandas doesn’t really have an understanding of what it means. Think back to your veggies
DataFrame.
01:55
You also have something in here that’s called an orange, but here it’s about the color of the carrot. Now you could perform a merge on these two DataFrames when specifying that you want to use the name
column of the fruits
DataFrame and the color
column on the veggies
DataFrame. So give it a try.
02:14
You can say pd.merge(fruits, veggies)
and then pause for a second and think about how you would make that happen.
02:24
You might remember that you have to pass the left_on
parameter, where you can define the columns that you want to use in the left DataFrame.
02:32
In this case, this would be name
. And then you can pass also right_on
. And in this case, you want to use the color.
02:42
And when you execute that, you get a new DataFrame with a single row that has a combination of the row that matched here in the fruits
DataFrame, which has the index 4
and has an orange as the name, and then also the color in the veggies
DataFrame.
03:00
And you get the appropriate name_x
, image_x
so that you know where they came from. amount
only existed once, so it doesn’t get a suffix.
03:08
And then you have the corresponding items from the veggies
DataFrame. Now this doesn’t necessarily make sense, although it might, depending on what you want to do with your data.
03:18
I just want to show you that these merge operations are very versatile, and you can do a lot of stuff with it. Now, what happens when you have these two DataFrames, and you use column names that exist on both of them, but don’t have a match? For example, if you define that you want to perform the merge on fruits, veggies, left_on="image"
and right_on="name"
, so you’re explicitly specifying which columns you want to use from the two DataFrames, they exist in these DataFrames, and you perform this merge.
03:56
You can see that there is no match. There is no row in these two DataFrames where there is the same piece of data in the image
column of the left DataFrame and the name
column of the right DataFrame.
04:08
So your output is not going to be an error. It’s just going to be an empty DataFrame. So this is something to keep in mind. And for a final gotcha, I want you to perform another merge operation, where you put together fruits
and veggies
not with a default inner join, but you want to do it how="outer"
. And then pay attention to what happens in the amount
column. So if I run this, you can see the additional fields get filled with NaN
values, as you already have seen in previous lessons, but something also changed in the amount
column.
04:44 You can now see that there’s a comma space to the numeric values. So you might wonder what happened here. Let’s assign this to a new DataFrame that we can inspect it a little more.
04:56
So I’m going to say merged
calls pd.merge()
, and then merged.dtypes
gives you the information that the amount is now a float64
data type, while for the fruits
DataFrame,
05:14
it is an int64
data type.
05:18
You might wonder, why did this change in data type happen? And the reason simply is that there were a couple of NaN
values, and NaN
values in pandas are represented as instances of floats. And because pandas needs to keep a data type for the whole column, it changes the integer values that it can change into floating-type numbers.
05:42
And you’ll see that this does not happen when you run pd.merge()
with an inner join,
05:50
because in this case, you don’t have any NaN
values, which means that the data types remain the same. And this is like an automatic conversion that pandas has to perform when there’s any NaN
values in a numeric column.
06:07 And this sums up a couple of gotchas. What you’ve seen in this lesson is that you can perform merge operations on pieces of data that look the same to the computer, even though they’re not the same for you as the viewer.
06:20
So it’s important that you know your dataset, and you understand why and how you want to join specific tables and what you want to join them on. You’ve seen that if you perform a join using pd.merge()
and there are no matching pieces of data, then you will end up with an empty DataFrame.
06:39
And you’ve also seen that if you perform a merge that produces NaN
values in a numeric column, this column gets automatically cast to float64, because this is how pandas internally represents NaN
values.
06:55
And with this, you are at the end of this course on combining data using pandas concat()
and pandas merge()
. In the next lesson, you’ll see a quick recap of all the content that you’ve covered and all the things that you’ve learned.
Become a Member to join the conversation.