Practice Your Intuition With Some Peculiarities

Combining Data in pandas With concat() and merge() Martin Breuss 07:10

00:00 In the previous lesson, you’ve learned how you can use suffixes parameter to make the column names in the output of your DataFrame a little more understandable. And in this lesson, you will play around a bit with joins using pd.merge() and run into a couple of maybe unexpected results of your join operations so you can see what’s possible and maybe what you can look out for.

00:23 So let’s get started in the Jupyter notebook.

00:28 For these small play-around examples, I’m going to change the fruits DataFrame a little bit and add another row to it. So currently, fruits DataFrame looks like this, that you have an apple, a pear, tomato, and lemon.

00:41 And I want to add an orange here as well. And for this, I’m going to use pd.concat() because all I want to do is actually concatenate a row at the bottom of this DataFrame.

00:52 The best way to do this is say fruits = pd.concat(). And then I want to pass in an iterable that consists first of the current fruits DataFrame, and then I also want to add a new DataFrame that just has a single row that contains the orange. Let me put this in here.

01:16 And then I also want to ignore the index to make sure that it counts from 0 up to 4 and doesn’t start at 0 again. So I can pass the ignore_index attribute and set it to True.

01:30 If I run this and check fruits again, you can see that you have a new row at the bottom of the fruits DataFrame that now has ten oranges in it.

01:41 Now why did I add this orange? Just to show you how these combinations can work and how a piece of data is just a piece of data, and pandas doesn’t really have an understanding of what it means. Think back to your veggies DataFrame.

01:55 You also have something in here that’s called an orange, but here it’s about the color of the carrot. Now you could perform a merge on these two DataFrames when specifying that you want to use the name column of the fruits DataFrame and the color column on the veggies DataFrame. So give it a try.

02:14 You can say pd.merge(fruits, veggies) and then pause for a second and think about how you would make that happen.

02:24 You might remember that you have to pass the left_on parameter, where you can define the columns that you want to use in the left DataFrame.

02:32 In this case, this would be name. And then you can pass also right_on. And in this case, you want to use the color.

02:42 And when you execute that, you get a new DataFrame with a single row that has a combination of the row that matched here in the fruits DataFrame, which has the index 4 and has an orange as the name, and then also the color in the veggies DataFrame.

03:00 And you get the appropriate name_x, image_x so that you know where they came from. amount only existed once, so it doesn’t get a suffix.

03:08 And then you have the corresponding items from the veggies DataFrame. Now this doesn’t necessarily make sense, although it might, depending on what you want to do with your data.

03:56 You can see that there is no match. There is no row in these two DataFrames where there is the same piece of data in the image column of the left DataFrame and the name column of the right DataFrame.

04:44 You can now see that there’s a comma space to the numeric values. So you might wonder what happened here. Let’s assign this to a new DataFrame that we can inspect it a little more.

04:56 So I’m going to say merged calls pd.merge(), and then merged.dtypes gives you the information that the amount is now a float64 data type, while for the fruits DataFrame,

05:14 it is an int64 data type.

05:18 You might wonder, why did this change in data type happen? And the reason simply is that there were a couple of NaN values, and NaN values in pandas are represented as instances of floats. And because pandas needs to keep a data type for the whole column, it changes the integer values that it can change into floating-type numbers.

05:42 And you’ll see that this does not happen when you run pd.merge() with an inner join,

05:50 because in this case, you don’t have any NaN values, which means that the data types remain the same. And this is like an automatic conversion that pandas has to perform when there’s any NaN values in a numeric column.

06:07 And this sums up a couple of gotchas. What you’ve seen in this lesson is that you can perform merge operations on pieces of data that look the same to the computer, even though they’re not the same for you as the viewer.

06:20 So it’s important that you know your dataset, and you understand why and how you want to join specific tables and what you want to join them on. You’ve seen that if you perform a join using pd.merge() and there are no matching pieces of data, then you will end up with an empty DataFrame.

06:39 And you’ve also seen that if you perform a merge that produces NaN values in a numeric column, this column gets automatically cast to float64, because this is how pandas internally represents NaN values.

06:55 And with this, you are at the end of this course on combining data using pandas concat() and pandas merge(). In the next lesson, you’ll see a quick recap of all the content that you’ve covered and all the things that you’ve learned.

Become a Member to join the conversation.