Join us and get access to thousands of tutorials and a community of expert Pythonistas.

Unlock This Lesson

This lesson is for members only. Join us and get access to thousands of tutorials and a community of expert Pythonistas.

Unlock This Lesson

Hint: You can adjust the default video playback speed in your account settings.
Hint: You can set your subtitle preferences in your account settings.
Sorry! Looks like there’s an issue with video playback 🙁 This might be due to a temporary outage or because of a configuration issue with your browser. Please refer to our video player troubleshooting guide for assistance.

Combining Multiple Datasets

00:00 DataFrames do not always come from a single source. There are times when you will need to combine multiple data sources to create a DataFrame. Recall the city_data DataFrame from the previous lesson.

00:13 It has a revenue and employee_count column. Now create another DataFrame with the same columns.

00:21 Using the concat() function in Pandas, these two DataFrames can be combined.

00:28 As you can see, the combined DataFrame contains the rows for 'New York' and 'Barcelona'. Make sure to explicitly set the sort keyword argument. It is not required, but Pandas recently changed the default value from True to False. Until the new versions are widely used, setting the keyword argument explicitly will help avoid confusion.

00:51 Notice that the concat() function combined the DataFrames using rows. By setting the axis keyword argument to 1, you can combine on columns.

01:01 Suppose you have a new DataFrame with different columns but the same index as the all_city_data DataFrame.

01:11 Now you can call concat(), give it a list of the DataFrames to combine, and set the axis to 1 to add the new columns to the DataFrame.

01:23 Notice the NaN representing the missing values in the DataFrame. To eliminate those, set the join keyword argument to 'inner'. The inner join will only keep rows with indexes in both DataFrames.

01:37 This does not remove the NaN for 'Toronto', because the index for 'Toronto' is still in both DataFrames. By default, the join is an outer join, which includes all rows.

01:49 The default is also to combine based on the index. Using the merge() function, you can specify a column to merge on. Take a look at this DataFrame.

02:00 The countries DataFrame uses the country name as the index, but the cities DataFrame uses the country name as a column. With the merge() method, specify the column to merge on with the left_on keyword argument.

02:16 Notice also the right_index keyword argument is set to True. This means that the DataFrame on the right side, countries, will be joined on the index.

02:26 The return value includes countries that are present in both the 'country' column in the cities DataFrame and the index of the countries DataFrame, and this is an inner join. For those rows in the merged data, the column from the countries DataFrame were added.

02:43 The default is an inner join, but the how keyword argument can also use another join type. Here, the left join includes all rows in the cities DataFrame.

02:54 The country data will be added to those in which the index matches, with NaN for those who don’t. In the next lesson, you’ll push aside the tables and learn how to visualize your data with charts and graphs.

Become a Member to join the conversation.