Combining Multiple Datasets
DataFrames do not always come from a single source. There are times when you will need to combine multiple data sources to create a
DataFrame. Recall the
DataFrame from the previous lesson.
It has a
employee_count column. Now create another
DataFrame with the same columns.
concat() function in Pandas, these two DataFrames can be combined.
As you can see, the combined
DataFrame contains the rows for
'New York' and
'Barcelona'. Make sure to explicitly set the
sort keyword argument. It is not required, but Pandas recently changed the default value from
False. Until the new versions are widely used, setting the keyword argument explicitly will help avoid confusion.
Notice that the
concat() function combined the DataFrames using rows. By setting the
axis keyword argument to
1, you can combine on columns.
Suppose you have a new
DataFrame with different columns but the same index as the
Now you can call
concat(), give it a list of the DataFrames to combine, and set the
1 to add the new columns to the
NaN representing the missing values in the
DataFrame. To eliminate those, set the
join keyword argument to
'inner'. The inner join will only keep rows with indexes in both DataFrames.
This does not remove the
'Toronto', because the index for
'Toronto' is still in both DataFrames. By default, the join is an outer join, which includes all rows.
The default is also to combine based on the index. Using the
merge() function, you can specify a column to merge on. Take a look at this
DataFrame uses the country name as the index, but the
DataFrame uses the country name as a column. With the
merge() method, specify the column to merge on with the
left_on keyword argument.
Notice also the
right_index keyword argument is set to
True. This means that the
DataFrame on the right side,
countries, will be joined on the index.
The return value includes countries that are present in both the
'country' column in the
DataFrame and the index of the
DataFrame, and this is an inner join. For those rows in the merged data, the column from the
DataFrame were added.
The default is an inner join, but the
how keyword argument can also use another join type. Here, the left join includes all rows in the
The country data will be added to those in which the index matches, with
NaN for those who don’t. In the next lesson, you’ll push aside the tables and learn how to visualize your data with charts and graphs.
Become a Member to join the conversation.