Combining Multiple Datasets
DataFrames do not always come from a single source. There are times when you will need to combine multiple data sources to create a DataFrame. Recall the DataFrame from the previous lesson.
As you can see, the combined DataFrame contains the rows for 'New York' and 'Barcelona'. Make sure to explicitly set the sort keyword argument. It is not required, but Pandas recently changed its default value to False. Until the new versions are widely used, setting the keyword argument explicitly will help avoid confusion.
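As a minimal sketch of this kind of concatenation, the snippet below stacks two small DataFrames with pd.concat() and sets sort explicitly. The variable names, column names, and values are hypothetical stand-ins, not the lesson's actual data.

```python
import pandas as pd

# Hypothetical data: names, columns, and values are made up for illustration.
cities = pd.DataFrame(
    {"revenue": [4200, 6500], "employee_count": [4, 8]},
    index=["Amsterdam", "Tokyo"],
)
more_cities = pd.DataFrame(
    {"revenue": [7000, 3400], "employee_count": [2, 2]},
    index=["New York", "Barcelona"],
)

# Stack the rows of both DataFrames. sort=False leaves the other axis
# (here, the columns) in its existing order instead of sorting it.
all_cities = pd.concat([cities, more_cities], sort=False)
print(all_cities)
```

Passing sort explicitly means the result does not depend on which default the installed version of Pandas uses.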
Note the NaN values representing the missing values in the DataFrame. To eliminate those, set the join keyword argument to 'inner'. The inner join will only keep rows with indexes in both DataFrames.
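The difference between the default outer join and an inner join might look like the sketch below. It assumes the DataFrames are combined side by side with axis=1, which is what keeping only rows with shared indexes implies. All names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: one city has no matching row in the second DataFrame.
cities = pd.DataFrame(
    {"revenue": [4200, 6500, 7000]},
    index=["Amsterdam", "Tokyo", "New York"],
)
city_countries = pd.DataFrame(
    {"country": ["Netherlands", "Japan"]},
    index=["Amsterdam", "Tokyo"],
)

# Default outer join: every index label is kept, gaps are filled with NaN.
outer = pd.concat([cities, city_countries], axis=1, sort=False)

# Inner join: only index labels present in both DataFrames are kept.
inner = pd.concat([cities, city_countries], axis=1, join="inner")

print(outer)
print(inner)
```

In this sketch the outer result has a NaN country for 'New York', while the inner result drops that row entirely.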
One DataFrame uses the country name as the index, but the other DataFrame uses the country name as a column. With the merge() method, specify the column to merge on with the left_on keyword argument.
The return value includes countries that are present in both the 'country' column in the left DataFrame and the index of the right DataFrame, because this is an inner join. For those rows in the merged data, the columns from the right DataFrame were added.
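A merge of this shape can be sketched as follows. The DataFrames, their columns, and their values are hypothetical. The sketch also assumes right_index=True is how the right DataFrame's index is named as the join key, since the text only mentions left_on.

```python
import pandas as pd

# Hypothetical left DataFrame: the country name is a regular column.
cities = pd.DataFrame(
    {
        "country": ["Netherlands", "Japan", "USA"],
        "revenue": [4200, 6500, 7000],
    },
    index=["Amsterdam", "Tokyo", "New York"],
)

# Hypothetical right DataFrame: the country name is the index.
countries = pd.DataFrame(
    {"continent": ["Europe", "Asia"], "population_millions": [17, 127]},
    index=["Netherlands", "Japan"],
)

# Match the left 'country' column against the right index.
# how="inner" is the default, so the unmatched 'USA' row is dropped.
merged = pd.merge(cities, countries, left_on="country", right_index=True)
print(merged)
```

Only the rows whose country appears in both places survive, and each surviving row gains the continent and population_millions columns.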
The country data will be added to the rows in which the index matches, with NaN for those that don’t. In the next lesson, you’ll push aside the tables and learn how to visualize your data with charts and graphs.
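The NaN-filling behavior described above can be sketched with a left join. This assumes the merge is done with how="left", which the text does not state explicitly. The data is hypothetical.

```python
import pandas as pd

# Hypothetical data: 'USA' has no entry in the countries DataFrame.
cities = pd.DataFrame(
    {
        "country": ["Netherlands", "Japan", "USA"],
        "revenue": [4200, 6500, 7000],
    },
    index=["Amsterdam", "Tokyo", "New York"],
)
countries = pd.DataFrame(
    {"continent": ["Europe", "Asia"]},
    index=["Netherlands", "Japan"],
)

# how="left" keeps every row of the left DataFrame; rows without a
# matching country get NaN in the columns taken from the right DataFrame.
left_merged = pd.merge(
    cities, countries, left_on="country", right_index=True, how="left"
)
print(left_merged)
```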