Combining Multiple Datasets
00:00
DataFrames do not always come from a single source. There are times when you will need to combine multiple data sources to create a DataFrame
. Recall the city_data
DataFrame
from the previous lesson.
00:13
It has a revenue
and employee_count
column. Now create another DataFrame
with the same columns.
00:21
Using the concat()
function in Pandas, these two DataFrames can be combined.
00:28
As you can see, the combined DataFrame
contains the rows for 'New York'
and 'Barcelona'
. Make sure to explicitly set the sort
keyword argument. It is not required, but Pandas recently changed the default value from True
to False
. Until the new versions are widely used, setting the keyword argument explicitly will help avoid confusion.
00:51
Notice that the concat()
function combined the DataFrames using rows. By setting the axis
keyword argument to 1
, you can combine on columns.
01:01
Suppose you have a new DataFrame
with different columns but the same index as the all_city_data
DataFrame
.
01:11
Now you can call concat()
, give it a list of the DataFrames to combine, and set the axis
to 1
to add the new columns to the DataFrame
.
01:23
Notice the NaN
representing the missing values in the DataFrame
. To eliminate those, set the join
keyword argument to 'inner'
. The inner join will only keep rows with indexes in both DataFrames.
01:37
This does not remove the NaN
for 'Toronto'
, because the index for 'Toronto'
is still in both DataFrames. By default, the join is an outer join, which includes all rows.
01:49
The default is also to combine based on the index. Using the merge()
function, you can specify a column to merge on. Take a look at this DataFrame
.
02:00
The countries
DataFrame
uses the country name as the index, but the cities
DataFrame
uses the country name as a column. With the merge()
method, specify the column to merge on with the left_on
keyword argument.
02:16
Notice also the right_index
keyword argument is set to True
. This means that the DataFrame
on the right side, countries
, will be joined on the index.
02:26
The return value includes countries that are present in both the 'country'
column in the cities
DataFrame
and the index of the countries
DataFrame
, and this is an inner join. For those rows in the merged data, the column from the countries
DataFrame
were added.
02:43
The default is an inner join, but the how
keyword argument can also use another join type. Here, the left join includes all rows in the cities
DataFrame
.
02:54
The country data will be added to those in which the index matches, with NaN
for those who don’t. In the next lesson, you’ll push aside the tables and learn how to visualize your data with charts and graphs.
Become a Member to join the conversation.