Working With Missing Data When Sorting in pandas
For more information on concepts covered in this lesson, you can check out Using pandas to Make a Gradebook in Python.
00:00 Working With Missing Data When Sorting in pandas. Real-world data often has many imperfections. While pandas has several methods you can use to clean your data before sorting, sometimes it’s nice to see which data is missing while you’re sorting.
00:18 You can do that with the na_position parameter. The subset of the fuel economy data used for this course doesn’t have missing values. To illustrate the use of na_position, first you’ll need to create some missing data. On-screen, you’ll see code that creates a new column based on the existing mpgData column, mapping True where mpgData equals Y and NaN where it doesn’t.
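The on-screen code isn’t reproduced in this transcript. As a minimal sketch of one way to build such a column, the snippet below assumes a tiny hand-made DataFrame (the make values are hypothetical stand-ins for the fuel economy data) and passes a dictionary to .map(), which leaves NaN for any value that has no key in the dictionary.

```python
import pandas as pd

# Hypothetical stand-in for the course's fuel economy subset.
df = pd.DataFrame({
    "make": ["Alfa Romeo", "Ferrari", "Dodge", "Subaru"],
    "mpgData": ["Y", "N", "Y", "N"],
})

# Values with no key in the mapping dict become NaN,
# so "Y" maps to True and everything else is missing.
df["mpgData_"] = df["mpgData"].map({"Y": True})

print(df)
```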
00:55 Now you have a new column named mpgData_ that contains both True and NaN values. You’ll use this column to see what effect na_position has when you use the two sort methods. To find out more about using .map(), check out this Real Python course.
01:15 .sort_values() accepts a parameter named na_position, which helps to organize missing data in the column you’re sorting on. If you sort on a column with missing data, then the rows with the missing values will appear at the end of your DataFrame.
01:29 This happens regardless of whether you’re sorting in ascending or descending order. Here’s what your DataFrame looks like when you sort on the column with missing data.
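To make the default behavior concrete, here is a small sketch. It assumes a hypothetical numeric column named mpg (not part of the course data) so that the difference between ascending and descending order is visible. In both directions, the row with NaN stays at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "mpg" and its values are made up for illustration.
df = pd.DataFrame({
    "model": ["a", "b", "c", "d"],
    "mpg": [19.0, np.nan, 23.0, 9.0],
})

# The default is na_position="last", regardless of sort direction.
ascending_df = df.sort_values(by="mpg")
descending_df = df.sort_values(by="mpg", ascending=False)

print(ascending_df)
print(descending_df)
```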
01:44 To change this behavior and have the missing data appear first in the DataFrame, you can set na_position to "first".
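A short sketch of that call follows. It assumes a small hand-made DataFrame in place of the fuel economy data (the model values are hypothetical) and rebuilds the mpgData_ column described earlier in the lesson.

```python
import pandas as pd

# Hypothetical stand-in for the course's fuel economy subset.
df = pd.DataFrame({
    "model": ["a", "b", "c", "d"],
    "mpgData": ["Y", "N", "Y", "N"],
})
df["mpgData_"] = df["mpgData"].map({"Y": True})

# Rows with NaN in the sort column now come first.
sorted_df = df.sort_values(by="mpgData_", na_position="first")

print(sorted_df)
```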
01:58 Now, any missing data in the columns you sort on will be shown at the top of your DataFrame. The na_position parameter accepts only two values: "last", which is the default, and "first".
02:10 This is most helpful when you’re first starting to analyze your data and are unsure if there are any missing values.
02:18 .sort_index() also accepts na_position. Your DataFrame typically won’t have NaN values as a part of its index, so this parameter is less useful in .sort_index(). However, it’s good to know that if your DataFrame does have NaN in either the row index or a column name, then you can quickly identify this using .sort_index() and na_position.
02:42 By default, this parameter is set to "last", which places NaN values at the end of the sorted result. To change that behavior and have the missing data first in your DataFrame, set na_position to "first".
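As an illustration, the sketch below assumes a hypothetical DataFrame whose row index contains one NaN label (the value column and its contents are made up) and sorts the index both ways.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a NaN label in the row index.
df = pd.DataFrame(
    {"value": [10, 20, 30, 40]},
    index=[2.0, np.nan, 1.0, 3.0],
)

# Default: the NaN label ends up at the bottom.
nan_last = df.sort_index()

# na_position="first" moves the NaN label to the top.
nan_first = df.sort_index(na_position="first")

print(nan_last)
print(nan_first)
```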
02:56 In the next section of the course, you’ll see how you can use sort methods to modify DataFrames.