Getting Started With pandas Sort Methods
For more information on the REPL used in these videos, you can check out bpython and the Real Python tutorial Discover bpython: A Python REPL With IDE-Like Features.
00:00 Getting Started With Pandas Sort Methods. As a quick reminder, a DataFrame is a data structure with labeled axes for both rows and columns. You can sort a DataFrame by row or column value, as well as by row or column index. Both rows and columns have indices, which are numerical representations of where the data is in your DataFrame.
00:27 You can retrieve data from specific rows or columns using the DataFrame’s index locations. By default, index numbers start from zero. You can also manually assign your own index.
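As a rough sketch of the default index and a manually assigned one, the snippet below builds a tiny DataFrame. The three rows are hypothetical sample values made up for illustration, not rows from the course dataset.

```python
import pandas as pd

# Hypothetical sample rows, made up for illustration
df = pd.DataFrame(
    {
        "make": ["Alfa Romeo", "Ferrari", "Dodge"],
        "year": [1985, 1985, 1985],
    }
)

# By default, the row index is numbered from zero
default_labels = list(df.index)

# Retrieve a row by its integer position
first_row = df.iloc[0]

# Manually assign your own row index on a copy
df_custom = df.copy()
df_custom.index = ["a", "b", "c"]
custom_labels = list(df_custom.index)

# Retrieve a row by its new label
row_b = df_custom.loc["b"]
```

Here .iloc looks rows up by position, while .loc looks them up by label, so the two only agree while the default zero-based index is in place.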
00:44 In this course, you’ll be working with fuel economy data compiled by the US Environmental Protection Agency (EPA) on vehicles made between 1984 and 2021.
00:55 The EPA fuel economy dataset is great because it has many different types of information that you can sort on, from textual to numeric data types. The dataset contains eighty-three columns in total.
01:09 To follow along, you’ll need to have the pandas Python library installed.
01:18 The code in this course was executed using Python 3.10 and pandas 1.4. All of the code you see running in a REPL in this course will be run using the bpython interpreter.
01:31 It offers a number of extra features compared to the standard Python REPL, including color-coding of syntax, which makes it easier to see what’s happening on-screen. However, every command you see will run exactly the same in the standard Python REPL, which you will typically access by typing python or python3.
01:51 For analysis purposes, you’ll be looking at miles-per-gallon data on vehicles by make, model, year, and other vehicle attributes. You can specify which columns to read into a DataFrame, and in this course, you’ll only need a subset of the available columns.
02:08 On-screen, you’ll see the commands to read the relevant columns of the fuel economy dataset into a DataFrame and to display the first five rows. First, pandas is imported with the usual alias of pd.
02:25 Next, a list of required column names is defined, which will reduce the number of columns from over eighty to a more manageable ten.
02:45 The next line creates a DataFrame by downloading the CSV data from the selected URL, limiting the size of the DataFrame to the first hundred rows. Note that the fuel economy dataset is around eighteen megabytes.
02:58 Reading the entire dataset into memory could take a minute or two. Limiting the number of rows and columns will help performance, but it will still take a few seconds before the data is downloaded.
03:14 Finally, the .head() method is used to view the first five rows of the DataFrame.
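The steps just described might look roughly like the sketch below. To keep it runnable offline, it reads from a small in-memory CSV rather than the real download, so the csv_text contents, the extra column, and the shortened column list are all hypothetical stand-ins. In the course, pd.read_csv() is given the dataset URL and a list of ten column names instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV file (made-up rows)
csv_text = """id,make,model,year,city08,highway08,extra
1,Alfa Romeo,Spider Veloce 2000,1985,19,25,x
2,Ferrari,Testarossa,1985,9,14,x
3,Dodge,Charger,1985,23,33,x
4,Dodge,B150/B250 Wagon 2WD,1985,10,12,x
5,Subaru,Legacy AWD Turbo,1993,17,23,x
6,Subaru,Loyale,1993,21,24,x
7,Toyota,Corolla,1993,22,29,x
"""

# Only these columns are read; the rest are skipped
column_subset = ["make", "model", "year", "city08", "highway08"]

# usecols limits the columns, nrows limits the rows
# (the course reads the first hundred rows; six are used here)
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=column_subset,
    nrows=6,
)

# View the first five rows
first_five = df.head()
```

Passing usecols and nrows means pandas never builds the unwanted columns or rows in memory, which is where the performance benefit comes from.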
03:24 You can use .sort_values() to sort values in a DataFrame along either axis: columns or rows. Typically you want to sort the rows in a DataFrame by the values of one or more columns.
03:37 The figure seen on-screen shows the result of using .sort_values() to sort the DataFrame’s rows, depending on the values of the highway08 column.
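As a minimal sketch of that kind of sort, the snippet below orders a small DataFrame by its highway08 column. The four rows are hypothetical sample values, not the rows shown in the on-screen figure.

```python
import pandas as pd

# Hypothetical sample rows, made up for illustration
df = pd.DataFrame(
    {
        "make": ["Alfa Romeo", "Ferrari", "Dodge", "Dodge"],
        "highway08": [25, 14, 33, 12],
    }
)

# Sort the rows by the values in the highway08 column (ascending by default)
sorted_df = df.sort_values("highway08")

# Each row keeps its original index label, so the index is now out of order
sorted_labels = list(sorted_df.index)
```

Note that .sort_values() returns a new DataFrame by default and leaves the original unchanged.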
03:46 This is similar to how you would sort data in a spreadsheet using a column. You can use .sort_index() to sort a DataFrame by its row index or column labels.
03:57 The difference from using .sort_values() is that you’re sorting the DataFrame based on its row index or column names, not by the value in these rows or columns.
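To make the contrast concrete, here is a small sketch of .sort_index() on both axes. The rows and the shuffled index labels are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical sample rows with a deliberately shuffled row index
df = pd.DataFrame(
    {
        "make": ["Dodge", "Ferrari", "Alfa Romeo"],
        "highway08": [33, 14, 25],
    },
    index=[2, 0, 1],
)

# Sort the rows by their index labels, ignoring the values in the columns
by_row_index = df.sort_index()

# Sort the columns by their labels instead (axis=1)
by_column_label = df.sort_index(axis=1)
```

In both calls the data values play no part in the ordering. Only the row labels or the column names are compared.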
04:08 The row index of the DataFrame is outlined in blue on-screen. An index isn’t considered a column, and you typically have only a single row index. The row index can be thought of as the row numbers, which start from zero.
04:24 In the next section, you’ll get started working with the DataFrame by looking at how you can sort the DataFrame on a single column.