In this tutorial, you’ll learn how to reset a pandas DataFrame index, the reasons why you might want to do this, and the problems that could occur if you don’t.
Before you start your learning journey, you should familiarize yourself with how to create a pandas DataFrame. Knowing the difference between a DataFrame and a pandas Series will also prove useful to you.
In addition, you may want to use the data analysis tool Jupyter Notebook as you work through the examples in this tutorial. Alternatively, JupyterLab will give you an enhanced notebook experience, but feel free to use any Python environment you wish.
As a starting point, you’ll need some data. To begin with, you’ll use the band_members.csv file included in the downloadable materials that you can access by clicking the link below:
Get Your Code: Click here to download the free sample code you’ll use to learn how to reset a pandas DataFrame index.
The table below describes the data from band_members.csv that you’ll begin with:
Column Name | PyArrow Data Type | Description
---|---|---
first_name | string | First name of member
last_name | string | Last name of member
instrument | string | Main instrument played
date_of_birth | string | Member’s date of birth
As you’ll see, the data contains details of the members of the rock band The Beach Boys. Each row holds information about one member, past or present.
Note: In case you’ve never heard of The Beach Boys, they’re an American rock band formed in the early 1960s.
Throughout this tutorial, you’ll be using the pandas library to allow you to work with DataFrames, as well as the newer PyArrow library. The PyArrow library provides pandas with its own optimized data types, which are faster and less memory-intensive than the traditional NumPy types that pandas uses by default.
If you’re working at the command line, you can install both pandas and pyarrow using the single command python -m pip install pandas pyarrow. If you’re working in a Jupyter Notebook, you should use !python -m pip install pandas pyarrow. Regardless, you should do this within a virtual environment to avoid clashes with the libraries you use in your global environment.
Once you have the libraries in place, it’s time to read your data into a DataFrame:
>>> import pandas as pd
>>> beach_boys = pd.read_csv(
... "band_members.csv"
... ).convert_dtypes(dtype_backend="pyarrow")
First, you used import pandas to make the library available within your code. To construct the DataFrame and read it into the beach_boys variable, you used pandas’ read_csv() function, passing band_members.csv as the file to read. Finally, by passing dtype_backend="pyarrow" to .convert_dtypes(), you converted all columns to pyarrow types.
If you want to verify that pyarrow data types are indeed being used, then beach_boys.dtypes will satisfy your curiosity:
>>> beach_boys.dtypes
first_name string[pyarrow]
last_name string[pyarrow]
instrument string[pyarrow]
date_of_birth string[pyarrow]
dtype: object
As you can see, each data type contains [pyarrow] in its name.
If you wanted to analyze the date information thoroughly, then you would parse the date_of_birth column to make sure the dates are read as a suitable pyarrow date type. This would allow you to analyze by specific days, months, or years, and so on, as is common in pivot tables.
The date_of_birth column isn’t analyzed in this tutorial, so the string data type it’s being read as will do. Later on, you’ll get the chance to hone your skills with some exercises. The solutions include the date parsing code if you want to see how it’s done.
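If you’re curious what that parsing might look like, here’s a minimal sketch. It assumes the dates keep the day-month-year format shown above, and the pyarrow timestamp type chosen here is just one possible option:

import pandas as pd

beach_boys = pd.read_csv(
    "band_members.csv"
).convert_dtypes(dtype_backend="pyarrow")

# Parse strings like "20-Jun-1942" into datetime values, then store them
# using a pyarrow-backed timestamp type.
beach_boys["date_of_birth"] = pd.to_datetime(
    beach_boys["date_of_birth"], format="%d-%b-%Y"
).astype("timestamp[s][pyarrow]")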
Now that the file has been loaded into a DataFrame, you’ll probably want to take a look at it:
>>> beach_boys
first_name last_name instrument date_of_birth
0 Brian Wilson Bass 20-Jun-1942
1 Mike Love Saxophone 15-Mar-1941
2 Al Jardine Guitar 03-Sep-1942
3 Bruce Johnston Bass 27-Jun-1942
4 Carl Wilson Guitar 21-Dec-1946
5 Dennis Wilson Drums 04-Dec-1944
6 David Marks Guitar 22-Aug-1948
7 Ricky Fataar Drums 05-Sep-1952
8 Blondie Chaplin Guitar 07-Jul-1951
DataFrames are two-dimensional data structures similar to spreadsheets or database tables. A pandas DataFrame can be considered a set of columns, with each column being a pandas Series. Each column also has a heading, which is the name property of the Series, and each row has a label, which is an element of the DataFrame’s index object.
The DataFrame’s index is shown to the left of the DataFrame. It’s not part of the original band_members.csv source file, but is added as part of the DataFrame creation process. It’s this index object you’re learning to reset.
The index of a DataFrame is an additional column of labels that helps you identify rows. When used in combination with column headings, it allows you to access specific data within your DataFrame. The default index labels are a sequence of integers, but you can use strings to make them more meaningful. You can actually use any hashable type for your index, but integers, strings, and timestamps are the most common.
Note: Although indexes are certainly useful in pandas, an alternative to pandas is the new high-performance Polars library, which eliminates them in favor of row numbers. This may come as a surprise, but aside from being used for selecting rows or columns, indexes aren’t often used when analyzing DataFrames. Also, row numbers always remain sequential when rows are added or removed in a Polars DataFrame. This isn’t the case with indexes in pandas.
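If you want to peek at the index object itself, here’s a quick sketch using the beach_boys DataFrame from above:

# The default index is a RangeIndex of integer labels.
print(beach_boys.index)  # RangeIndex(start=0, stop=9, step=1)

# Combine a row label with a column heading to access a single value.
print(beach_boys.at[0, "first_name"])  # Brian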
With these preliminaries out of the way, it’s time to take your Little Deuce Coupe to the beach, do a pop-up, and surf through some learning. You’ll now investigate the main ways of re-indexing a DataFrame. You could apply an Index object directly to the DataFrame’s .index property, or use the DataFrame’s .set_axis() method. To begin with, you’ll use the DataFrame’s .reset_index() method.
How to Reset a pandas DataFrame Index With .reset_index()
The DataFrame.reset_index() method, as its name suggests, is used to reset the index of a pandas DataFrame back to its default. You can also use this method to reset the MultiIndex on your DataFrame if it has one. You’ll learn more about how to reset multi-indexes later.
Suppose you’ve been performing some data cleansing on your DataFrame and have decided to sort it alphabetically by first_name. You can do this using .sort_values():
>>> beach_boys.sort_values(by="first_name")
first_name last_name instrument date_of_birth
2 Al Jardine Guitar 03-Sep-1942
8 Blondie Chaplin Guitar 07-Jul-1951
0 Brian Wilson Bass 20-Jun-1942
3 Bruce Johnston Bass 27-Jun-1942
4 Carl Wilson Guitar 21-Dec-1946
6 David Marks Guitar 22-Aug-1948
5 Dennis Wilson Drums 04-Dec-1944
1 Mike Love Saxophone 15-Mar-1941
7 Ricky Fataar Drums 05-Sep-1952
While you’re happy that the sort has worked perfectly, you’re not happy with the current state of its index. The index no longer bears any resemblance to the updated order of the rows.
To fix this, your first thought might be to use .reset_index() with its default parameters, but the result might not be what you want:
>>> beach_boys.sort_values(by="first_name").reset_index()
index first_name last_name instrument date_of_birth
0 2 Al Jardine Guitar 03-Sep-1942
1 8 Blondie Chaplin Guitar 07-Jul-1951
2 0 Brian Wilson Bass 20-Jun-1942
3 3 Bruce Johnston Bass 27-Jun-1942
4 4 Carl Wilson Guitar 21-Dec-1946
5 6 David Marks Guitar 22-Aug-1948
6 5 Dennis Wilson Drums 04-Dec-1944
7 1 Mike Love Saxophone 15-Mar-1941
8 7 Ricky Fataar Drums 05-Sep-1952
You now have a new default index, which is a sequential series of numbers starting at zero that clearly defines the row order. However, the original index has remained and has been moved to a new column that’s confusingly named index. This may be acceptable if you need to return the DataFrame to its original sort order in the future, but more often than not, its inclusion just wastes memory. You’ll usually want to get rid of it.
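If you’re curious how much memory that leftover column costs, DataFrame.memory_usage() can give you a rough idea. A small sketch:

# Keep the old index as a column, then report the bytes used per column.
kept = beach_boys.sort_values(by="first_name").reset_index()
print(kept.memory_usage())  # the extra "index" column shows up with its own byte count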
Fortunately, this is easy to do. By default, .reset_index() makes a copy of the original DataFrame, applies the default index to that copy, and returns the copy to you. This means that the index on the original DataFrame remains unchanged, so no harm has been done to it.
You can instruct .reset_index() to drop the original index completely and replace it with a fresh default one by setting its drop parameter to True. You can also re-assign the original reference to the new DataFrame. The updated version will then be the only one that exists from this point forward:
>>> beach_boys = (
... beach_boys.sort_values(by="first_name")
... .reset_index(drop=True)
... )
>>> beach_boys
first_name last_name instrument date_of_birth
0 Al Jardine Guitar 03-Sep-1942
1 Blondie Chaplin Guitar 07-Jul-1951
2 Brian Wilson Bass 20-Jun-1942
3 Bruce Johnston Bass 27-Jun-1942
4 Carl Wilson Guitar 21-Dec-1946
5 David Marks Guitar 22-Aug-1948
6 Dennis Wilson Drums 04-Dec-1944
7 Mike Love Saxophone 15-Mar-1941
8 Ricky Fataar Drums 05-Sep-1952
You’ve now tidied up your DataFrame the way you want it. You’ve also re-assigned the original reference to the new DataFrame, meaning that irritating index column has been sunk without a trace.
Note: The original DataFrame remains unchanged because .reset_index(), by default, returns a copy of the original DataFrame. You can set the inplace parameter to True to change the original DataFrame.
Over time, the developers of pandas have designed most methods to return DataFrame copies by default so that any incorrect analysis could easily be redone correctly. If each method updated the DataFrame in place, any errors would need a rerun of all earlier processing.
One common misconception is that setting inplace=True saves processing time. This generally isn’t accurate. Under the hood, pandas often still creates and works on a copy of the DataFrame, and then assigns the re-indexed copy to the original DataFrame’s variable for you.
It’s recommended not to use this parameter as it’s scheduled for deprecation in a future pandas release. The reason for the deprecation is that setting inplace to True changes the underlying data in your original DataFrame. Any code that references your original DataFrame from elsewhere in your program will reference the changed data, which may introduce difficult-to-find bugs.
Setting inplace=True also makes method chaining impossible. Method chaining is where you apply successive methods to a DataFrame, with each one working on the result of the previous call, instead of creating lots of intermediate DataFrames. But chaining only works if the methods used in the chain return a DataFrame.
If you apply a DataFrame method to the result of .reset_index() when inplace has been set to True, then there would be no DataFrame returned to apply it to, and your code would break.
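To make the difference concrete, here’s a small sketch. The chained version works because each call returns a DataFrame, while the inplace version returns None and breaks the chain:

# Chaining: each method returns a new DataFrame for the next call to work on.
tidy = beach_boys.sort_values(by="first_name").reset_index(drop=True)

# With inplace=True, .reset_index() returns None, so tacking another method
# call onto the chain fails:
# beach_boys.sort_values(by="first_name").reset_index(drop=True, inplace=True).head()
# AttributeError: 'NoneType' object has no attribute 'head'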
Now it’s time for you to have some fun fun fun and try the following challenge:
You’ve already seen how it’s possible to reset an index while retaining the old one. When you do this, pandas adds it as a new column with the title index. Take a look at the documentation for .reset_index() and see if you can figure out how to customize this new column’s name.
Now re-read the data from band_members.csv into a fresh DataFrame, and use beach_boys.index = range(1, 10) to update its index. Finally, use .reset_index() to see if you can customize your DataFrame and make it look like this:
>>> beach_boys
old_index first_name last_name instrument date_of_birth
0 1 Brian Wilson Bass 20-Jun-1942
1 2 Mike Love Saxophone 15-Mar-1941
2 3 Al Jardine Guitar 03-Sep-1942
3 4 Bruce Johnston Bass 27-Jun-1942
4 5 Carl Wilson Guitar 21-Dec-1946
5 6 Dennis Wilson Drums 04-Dec-1944
6 7 David Marks Guitar 22-Aug-1948
7 8 Ricky Fataar Drums 05-Sep-1952
8 9 Blondie Chaplin Guitar 07-Jul-1951
You’re aiming to make sure the existing index isn’t only retained, but renamed as well.
You’ll find a solution in the Solutions.ipynb notebook provided in the downloadable materials.
Although the .reset_index() method you’ve just learned about is the most customizable way to reset a pandas DataFrame index, allowing you to handle multi-indexes, it’s certainly not the only way. It’s possible to do it the gremmie way by using first principles.
Reset an Index Directly With .index
When working with DataFrames, you can identify rows using .loc[] or .iloc[]. Each has its use case, although either can be used on any DataFrame.
A DataFrame’s index is referenced using its .index property. To replace the current index with one of your own, you assign .index an iterable containing your new index labels. This gives you lots of scope to customize your DataFrame indexes beyond the default incrementing numbers.
Before you explore .loc[] and .iloc[], you’ll add a custom index to your DataFrame that contains the initials of the Beach Boys band members. First, you’ll read the band members’ data from a CSV file and convert it to pyarrow data types:
>>> beach_boys = pd.read_csv(
... "band_members.csv"
... ).convert_dtypes(dtype_backend="pyarrow")
>>> initials = ["BW", "ML", "AJ", "BJ", "CW", "DW", "DM", "RF", "BC"]
>>> beach_boys.index = initials
>>> beach_boys
first_name last_name instrument date_of_birth
BW Brian Wilson Bass 20-Jun-1942
ML Mike Love Saxophone 15-Mar-1941
AJ Al Jardine Guitar 03-Sep-1942
BJ Bruce Johnston Bass 27-Jun-1942
CW Carl Wilson Guitar 21-Dec-1946
DW Dennis Wilson Drums 04-Dec-1944
DM David Marks Guitar 22-Aug-1948
RF Ricky Fataar Drums 05-Sep-1952
BC Blondie Chaplin Guitar 07-Jul-1951
You’ll take a look at an alternative way of resetting an index in a short while, but first, you’ll dive into row selection.
Select Rows Using .loc[] and .iloc[]
The .loc[] attribute allows you to index rows by their index label. This is particularly useful when the underlying index labels have intrinsic meaning within the DataFrame, such as usernames or timestamps.
If you slice with .loc[], it uses a closed interval, meaning that both the start and end index values you specify will appear in the output. So by selecting a slice of rows by username, your output will show the first one, the last one, and everything in between.
The .iloc[] attribute allows you to find rows based on their index position, regardless of the index label’s value. When you use .iloc[], you use zero-based indexing. In other words, the first row is at location 0, not 1 as you might expect. .iloc[] uses a half-open interval, which means that while the row at the start position you specify will appear in the output, the row at the end position won’t.
While these concepts will be very familiar to you if you’ve ever sliced a string or list, it’s worth emphasizing that the row at the end position of an .iloc[] slice won’t actually be selected.
Of course, it doesn’t matter whether you use .iloc[] or .loc[] if you’re dealing with default indexes, because these also use zero-based sequences, so their labels and positions match. However, if your index contains strings or a non-default numeric sequence, the parameters passed to .iloc[] will not resemble the actual index value of the row you’re accessing, making your code more difficult to read.
Note: While you can use both .iloc[] and .loc[] to select rows, it’s generally preferable to use .loc[] if your index has explicit meaning. Otherwise, you might be more comfortable using .iloc[] because of its similarity to other areas of Python. If you’re using default indexes, it won’t matter which you choose.
A good way to remember the difference between them is to mentally associate the initial letter l of .loc[] with the word label. This will remind you that .loc[] uses index label values.
Now that your DataFrame has been updated with a fresh index, you can use it to select some rows using both .loc[] and .iloc[]:
1>>> beach_boys = pd.read_csv(
2... "band_members.csv"
3... ).convert_dtypes(dtype_backend="pyarrow")
4
5>>> initials = ["BW", "ML", "AJ", "BJ", "CW", "DW", "DM", "RF", "BC"]
6>>> beach_boys.index = initials
7
8>>> beach_boys.loc[["BW"]]
9 first_name last_name instrument date_of_birth
10BW Brian Wilson Bass 20-Jun-1942
11
12>>> beach_boys.iloc[[1]]
13 first_name last_name instrument date_of_birth
14ML Mike Love Saxophone 15-Mar-1941
15
16>>> beach_boys.loc["BW":"BJ"]
17 first_name last_name instrument date_of_birth
18BW Brian Wilson Bass 20-Jun-1942
19ML Mike Love Saxophone 15-Mar-1941
20AJ Al Jardine Guitar 03-Sep-1942
21BJ Bruce Johnston Bass 27-Jun-1942
22
23>>> beach_boys.iloc[1:4]
24 first_name last_name instrument date_of_birth
25ML Mike Love Saxophone 15-Mar-1941
26AJ Al Jardine Guitar 03-Sep-1942
27BJ Bruce Johnston Bass 27-Jun-1942
In line 8, you accessed the first row, whose index label is BW. By passing the label as a list to .loc[["BW"]], your output becomes a DataFrame. You could have passed it in directly as a string, but this would have produced a pandas Series, which behaves differently. The main point to note is that by passing an index value into .loc[], you returned the row whose index label is that value.
Compare this to the output after line 12. The code accesses row number 1, this time with .iloc[[1]]. Notice how the output is different. Using .iloc[[1]] means that you’ve selected the second row of the DataFrame. Remember, .iloc[] treats the DataFrame’s first row as 0, regardless of the actual index label.
In line 16, you selected the first four rows by passing the labels of the first and fourth rows into .loc[].
Note: This slicing returns the BW and BJ records, plus everything in between, regardless of alphabetical order. It’s sometimes wise to sort an index using the DataFrame’s .sort_index() method before slicing if an alphabetical selection of rows is required.
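For instance, with the initials index used above, sorting first gives you an alphabetical label slice. A quick sketch:

# Sort the index alphabetically, then slice by label. With the initials
# index, this selects AJ, BC, BJ, and BW.
print(beach_boys.sort_index().loc["AJ":"BW"])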
Finally, when you passed the slice 1:4 into .iloc[] in line 23, you selected the rows starting at index position one, which is the second row, and ending at index position three. Again, this is because rows are numbered from zero and because the last slice parameter, position 4, is excluded. Remember, .iloc[] uses a half-open interval.
Now, it’s time for you to try another challenge. If you get it wrong, you can always Do It Again:
Once again, re-read the data from band_members.csv into a fresh DataFrame. Now, see if you can create the DataFrame shown below by assigning an appropriate list to beach_boys.index, then select the bottom two rows using both .loc[] and .iloc[]:
>>> beach_boys
first_name last_name instrument date_of_birth
2 Brian Wilson Bass 20-Jun-1942
4 Mike Love Saxophone 15-Mar-1941
6 Al Jardine Guitar 03-Sep-1942
8 Bruce Johnston Bass 27-Jun-1942
10 Carl Wilson Guitar 21-Dec-1946
12 Dennis Wilson Drums 04-Dec-1944
14 David Marks Guitar 22-Aug-1948
16 Ricky Fataar Drums 05-Sep-1952
18 Blondie Chaplin Guitar 07-Jul-1951
This time, your index contains even numbers starting at 2.
You’ll find a solution in the Solutions.ipynb notebook provided in the downloadable materials.
So far, you’ve used .reset_index() and .index to reset a pandas DataFrame index. Now it’s time to take a carve and look at yet another alternative.
Reset an Index Directly With .set_axis()
A third way to reset an index is to use the DataFrame’s .set_axis() method. This method allows you to assign a new RangeIndex object to your DataFrame, and also allows you to change the column labels.
To alter the existing index of a DataFrame, you can pass .set_axis() a range object created with the built-in range() constructor. This will assign a sequence of ascending integers, starting at zero, to the index:
>>> beach_boys = pd.read_csv(
... "band_members.csv"
... ).convert_dtypes(dtype_backend="pyarrow")
>>> beach_boys.set_axis(range(len(beach_boys)))
first_name last_name instrument date_of_birth
0 Brian Wilson Bass 20-Jun-1942
1 Mike Love Saxophone 15-Mar-1941
2 Al Jardine Guitar 03-Sep-1942
3 Bruce Johnston Bass 27-Jun-1942
4 Carl Wilson Guitar 21-Dec-1946
5 Dennis Wilson Drums 04-Dec-1944
6 David Marks Guitar 22-Aug-1948
7 Ricky Fataar Drums 05-Sep-1952
8 Blondie Chaplin Guitar 07-Jul-1951
Here, you used .set_axis() to reset the index back to its default value. To do this, you passed .set_axis() a range whose length was equal to that of the original beach_boys DataFrame. Using len() ensures there’s exactly one number for each row in the DataFrame. The numbers generated are zero-based and just sufficient to cover each row.
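The column-labeling side of .set_axis() isn’t needed for resetting an index, but here’s a quick sketch of what it looks like. The new column names are made up purely for illustration:

# Relabel all four columns in one go. Like the index version, this returns
# a new DataFrame rather than changing beach_boys itself.
renamed = beach_boys.set_axis(
    ["first", "last", "plays", "born"], axis="columns"
)
print(renamed.columns)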
Once again it’s time to check your understanding. Feel free to ask a friend for help, but only if they’re called Rhonda:
Re-read the data from band_members.csv. Now, use .set_axis() to see if you can make it look like this:
>>> beach_boys
first_name last_name instrument date_of_birth
0 Brian Wilson Bass 20-Jun-1942
1 Mike Love Saxophone 15-Mar-1941
4 Al Jardine Guitar 03-Sep-1942
9 Bruce Johnston Bass 27-Jun-1942
16 Carl Wilson Guitar 21-Dec-1946
25 Dennis Wilson Drums 04-Dec-1944
36 David Marks Guitar 22-Aug-1948
49 Ricky Fataar Drums 05-Sep-1952
64 Blondie Chaplin Guitar 07-Jul-1951
This time you want each value of the default index to be squared.
As usual, you’ll find a solution in the Solutions.ipynb notebook provided in the downloadable materials.
Note: At this point, you’ve seen three different ways of resetting a DataFrame index, and you may be wondering which one you should use. To compare their speed when all you want to do is reset your index to its default, you can use the %timeit magic, which is built on the timeit module, in either IPython or a Jupyter Notebook:
In [1]: import pandas as pd
In [2]: beach_boys = pd.read_csv(
...: "band_members.csv"
...: ).convert_dtypes(dtype_backend="pyarrow")
In [3]: %timeit -n 1000 beach_boys.reset_index(drop=True)
22.5 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [4]: %timeit -n 1000 beach_boys.index = pd.RangeIndex(len(beach_boys.index))
3.75 µs ± 307 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [5]: %timeit -n 1000 beach_boys.set_axis(range(len(beach_boys)))
28.1 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
As you can see from these three timings, if all you need is to reset an index to its default, assigning a RangeIndex object directly to your DataFrame’s .index property is the fastest of the three. It’s also the only one of the three that changes the index on the original DataFrame in place. The other two create a copy that you can assign back to the beach_boys variable. However, .reset_index(), while slower, does offer more configuration options.
Note that your exact timing figures will be different from those shown here, but they should show the same trend.
Now that you know how to reset a DataFrame’s index, you’ll move on and learn why you’d want to do this. It’s not quite time to hang loose just yet.
Restore a Sequential Index
When you cleanse data prior to analysis, you often need to remove certain rows from a DataFrame. For example, you may need to remove duplicate or other unwanted rows. When you remove these, the index values of the rows are removed as well. This could lead to gaps appearing in the index. By resetting the index, you can fix this.
Later, in the section about aligning indexes of several DataFrames, you’ll learn how missing values in a DataFrame’s index can wreak havoc when you need to analyze its data. For now, you’ll mess up the index of your lovely beach_boys DataFrame:
1>>> beach_boys = pd.read_csv(
2... "band_members.csv"
3... ).convert_dtypes(dtype_backend="pyarrow")
4
5>>> beach_boys
6 first_name last_name instrument date_of_birth
70 Brian Wilson Bass 20-Jun-1942
81 Mike Love Saxophone 15-Mar-1941
92 Al Jardine Guitar 03-Sep-1942
103 Bruce Johnston Bass 27-Jun-1942
114 Carl Wilson Guitar 21-Dec-1946
125 Dennis Wilson Drums 04-Dec-1944
136 David Marks Guitar 22-Aug-1948
147 Ricky Fataar Drums 05-Sep-1952
158 Blondie Chaplin Guitar 07-Jul-1951
16
17>>> beach_boys.drop(labels=[3, 5])
18 first_name last_name instrument date_of_birth
190 Brian Wilson Bass 20-Jun-1942
201 Mike Love Saxophone 15-Mar-1941
212 Al Jardine Guitar 03-Sep-1942
224 Carl Wilson Guitar 21-Dec-1946
236 David Marks Guitar 22-Aug-1948
247 Ricky Fataar Drums 05-Sep-1952
258 Blondie Chaplin Guitar 07-Jul-1951
As you can see, the DataFrame contains the original nine rows of band members with a default index. To remove some rows, you used the DataFrame’s .drop() method. The rows to be dropped were defined in the Python list provided to its labels parameter. In this case, the rows whose index label is 3 or 5 were both removed. Look carefully at the output produced below line 17 and you’ll see that the index is no longer sequential.
To fix this, you can use any of the techniques you’ve learned so far. Here, you’ll use .reset_index(), which, because it doesn’t change the underlying DataFrame, means you can reuse the original cleanly-indexed DataFrame in later examples:
>>> (
... beach_boys
... .drop(labels=[3, 5])
... .reset_index()
... )
index first_name last_name instrument date_of_birth
0 0 Brian Wilson Bass 20-Jun-1942
1 1 Mike Love Saxophone 15-Mar-1941
2 2 Al Jardine Guitar 03-Sep-1942
3 4 Carl Wilson Guitar 21-Dec-1946
4 6 David Marks Guitar 22-Aug-1948
5 7 Ricky Fataar Drums 05-Sep-1952
6 8 Blondie Chaplin Guitar 07-Jul-1951
As you saw earlier, .reset_index() has added a new default index to your DataFrame. However, you’re not happy just yet because your original index still remains, albeit in a new column. Once again, you need to adjust the drop parameter of .reset_index() to complete the job:
>>> (
... beach_boys
... .drop(labels=[3, 5])
... .reset_index(drop=True)
... )
first_name last_name instrument date_of_birth
0 Brian Wilson Bass 20-Jun-1942
1 Mike Love Saxophone 15-Mar-1941
2 Al Jardine Guitar 03-Sep-1942
3 Carl Wilson Guitar 21-Dec-1946
4 David Marks Guitar 22-Aug-1948
5 Ricky Fataar Drums 05-Sep-1952
6 Blondie Chaplin Guitar 07-Jul-1951
As you can see, by setting drop=True, the original index is now nowhere to be seen, but your shiny new one looks, well, new and shiny.
Once again, it’s time to check your understanding. Oh, and if you can’t figure it out, Don’t Worry Baby:
To start, re-read the data from band_members.csv, then use beach_boys.drop(labels=[3, 5]) to remove some of the rows. You’ve just learned how to reset the index to its default value using .reset_index(drop=True). See if you can do this again using either of the other two techniques you’ve learned. Your answer should look like this:
>>> beach_boys
first_name last_name instrument date_of_birth
0 Brian Wilson Bass 20-Jun-1942
1 Mike Love Saxophone 15-Mar-1941
2 Al Jardine Guitar 03-Sep-1942
3 Carl Wilson Guitar 21-Dec-1946
4 David Marks Guitar 22-Aug-1948
5 Ricky Fataar Drums 05-Sep-1952
6 Blondie Chaplin Guitar 07-Jul-1951
As you can see, the default index has been restored, and the original is nowhere to be found.
As with the other exercises, you’ll find a solution in the Solutions.ipynb notebook provided in the downloadable materials.
All your efforts so far may mean you’re now seeing a twin fin. Next, you’ll get back on your board and get rid of those duplicates.
Remove Duplicate Index Values
You may be surprised to learn that indexes can sometimes have duplicate values. They don’t necessarily need to be unique identifiers. However, duplicates are usually something you want to avoid because they can cause issues. Fortunately, .reset_index() can handle this for you.
Duplicate index values often arise when two DataFrames are merged. The duplication can cause problems with incorrect row selection, slicing, and filtering. Before you can see these problems, you first need to apply some duplicate values to your index:
1>>> beach_boys = pd.read_csv(
2... "band_members.csv"
3... ).convert_dtypes(dtype_backend="pyarrow")
4
5>>> guitar_players = beach_boys.query(
6... "instrument == 'Guitar'"
7... ).reset_index(drop=True)
8
9>>> guitar_players
10 first_name last_name instrument date_of_birth
110 Al Jardine Guitar 03-Sep-1942
121 Carl Wilson Guitar 21-Dec-1946
132 David Marks Guitar 22-Aug-1948
143 Blondie Chaplin Guitar 07-Jul-1951
15
16>>> others = beach_boys.query(
17... "instrument != 'Guitar'"
18... ).reset_index(drop=True)
19
20>>> others
21 first_name last_name instrument date_of_birth
220 Brian Wilson Bass 20-Jun-1942
231 Mike Love Saxophone 15-Mar-1941
242 Bruce Johnston Bass 27-Jun-1942
253 Dennis Wilson Drums 04-Dec-1944
264 Ricky Fataar Drums 05-Sep-1952
Here, you’ve split the original DataFrame into two new ones. The guitar_players DataFrame, shown below line 9, contains the records of those group members who play guitar. The others DataFrame, shown below line 20, contains the rest of the members.
To select the guitar players, you passed the "instrument == 'Guitar'" query string into .query() in lines 5 through 7, which extracts all rows where the instrument column values match "Guitar".
Lines 16 through 18 use similar code that creates a second DataFrame containing the other rows. In this case, musicians who aren’t marked as guitar players.
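If you prefer boolean indexing over query strings, a sketch of an equivalent way to build the same two DataFrames looks like this:

# Build a boolean mask, then use it (and its negation) to split the rows.
is_guitarist = beach_boys["instrument"] == "Guitar"
guitar_players = beach_boys[is_guitarist].reset_index(drop=True)
others = beach_boys[~is_guitarist].reset_index(drop=True)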
In both cases, .reset_index() was used to make sure the index in both new DataFrames was sequential. This ensured that some identical index values appear across both DataFrames. When you concatenate the two back together, you might think you’ll get back to your original DataFrame, but you won’t:
>>> all_beach_boys = pd.concat([guitar_players, others])
>>> all_beach_boys
first_name last_name instrument date_of_birth
0 Al Jardine Guitar 03-Sep-1942
1 Carl Wilson Guitar 21-Dec-1946
2 David Marks Guitar 22-Aug-1948
3 Blondie Chaplin Guitar 07-Jul-1951
0 Brian Wilson Bass 20-Jun-1942
1 Mike Love Saxophone 15-Mar-1941
2 Bruce Johnston Bass 27-Jun-1942
3 Dennis Wilson Drums 04-Dec-1944
4 Ricky Fataar Drums 05-Sep-1952
You’ve created a single new DataFrame using concat(). By passing it both guitar_players and others, your new all_beach_boys DataFrame shows the original nine band members once more, but the index contains duplicates. Now that you have a DataFrame with duplicate indexes, you’ll investigate the problems this can cause.
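One quick way to confirm that an index contains duplicates is to ask the index object directly, as in this small sketch:

# Both checks confirm that the concatenated index has repeated labels.
print(all_beach_boys.index.is_unique)           # False
print(all_beach_boys.index.duplicated().any())  # True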
Suppose you want to select the fourth row, that is, the row whose index position is 3. You can’t use .loc[] to do this because the duplicate index causes issues. Run the code below and you’ll see the problem:
>>> all_beach_boys.loc[3]
first_name last_name instrument date_of_birth
3 Blondie Chaplin Guitar 07-Jul-1951
3 Dennis Wilson Drums 04-Dec-1944
>>> all_beach_boys.iloc[[3]]
first_name last_name instrument date_of_birth
3 Blondie Chaplin Guitar 07-Jul-1951
Take a look at the output of each selection. As you can see, because .loc[] selected rows whose index label is 3, two records were returned. To fix this, you’d need to use .iloc[] to select the required single row.
The duplicate index values also cause havoc when you try to select contiguous rows using slicing. Suppose you wanted to see the rows at index positions three and four. Your first attempt might be to try .loc[]:
>>> all_beach_boys.loc[3:4]
Traceback (most recent call last):
...
KeyError: 'Cannot get left slice bound for non-unique label: 3'
>>> all_beach_boys.iloc[3:5]
first_name last_name instrument date_of_birth
3 Blondie Chaplin Guitar 07-Jul-1951
0 Brian Wilson Bass 20-Jun-1942
As you can see, when you try to pass the required index positions into .loc[], it raises a KeyError exception because you have a non-unique label. To fix this, you’d need to resort to using .iloc[] instead.
Note: It’s possible to use slicing with .loc[] when duplicate indexes are involved. Before doing this, make sure you sort the index first using .sort_index(). Otherwise, you’ll raise a KeyError exception due to the duplicate values:
>>> all_beach_boys.sort_index().loc[3:4]
first_name last_name instrument date_of_birth
3 Blondie Chaplin Guitar 07-Jul-1951
3 Dennis Wilson Drums 04-Dec-1944
4 Ricky Fataar Drums 05-Sep-1952
As you can see, the rows with index labels 3 and 4 have been returned.
Suppose you now want to see those elements with index labels 1 and 3. This time, you use the DataFrame’s .filter() method:
>>> all_beach_boys.filter(items=[1, 3], axis="index")
Traceback (most recent call last):
...
ValueError: cannot reindex on an axis with duplicate labels
You’ve tried to filter the DataFrame with index labels 1 and 3 by passing these in as a list to the items parameter of .filter(). You’ve also set the axis parameter to "index" to apply the filter to the index. Despite your code being technically correct, the result is a ValueError exception because of the duplicate labels.
At the moment, you’re probably feeling a bit rag dolled with all of these setbacks. It’s time to get back on your board again and solve the problems:
See if you can fix this code so that both .loc[] and .iloc[] produce the same results, and so .filter() works as expected when applied to all_beach_boys. Also, make sure the problematic old index has been removed:
>>> all_beach_boys.loc[[3]]
first_name last_name instrument date_of_birth
3 Blondie Chaplin Guitar 07-Jul-1951
>>> all_beach_boys.iloc[[3]]
first_name last_name instrument date_of_birth
3 Blondie Chaplin Guitar 07-Jul-1951
>>> all_beach_boys.filter(items=[1, 3], axis="index")
first_name last_name instrument date_of_birth
1 Carl Wilson Guitar 21-Dec-1946
3 Blondie Chaplin Guitar 07-Jul-1951
Well done. Everything works!
Now suppose you want to promote an existing column to an index. Is this possible? It’s time to drop in, catch a wave, and see.
Use an Existing Column as an Index
While the default sequential numeric index provides a unique accessor to the rows within your DataFrame, it’s unlikely that it’ll have any inherent meaning. For example, the numbers assigned to each row of the DataFrames you’ve used up to this point have no meaning relative to the data they’re indexing.
In the DataFrame read in from band_members.csv, band member Brian Wilson has an index value of 0 simply because he appears first in the file. This has no inherent meaning, even though it may offend you if you’re a fan and think he should be number 1.
While the concept of using an unrelated sequential index will be familiar to you if you’ve worked with keys in a relational database, you may want something more meaningful in your DataFrames.
If you want more user-friendly index labels, you could use an existing column and promote it to the index. To do this, you use the .set_index() method. Although you can promote any existing column to become the index, keep in mind that unless your intended column contains unique values, you’ll still run into the same problems with duplicates that you saw earlier.
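Before promoting a column, it can be worth checking whether its values are actually unique. A quick sketch with the beach_boys data:

# first_name happens to be unique here, but last_name isn't (three Wilsons).
print(beach_boys["first_name"].is_unique)  # True
print(beach_boys["last_name"].is_unique)   # False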
Suppose you want to reset your index to contain the first_name labels. You could do this as shown here:
>>> beach_boys = pd.read_csv(
... "band_members.csv"
... ).convert_dtypes(dtype_backend="pyarrow")
>>> (
... beach_boys
... .set_index("first_name")
... .loc[["Brian", "Carl"]]
... )
last_name instrument date_of_birth
first_name
Brian Wilson Bass 20-Jun-1942
Carl Wilson Guitar 21-Dec-1946
When you call .set_index() on beach_boys and pass in "first_name", this column gets promoted to the index. You can then use .loc[] to select one or more rows using the first names of the musicians you’re interested in.
In some cases, you may wish to reset your index so that its existing values become more meaningful to the data they’re indexing. While you can’t use .reset_index() for this, you could apply something more suitable to the DataFrame’s .index attribute.
A common example is using employee identifiers instead of plain sequential integers. The code below updates the existing sequential index with more contextual values:
>>> beach_boys.index = [f"Employee_{x + 1}" for x in range(len(beach_boys))]
>>> beach_boys
first_name last_name instrument date_of_birth
Employee_1 Brian Wilson Bass 20-Jun-1942
Employee_2 Mike Love Saxophone 15-Mar-1941
Employee_3 Al Jardine Guitar 03-Sep-1942
Employee_4 Bruce Johnston Bass 27-Jun-1942
Employee_5 Carl Wilson Guitar 21-Dec-1946
Employee_6 Dennis Wilson Drums 04-Dec-1944
Employee_7 David Marks Guitar 22-Aug-1948
Employee_8 Ricky Fataar Drums 05-Sep-1952
Employee_9 Blondie Chaplin Guitar 07-Jul-1951
In this code, you used a list comprehension to create a list of strings containing ["Employee_1", "Employee_2", ...]. This then gets assigned to the .index attribute of beach_boys.
Now that you have a more meaningful index, you can use it in the same way you used the default numeric index. For example, you can select rows by their new Employee_ values:
>>> beach_boys.loc[["Employee_4"]]
first_name last_name instrument date_of_birth
Employee_4 Bruce Johnston Bass 27-Jun-1942
In the code above, you used .loc[] to select the record of Employee_4.
Now for a bit of a mind-bender. Wouldn’t it be nice if you could do this:
Using the original beach_boys DataFrame read in from band_members.csv, create an index consisting of the usernames of staff in the format <last_name><initial>. Your final result should look like this:
>>> beach_boys
first_name last_name instrument date_of_birth
WilsonB Brian Wilson Bass 20-Jun-1942
LoveM Mike Love Saxophone 15-Mar-1941
JardineA Al Jardine Guitar 03-Sep-1942
JohnstonB Bruce Johnston Bass 27-Jun-1942
WilsonC Carl Wilson Guitar 21-Dec-1946
WilsonD Dennis Wilson Drums 04-Dec-1944
MarksD David Marks Guitar 22-Aug-1948
FataarR Ricky Fataar Drums 05-Sep-1952
ChaplinB Blondie Chaplin Guitar 07-Jul-1951
As you can see, the index now contains a common username format, which will make it straightforward to select users by their usernames.
Next, you’ll gain some experience tandem surfing. It’s time to use not one, but two DataFrames together.
Align Indexes of Several DataFrames
One of the great features of working with pandas DataFrames is that you can use the basic arithmetic operators to add their data together. Unfortunately, you’ll only be allowed this convenience if their indexes align. Otherwise, you’ll run into problems.
Suppose you’re analyzing the weekly sales of records from a record store. Two weeks of sales data are stored inside two CSV files named week1_record_sales.csv and week2_record_sales.csv. For demonstration purposes, both files contain identical sales data, but their indexes are different:
>>> week1_sales = pd.read_csv(
... "week1_record_sales.csv"
... ).set_index("index")
>>> week2_sales = pd.read_csv(
... "week2_record_sales.csv"
... ).set_index("index")
>>> week1_sales
day sales
index
0 Mon 100
1 Tue 150
2 Wed 200
3 Thu 250
4 Fri 300
>>> week2_sales
day sales
index
1 Mon 100
2 Tue 150
3 Wed 200
4 Thu 250
5 Fri 300
Each file is read into a DataFrame and contains daily sales information. Every row is identified by its index column, which has been set as the DataFrame’s index with .set_index().
Suppose you now want to find the total sales of both weeks. This should be achievable with little more than a simple arithmetic operation:
>>> week1_sales["sales"] + week2_sales["sales"]
index
0 NaN
1 250.0
2 350.0
3 450.0
4 550.0
5 NaN
Name: sales, dtype: float64
As you can see, something’s gone wrong. Because both DataFrames contain the same data, you’d expect the answers to be twice their original values. Instead, the first and last answers are NaN, meaning an arithmetic calculation couldn’t be performed due to missing values. Also, the remaining results are incorrect.
Both of these issues were caused by mismatched indexes. The NaN values have appeared because neither index 0 nor index 5 appears in both of your DataFrames. The calculations are wrong because, for example, the Wednesday sales figure of 200 is indexed as 2 in your first DataFrame, while index 2 refers to the Tuesday sales of 150 in your second DataFrame. When you add these, the result is meaningless.
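You can reproduce this alignment behavior with two tiny Series, independent of the CSV files. A minimal sketch:

import pandas as pd

# Only label 1 exists in both Series, so it's the only label that adds cleanly.
s1 = pd.Series([100, 150], index=[0, 1])
s2 = pd.Series([100, 150], index=[1, 2])
print(s1 + s2)  # labels 0 and 2 come back as NaN, label 1 becomes 250.0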
You can also merge two DataFrames in the same way that relational database tables can be merged. This allows you to see matching data from both DataFrames in the same place. Again, if you join on index, each index value must refer to the related data in both DataFrames.
For example, say you wanted to see all data for both sales weeks. To do this, you can use the DataFrame’s .merge() method:
>>> week1_sales.merge(week2_sales, left_index=True, right_index=True)
day_x sales_x day_y sales_y
index
1 Tue 150 Mon 100
2 Wed 200 Tue 150
3 Thu 250 Wed 200
4 Fri 300 Thu 250
Right away, you can see some problems. Neither the record with index 0 nor the one with index 5 is anywhere to be seen. The daily figures don’t align either.
Here, you’ve performed an inner join of both DataFrames, meaning only records whose index values appear in both DataFrames will be merged. Because index values 0 and 5 don’t appear in both, they’re not included in the merge. The days are mismatched because the same index refers to different days in each DataFrame.
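If you want to see exactly which index values fail to match, an outer join makes the gaps explicit. A quick sketch, assuming the two DataFrames above:

# how="outer" keeps every index label from both sides, filling the columns
# of the week that lacks a label with missing values.
print(week1_sales.merge(
    week2_sales, left_index=True, right_index=True, how="outer"
))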
To fix both of these problems, you need to make sure both DataFrames are using the same index. One way would be to reset the index on week2_sales to its default. This will then match that used by week1_sales, but only because the daily data for both DataFrames are already in the same order:
>>> week2_sales = week2_sales.reset_index(drop=True)
As before, to reset the index back to its default, you use .reset_index() and pass True to its drop parameter to remove the problematic original index. Now, when you run both of the previous pieces of code, the results are far more palatable:
>>> week1_sales["sales"] + week2_sales["sales"]
index
0 200
1 300
2 400
3 500
4 600
Name: sales, dtype: int64
>>> week1_sales.merge(week2_sales, left_index=True, right_index=True)
day_x sales_x day_y sales_y
index
0 Mon 100 Mon 100
1 Tue 150 Tue 150
2 Wed 200 Wed 200
3 Thu 250 Thu 250
4 Fri 300 Fri 300
As you can see, everything now matches and nothing is missing. Aligning the indexes has solved both issues in one fell swoop. However, this example only worked because the rows within the DataFrames were in the correct order to begin with. This won’t always be the case.
This time you’re going to think laterally. So give your brain a Wipeout and see if you can solve this exercise:
Suppose you’re happy with the index of week2_sales, but not with that of the week1_sales DataFrame. See if you can use one of the techniques you learned about earlier to apply the index from week2_sales to week1_sales. Don’t forget to make sure both the addition and .merge() still produce the correct output.
So far, you’ve merged the DataFrames on numerical indexes. Can you think of a better alternative that would still work even if both original DataFrames had their rows in a different order? Again, make sure that both the addition and .merge() produce the correct result.
Remember, you can always take a look at the solution in your downloaded materials.
To round off your learning experience, you’ll finish up by learning how to reset multi-level indexes on DataFrames. Not something surfers normally do, unless they’re Pythonistas like you.
Reset Multi-Indexes
Each of the DataFrames you’ve been working with so far has used a single, flat Index object. DataFrames also support MultiIndex objects, which provide hierarchical, or multi-level, indexes for your DataFrames.
In this section, you start to wake up from your California Dreaming of the Beach Boys and decide to have breakfast. You’ll use the cereals.csv file to help you decide which cereal to eat. This file contains data about various popular breakfast cereals from a range of manufacturers. The original data comes from Kaggle and is freely available under the Creative Commons License. Here, you’re using a cut-down version of it.
The first thing that you’ll need to do is read the cereals data into a DataFrame:
>>> cereals = pd.read_csv("cereals.csv").convert_dtypes(
... dtype_backend="pyarrow"
... )
>>> cereals.head()
name manufacturer type fiber
0 100% Bran Nabisco Cold 10.0
1 100% Natural Bran Quaker Oats Cold 2.0
2 All-Bran Kelloggs Cold 9.0
3 All-Bran with Extra Fiber Kelloggs Cold 14.0
4 Almond Delight Ralston Purina Cold 1.0
As you can see, the file contains details of different breakfast cereals. When you call the DataFrame’s .head() method, you see the first five records, which reveal each cereal’s name and manufacturer. You can also see its type, which tells you whether the cereal is to be eaten hot or cold, as well as its fiber content.
Not surprisingly, this DataFrame has a simple index. One quick way to create a MultiIndex is to create a pivot table from it:
>>> cereals.pivot_table(
... values="fiber",
... index=["manufacturer", "type"],
... aggfunc="mean",
... )
fiber
manufacturer type
American Home Food Products Hot 0.0
General Mills Cold 1.272727
Kelloggs Cold 2.73913
Nabisco Cold 4.6
Hot 1.0
Post Cold 2.777778
Quaker Oats Cold 1.142857
Hot 2.7
Ralston Purina Cold 1.875
This pivot table, which is actually another DataFrame, analyzes the raw data by working out the average fiber content for each type of cereal for each manufacturer. The index of this DataFrame is a bit different from what you’re used to seeing:
>>> cereals.pivot_table(
... values="fiber",
... index=["manufacturer", "type"],
... aggfunc="mean",
... ).index
MultiIndex([('American Home Food Products', 'Hot'),
( 'General Mills', 'Cold'),
( 'Kelloggs', 'Cold'),
( 'Nabisco', 'Cold'),
( 'Nabisco', 'Hot'),
( 'Post', 'Cold'),
( 'Quaker Oats', 'Cold'),
( 'Quaker Oats', 'Hot'),
( 'Ralston Purina', 'Cold')],
names=['manufacturer', 'type'])
The MultiIndex of this pivot table is built from both the manufacturer and type columns. Instead of the simple single-level index you’ve seen so far, you now have a more complex multi-level structure.
In this example, you’ve created a MultiIndex consisting of two levels. These are defined by passing "manufacturer" and "type" to the index parameter of .pivot_table().
Within the MultiIndex object, manufacturer is known as level 0, while type is level 1. These level numbers are important if you need to reset the index because they allow you to do so on a specific level, or even reset it completely.
Suppose you wanted to reset only level 1, which corresponds to type. Resetting it relegates it to a separate column, and dropping it at the same time removes it completely:
>>> cereals = pd.read_csv("cereals.csv").convert_dtypes(dtype_backend="pyarrow")
>>> cereals.pivot_table(
... values="fiber",
... index=["manufacturer", "type"],
... aggfunc="mean"
... ).reset_index(level=1, drop=True)
fiber
manufacturer
American Home Food Products 0.0
General Mills 1.272727
Kelloggs 2.73913
Nabisco 4.6
Nabisco 1.0
Post 2.777778
Quaker Oats 1.142857
Quaker Oats 2.7
Ralston Purina 1.875
By passing both level=1 and drop=True to .reset_index(), you drop the type details, retaining only the manufacturer information as a simple Index.
Take care when doing this, because you’ve now created duplicate index values, with all of the problems you saw earlier. In addition, your data has now lost some of its meaning. For example, there’s confusion over what both the Nabisco and Quaker Oats labels are showing you. You no longer know which one refers to the hot cereals and which to the cold.
Be aware that resetting part of a MultiIndex can have undesirable side effects that aren’t always obvious.
Time to Catch a Wave and go on a Surfin’ Safari through your penultimate challenge:
Using the previous code for guidance, see if you can produce this exact output to clear up the Nabisco and Quaker Oats confusion:
>>> cereals
manufacturer fiber
type
Cold General Mills 1.272727
Cold Kelloggs 2.73913
Cold Nabisco 4.6
Cold Post 2.777778
Cold Quaker Oats 1.142857
Cold Ralston Purina 1.875
Hot American Home Food Products 0.0
Hot Nabisco 1.0
Hot Quaker Oats 2.7
The type information forms the index, while the manufacturer information is back in a column.
Sometimes, it’s better to reset all levels of a MultiIndex but retain all the data. When you apply .reset_index() to a MultiIndex using its default parameters, you’ll replace the complete index with a simple default version and create additional columns from the original index inside your DataFrame:
>>> cereals.pivot_table(
... values="fiber",
... index=["manufacturer", "type"],
... aggfunc="mean",
... ).reset_index()
manufacturer type fiber
0 American Home Food Products Hot 0.0
1 General Mills Cold 1.272727
2 Kelloggs Cold 2.73913
3 Nabisco Cold 4.6
4 Nabisco Hot 1.0
5 Post Cold 2.777778
6 Quaker Oats Cold 1.142857
7 Quaker Oats Hot 2.7
8 Ralston Purina Cold 1.875
This time, your DataFrame has a default index applied to it, but it also has new manufacturer and type columns, as well as the aggregated fiber column’s data. Most importantly, the data hasn’t lost any of its meaning.
This final challenge will test those Good Vibrations you now have about your ability to work with DataFrame indexes.
See if you can solve the undesirable side effects of your original pivot table by flattening the MultiIndex. Perhaps it’ll look something like this:
>>> cereals
fiber
(American Home Food Products, Hot) 0.0
(General Mills, Cold) 1.272727
(Kelloggs, Cold) 2.73913
(Nabisco, Cold) 4.6
(Nabisco, Hot) 1.0
(Post, Cold) 2.777778
(Quaker Oats, Cold) 1.142857
(Quaker Oats, Hot) 2.7
(Ralston Purina, Cold) 1.875
The index now consists of one level instead of two, and each of its labels is a tuple.
That’s it! You should be stoked that you now have a comprehensive understanding of how to reset the index on your DataFrames using several techniques. You’ve also had a bonus lesson on how to incorporate Beach Boys song titles and corny surfing references into a Python tutorial. A truly rad original from your friends at Real Python.
Conclusion
In this tutorial, you learned that while the .reset_index() method is the most customizable way to reset an index, it isn’t the fastest option. However, it is useful when you’re working with MultiIndex resets.
You also learned that directly applying the index to the DataFrame’s .index property is the quickest way to reset an index, and that it changes the original DataFrame in the process. Additionally, you discovered how the .set_axis() method allows you to reset and re-label your index should you want to.
Congratulations on completing this tutorial, and enjoy applying these newfound skills to better prepare your DataFrames for analysis. Happy surfing!