Accessing Data in a DataFrame
00:00 Since a DataFrame is a collection of Series, everything you learned in the previous lesson also applies to DataFrames. But DataFrames are two-dimensional, so indexing them is a little different.
00:13 A DataFrame is conceptually like a Python dictionary, where the keys are the column names and each value is a Series. Recall the city_data DataFrame from the previous lesson.
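To make the dictionary analogy concrete, here is a minimal sketch. The city_data values below are assumed for illustration and may not match the ones used in the previous lesson.

```python
import pandas as pd

# Assumed sample data; the lesson's actual values may differ.
city_data = pd.DataFrame(
    {"revenue": [4200, 8000, 6500], "employee_count": [5, 8, 3]},
    index=["Amsterdam", "Tokyo", "Toronto"],
)

# Like dictionary keys, the column names are available through .keys().
print(list(city_data.keys()))

# Like a dictionary value, a column looked up by name is a Series.
revenue = city_data["revenue"]
print(type(revenue).__name__)

# Membership tests check the column names, as they would check dict keys.
print("revenue" in city_data)
```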
00:26 It has a 'revenue' column, and the values in that column are stored in a Series with the city names as the index. For column names that are strings and valid Python identifiers, you can treat them like attributes of the DataFrame and get each Series using dot notation.
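As a rough illustration, the two access styles can be compared side by side. The city_data values here are assumed sample data, not necessarily those from the lesson.

```python
import pandas as pd

# Assumed sample data; the lesson's actual values may differ.
city_data = pd.DataFrame(
    {"revenue": [4200, 8000, 6500], "employee_count": [5, 8, 3]},
    index=["Amsterdam", "Tokyo", "Toronto"],
)

# Access the column with the indexing operator...
by_key = city_data["revenue"]

# ...or as an attribute, using dot notation.
by_attr = city_data.revenue

# Both give the same Series, indexed by city name.
print(by_attr)
print(by_key.equals(by_attr))
```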
00:43 Keep in mind that dot notation will not work if the column name is also a DataFrame attribute or method name. For example, if you had a column named 'shape', you could access it with the indexing operator but not with dot notation. .shape is an attribute of the DataFrame and will always return the dimensions of the DataFrame. In general, dot notation should only be used in interactive sessions, such as a Jupyter Notebook.
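The name clash can be seen with a small, hypothetical DataFrame (the toys name and its contents are invented for this sketch):

```python
import pandas as pd

# Hypothetical DataFrame with a column that shares its name
# with the DataFrame attribute .shape.
toys = pd.DataFrame(
    {"name": ["ball", "dice"], "shape": ["sphere", "cube"]}
)

# The indexing operator returns the column as a Series.
print(toys["shape"])

# Dot notation returns the DataFrame's dimensions, not the column.
print(toys.shape)  # (2, 2)
```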
01:11 You can use the .loc attribute to get a particular row in a DataFrame with the row’s label. The .iloc attribute will use the zero-based positional index of the row.
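A short sketch of both row lookups, again using assumed sample values for city_data:

```python
import pandas as pd

# Assumed sample data; the lesson's actual values may differ.
city_data = pd.DataFrame(
    {"revenue": [4200, 8000, 6500], "employee_count": [5, 8, 3]},
    index=["Amsterdam", "Tokyo", "Toronto"],
)

# .loc looks the row up by its label.
row_by_label = city_data.loc["Amsterdam"]
print(row_by_label)

# .iloc looks the row up by its zero-based position (1 is the second row).
row_by_position = city_data.iloc[1]
print(row_by_position)
```

Each result is a Series whose index is the DataFrame's column names.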
01:25 Additionally, you can slice rows using the .loc attribute.
01:31 This will select all rows, starting at the label 'Tokyo' up to and including 'Toronto'. The .loc attribute includes the upper bound.
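The slice described above might look like the following, with assumed sample values for city_data. A positional .iloc slice is included for contrast, since it excludes its upper bound.

```python
import pandas as pd

# Assumed sample data; the lesson's actual values may differ.
city_data = pd.DataFrame(
    {"revenue": [4200, 8000, 6500], "employee_count": [5, 8, 3]},
    index=["Amsterdam", "Tokyo", "Toronto"],
)

# Label-based slice: both 'Tokyo' and 'Toronto' are included.
by_label = city_data.loc["Tokyo":"Toronto"]
print(by_label)

# Position-based slice: the stop position (2) is excluded, as with lists.
by_position = city_data.iloc[1:2]
print(by_position)
```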
01:43 Another trick that works on Python lists is negative indexing. For example, the second-to-last item in a list could be found at index -2. The same goes for the second-to-last row of a DataFrame using the .iloc attribute. Try it out on the nba DataFrame.
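Here is a minimal sketch of negative indexing. It uses a small stand-in DataFrame with invented values rather than the real nba dataset, so the rows shown are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the nba DataFrame; the values are invented.
games = pd.DataFrame(
    {
        "fran_id": ["Knicks", "Celtics", "Lakers", "Bulls"],
        "pts": [68, 63, 47, 33],
    }
)

# Negative indexing on a plain Python list.
points = list(games["pts"])
print(points[-2])

# The same idea with .iloc selects the second-to-last row.
second_to_last = games.iloc[-2]
print(second_to_last)
```

With the real dataset loaded as nba, the equivalent call would be nba.iloc[-2].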
02:04 In the previous lesson, the .loc and .iloc attributes used only a single value, but DataFrames have a second dimension, and the .loc and .iloc attributes have been extended to take advantage of it.
02:18 What if you wanted to get the revenue column for the cities 'Amsterdam' through 'Tokyo'? Similar to how you would index a multidimensional NumPy array, you would just add the column name after the row selection, separated by a comma.
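In code, that two-dimensional selection could be sketched as follows, using assumed sample values for city_data:

```python
import pandas as pd

# Assumed sample data; the lesson's actual values may differ.
city_data = pd.DataFrame(
    {"revenue": [4200, 8000, 6500], "employee_count": [5, 8, 3]},
    index=["Amsterdam", "Tokyo", "Toronto"],
)

# First the row slice, then the column name, separated by a comma.
revenue_slice = city_data.loc["Amsterdam":"Tokyo", "revenue"]
print(revenue_slice)

# The positional equivalent with .iloc: rows 0 and 1, column 0.
same_by_position = city_data.iloc[0:2, 0]
print(revenue_slice.equals(same_by_position))
```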
02:33 And you can select multiple columns. Try this with the nba DataFrame. Select all games with the labels 5555 through 5559.
02:44 Then select the 'fran_id' (franchise ID), 'opp_fran' (opposition franchise), 'pts' (points), and 'opp_pts' (opposition points) columns.
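The selection can be sketched like this. The DataFrame below is a small stand-in with invented values, and its extra game_location column is included only to show that unlisted columns are left out. It is not the real nba dataset.

```python
import pandas as pd

# Hypothetical stand-in for the nba DataFrame; all values are invented.
nba = pd.DataFrame(
    {
        "fran_id": ["Knicks", "Celtics", "Lakers", "Bulls", "Pistons", "Kings", "Hawks"],
        "opp_fran": ["Celtics", "Knicks", "Bulls", "Lakers", "Kings", "Pistons", "Knicks"],
        "pts": [68, 63, 47, 33, 59, 61, 70],
        "opp_pts": [63, 68, 33, 47, 61, 59, 66],
        "game_location": ["H", "A", "H", "A", "H", "A", "H"],
    },
    index=range(5554, 5561),  # row labels 5554 through 5560
)

# Rows labeled 5555 through 5559 (inclusive), restricted to four columns.
selection = nba.loc[5555:5559, ["fran_id", "opp_fran", "pts", "opp_pts"]]
print(selection)
```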
02:54 Simply include the column names as a list in square brackets. The result leaves out the columns you don’t need. In the next lesson, you’ll learn how to use queries to select data more precisely.