Querying Your Dataset
00:00 So far, you’ve been using the columns and index to select data from your dataset. Using queries, you can get more precise and select data based on the values.
00:12
Here’s how to select all games from the nba
DataFrame
played after the year 2010.
00:20
Now notice that this returns a DataFrame
. The columns are still there, but the number of rows is smaller because only games played after 2010 have been included.
00:33
So, how does this work? First, look at the query itself. This evaluates the 'year_id'
column and compares each value to 2010
. If the value is greater, the result is True
, and False
if it is less or equal.
00:51
Then all of the results are returned in a Series
. Note that this Series
has all 126,314 rows in the nba
DataFrame
. However, if you query the DataFrame
, it will only return the rows where a True
value exists in the Series
.
01:11
This DataFrame
contains only the 15,000 or so games played after 2010.
01:19
You can get quite detailed with queries. For example, the notes
column in the nba
DataFrame
does not always contain data.
01:27
First of all, verify this with the .info()
method. It seems that only 5,400 games have notes. How can you weed out the games with no notes? The notes
column will be null for the 120,000 or so games that have no notes.
01:44
Therefore, you can use the .notnull()
method to test the 'notes'
column. Just like the previous demo, this returns a Series
of bools, and you can use this to query the entire DataFrame
to include just the data you want.
02:01
Remember the object
data type? It’s how strings are stored in the DataFrame
. You can access the .str
attribute of a object
column and call string methods to filter the column on the values. For example, the 'fran_id'
column stores the name of the franchise of the team.
02:19
Many of them end in 'ers'
—'Lakers'
, 'Cavaliers'
, and so on. Using the .endswith()
method for a string, you can create a filter to get only those teams.
02:33 Sometimes, a single criteria is not enough to select the data you need. You can combine criteria with the logical AND and OR operators.
02:42
The logical AND operator is the single ampersand (&
), while the logical OR operator is the single pipe (|
).
02:50
This next query gets all of the games in the nba
DataFrame
with an ID of 'BLB'
that scored more than 100
points
03:01
and the opposing team scored more than 100
points.
03:07 Notice that the criteria are in parentheses to ensure they are evaluated separately, and it’s easier to remember what you meant the next time you see it.
03:17 In the next lesson, you’ll learn how to summarize your dataset with aggregation and grouping.
Become a Member to join the conversation.