Selecting Data Points
In this lesson you’ll learn about selecting data points in your visualizations. Implementing selection behavior is as easy as adding a few specific keywords when declaring your glyphs. You will start by modifying read_nba_data.py and aggregating data from the player_stats DataFrame.
For even more information about what you can do upon selection, check out Selected and Unselected Glyphs.
File: read_nba_data.py
import pandas as pd
# Read the csv files
player_stats = pd.read_csv('data/2017-18_playerBoxScore.csv',
                           parse_dates=['gmDate'])
team_stats = pd.read_csv('data/2017-18_teamBoxScore.csv',
                         parse_dates=['gmDate'])
standings = pd.read_csv('data/2017-18_standings.csv',
                        parse_dates=['stDate'])
# Create west_top_2
west_top_2 = (standings[(standings['teamAbbr'] == 'HOU') |
                        (standings['teamAbbr'] == 'GS')]
              .loc[:, ['stDate', 'teamAbbr', 'gameWon']]
              .sort_values(['teamAbbr', 'stDate']))
# Find players who took at least 1 three-point shot during the season
three_takers = player_stats[player_stats['play3PA'] > 0]
# Clean up the player names, placing them in a single column
three_takers['name'] = [f'{p["playFNm"]} {p["playLNm"]}'
                        for _, p in three_takers.iterrows()]
# Aggregate the total three-point attempts and makes for each player
three_takers = (three_takers.groupby('name')
                .sum(numeric_only=True)  # skip non-numeric columns such as gmDate
                .loc[:, ['play3PA', 'play3PM']]
                .sort_values('play3PA', ascending=False))
# Filter out anyone who didn't take at least 100 three-point shots
three_takers = three_takers[three_takers['play3PA'] >= 100].reset_index()
# Add a column with a calculated three-point percentage (made/attempted)
three_takers['pct3PM'] = three_takers['play3PM'] / three_takers['play3PA']
File: ThreePointAttVsPct.py
# Bokeh Libraries
from bokeh.plotting import figure, show
from bokeh.io import output_file
from bokeh.models import ColumnDataSource, NumeralTickFormatter
# Import the data
from read_nba_data import three_takers
# Output to file
output_file('three_point_att_vs_pct.html',
            title='Three-Point Attempts vs. Percentage')
# Store the data in a ColumnDataSource
three_takers_cds = ColumnDataSource(three_takers)
# Specify the selection tools to be made available
select_tools = ['box_select', 'lasso_select', 'poly_select', 'tap', 'reset']
# Create the figure
fig = figure(plot_height=400,
             plot_width=600,
             x_axis_label='Three-Point Shots Attempted',
             y_axis_label='Percentage Made',
             title='3PT Shots Attempted vs. Percentage Made (min. 100 3PA), 2017-18',
             toolbar_location='below',
             tools=select_tools)
# Format the y-axis tick label as percentages
fig.yaxis[0].formatter = NumeralTickFormatter(format='00.0%')
# Add a square representing each player
fig.square(x='play3PA',
           y='pct3PM',
           source=three_takers_cds,
           color='royalblue',
           selection_color='deepskyblue',
           nonselection_color='lightgray',
           nonselection_alpha=0.3)
# Visualize
show(fig)
00:00 This tutorial is going to focus on player stats. In order to be selecting points of data, you’re going to need to go back to read_nba_data.py and modify the DataFrames coming in. And just below where you created the west_top_2,
00:16 you’re going to create information about three-point shots. So to start with, you’re going to look for and find the players who took at least one three-point shot during the season.
00:28 This DataFrame is going to be called three_takers, and it’s going to start with player_stats and then filter down to be player_stats where column 'play3PA' (player three-point attempts) is greater than 0.
00:42 So that will get you the three-point takers, any of them. All right. You’re going to continue to clean up this data by creating a column called 'name'.
00:50 That’ll take the last and first name and combine them into a single column. To do that, you’re going to use f-strings. This new column called 'name' is equal to, in this case, an f-string that takes as its first element
01:09 "playFNm" (player’s first name), space, and then it’s going to be the "playLNm" (player’s last name). So those two objects that you’re replacing with these f-strings here, you’re going to be iterating through all of the rows, and for each p you’re going to grab the column "playFNm", put a space in it, and "playLNm". And then from that,
01:37 for each item p in three_takers, and then you’re going to use .iterrows(). So from three_takers, you’re going to iterate through all of them. For each one of those, you’re going to glue together to create this new name using an f-string.
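The name-building step described above can be tried on its own. What follows is only a sketch on a tiny, made-up DataFrame (the player names are hypothetical; the column names match the lesson's player_stats). It builds the names row by row with .iterrows() and an f-string, as in the lesson, and then compares that with plain column concatenation, which is a common alternative and not what the lesson uses.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the lesson
players = pd.DataFrame({
    'playFNm': ['Ann', 'Bo'],
    'playLNm': ['Lee', 'Kim'],
})

# Row-by-row version, as in the lesson: one f-string per row
names_iter = [f'{p["playFNm"]} {p["playLNm"]}'
              for _, p in players.iterrows()]

# Alternative (not from the lesson): concatenate whole columns at once
names_vec = (players['playFNm'] + ' ' + players['playLNm']).tolist()

print(names_iter)
print(names_vec)
```

Both approaches give the same list of names. The column-based version avoids a Python-level loop, which tends to matter on larger DataFrames.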
01:55 Next, you’re going to do some aggregation.
02:00 You’re going to aggregate the total three-point attempts and three-point shots they’ve made for each player. To do that, start with three_takers, take three_takers and group by the name.
02:18 And then you’re going to sum, and you’re only going to keep these columns, "play3PA" (player three-point attempts) and "play3PM" (player three-pointers made).
02:31 And then you’re going to sort values by three-point attempts.
02:40 ascending will be set to False so it will be descending. Okay, close off that statement. Next, filter out anyone who didn’t take at least 100 three-point shots.
03:03 So, the three-point attempts greater than or equal to 100. And then once you’ve narrowed that down, reset the index. Last, add a column with the calculated three-point percentage.
03:19 Basically, made versus the attempted shots. To do that, you’re going to create a new column. It’s going to be called 'pct3PM' (percent three-point made).
03:28 From three_takers, you’re going to take 'play3PM' (player three-pointers made) divided by three_takers['play3PA']. So it’s going to make this new column with a calculated percentage between these two. Great! Okay, ready to save.
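The aggregation steps described above (group by name, sum, keep two columns, sort, filter to at least 100 attempts, compute the percentage) can be checked on a small example. This is a sketch only: the rows below are made up for illustration, and numeric_only=True is an assumption added here so that the datetime column is skipped when summing, which recent pandas versions require.

```python
import pandas as pd

# Made-up per-game rows; only the column names come from the lesson
games = pd.DataFrame({
    'name': ['Ann Lee', 'Ann Lee', 'Bo Kim', 'Bo Kim'],
    'gmDate': pd.to_datetime(['2018-01-01', '2018-01-03',
                              '2018-01-01', '2018-01-03']),
    'play3PA': [70, 50, 40, 30],
    'play3PM': [30, 18, 10, 5],
})

# Total attempts and makes per player, most attempts first
totals = (games.groupby('name')
          .sum(numeric_only=True)  # assumption: skip the datetime column
          .loc[:, ['play3PA', 'play3PM']]
          .sort_values('play3PA', ascending=False))

# Keep players with at least 100 attempts, then add the percentage column
totals = totals[totals['play3PA'] >= 100].reset_index()
totals['pct3PM'] = totals['play3PM'] / totals['play3PA']

print(totals)
```

With these made-up numbers only one player clears the 100-attempt threshold, so the result has a single row. Calling reset_index() turns name back into a regular column, which is what lets it travel along into a ColumnDataSource later.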
03:44 Let’s briefly look at that data, and I’ll have you look at that data by going into the terminal.
03:51 And bring up the REPL by typing python or python3. And then from read_nba_data import everything by using the asterisk (*). Looks like I made a small mistake here.
04:07 So, inside of here, I didn’t close off the square bracket on this, and so my f-string didn’t complete correctly. Let me attempt to save again. Okay. What do you think of that now? Okay.
04:22 There’s a little warning here about using .iterrows(), it’s okay. We’re going to use three_takers, see if that data’s there. Yep, looks like it’s there. Great! Let’s just take a small sample from it.
04:33 We want 5 players. Great! So these are the row indexes, there’s the names that have been glued together, and then the three-point attempts, and then the three-pointers made, and then the percentage. Looks good!
04:43 So, let’s say you want to select groups of players in the distribution, and in doing so, mute the color of the glyphs representing any non-selected players.
04:52 Let’s create a new visualization. So create a new script. It’s going to be called ThreePointAttVsPct.py. Okay. Let’s start off by bringing in Bokeh libraries. from bokeh.plotting import figure, show, from bokeh.io import output_file, from bokeh.models import ColumnDataSource.
05:19 Then I’ll have you add something else called the NumeralTickFormatter. This is to help with percentages in the ticks. Okay. One other thing you need to do is import the data, this time from read_nba_data import only the three_takers that you created earlier, that DataFrame.
05:38 Okay. Create the output file. The static HTML file’s going to be called 'three_point_att_vs_pct.html'. The title='Three-Point Attempts vs. Percentage'. Great. Okay.
05:56 You need to create a ColumnDataSource, so store the data in a ColumnDataSource. Name it three_takers_cds, for ColumnDataSource.
06:06 And it’s in a ColumnDataSource of the three_takers DataFrame. Great.
06:16 To create this, you’re going to specify for the tool set that you’re going to use, create this list called select_tools. And in that list, you’re going to need 'box_select', 'lasso_select', polygon select (which is just called 'poly_select'), 'tap', and also include 'reset'. Great, so there’s your list.
06:42 Now create the figure. fig = figure() with a plot_height=400, plot_width=600, x-axis labeled 'Three-Point Shots Attempted', and a y-axis label of 'Percentage Made'.
07:09 The title for the image or for this visualization is three-point shots attempted versus percentage made, minimum of 100 three-point attempts per season in 2017-18.
07:27 And the last two items. The toolbar for this one, the location will be on the bottom, so 'below'. And the tools are going to be the select tools you created a little bit earlier.
07:41 Okay. There’s your figure. One thing you’re going to add is to fix the y-axis tick labels to be percentages. For the y-axis,
07:57 .formatter is going to be NumeralTickFormatter() with a format equal to this style, '00.0%'. Okay. Now you’re going to add the glyphs representing each player.
08:13 They’re going to be squares, fig.square. And here you go, selecting columns of your data, 'play3PA' versus 'pct3PM' that you made earlier.
08:28 The source is going to be the ColumnDataSource three_takers_cds. Colors! color you’re going to set up to be 'royalblue'.
08:37 That’s with nothing selected. And then for a selection_color, you use 'deepskyblue', and then for nonselection_color, 'lightgray'.
08:49 And a nonselection_alpha of 0.3. Great! And last step, turn on that visualization by showing the fig. All right, save.
09:04 Down here, run your script.
09:15 Oh, I added an underscore here. That’s incorrect. It should be just .yaxis, not .y_axis. Now if I re-save, let’s try one more time. Okay.
09:28 Here’s your toolbar on the bottom. Right now it’s set up to have the lasso as the default selected tool. How does lasso work? You click and hold and draw around the points that you’d want to select.
09:39 So those new ones, those are deep sky blue versus the light gray of the unselected. Kind of a cool trick is if you hold the Shift key, you can do another lasso selection if you want.
09:49 That works with most of the tools, the Shift key. This is the reset that you chose earlier. This one is for selecting points, individual points, so you could select a single point or you could hold Shift and select a couple.
10:02 Then this is a box selector. Again, you can use Shift if you want to select multiple boxes. And then this polygon select is a little funky. If you click your first two points, you see nothing, but when you click your third point, you’ll start to see your polygon, and then you can continue to click and then to get your selection finished, you double-click. So, it’s a little interesting. So again, click, click, click, and you get your triangle, and then you can kind of go from there and then double-click to finish your selection. Great! Now I noticed something back in my code.
10:34 I said, #Specity, not #Specify.
10:39 Just fixing that little typo. Great! Let’s move on to a little bit more interaction with hovering.
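Whichever selection tool you use, Bokeh records the result as a list of row indices on the data source. The sketch below uses made-up values in place of three_takers and assumes a reasonably recent Bokeh version. It sets the indices directly in Python, which is one way to pre-select points before the plot is shown. In a standalone HTML file, selections made with the tools stay in the browser and are not sent back to Python.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Made-up values standing in for the lesson's three_takers data
source = ColumnDataSource(data={
    'play3PA': [120, 250, 400],
    'pct3PM': [0.33, 0.38, 0.41],
})

fig = figure(tools=['box_select', 'lasso_select', 'tap', 'reset'])

# scatter with marker='square' is used here in place of fig.square
fig.scatter(x='play3PA', y='pct3PM', source=source, marker='square',
            color='royalblue',
            selection_color='deepskyblue',
            nonselection_color='lightgray',
            nonselection_alpha=0.3)

# No rows are selected until a tool (or code) sets the indices
print(source.selected.indices)

# Pre-select the first and last rows, as if they had been tapped
source.selected.indices = [0, 2]
print(source.selected.indices)
```

When this figure is shown, the two pre-selected squares would be drawn with the selection color and the remaining one with the muted nonselection styling.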
Dipanwita Mallick on Feb. 26, 2021
fig.yaxis[0].formatter = NumeralTickFormatter(format='00.0%')
Why is yaxis[0] used here instead of just yaxis?
rbtmaldonado on June 16, 2023
Just a quick note, code as is will display TypeError due to .sum call on datetime64 type.
TypeError: datetime64 type does not support sum operations
One workaround can be to add numeric only into .sum method:
three_takers = (three_takers.groupby('name')
.sum(numeric_only=True)...)
Pygator on Aug. 18, 2019
This set of video tutorials are great! I can already dream up some use cases. Some video series about some image manipulation packages would be great.