Mapping and Analyzing a Data Set in Pandas
During your analysis, you may come across columns that are not in the correct format for further evaluation.
In this lesson, you’ll learn how to convert a column into a different data type and format by using the .map() method.
00:00 What we’re going to do next is calculate the number of field goals made and attempted per minute. To do that, we need a subset of our larger dataset, so let’s slice that out to make it easier to work with.
00:11 What we need to do is go data, and within that we’re going to have another list, which will allow us to slice multiple columns from the original DataFrame and get back a new DataFrame.
00:24 So we’re going to select 'MP' (minutes played), 'FG' (field goals), and 'FGA' (field goals attempted). And then we’re going to ask for the types of those columns, because we’re going to need to work with that data, so we need to know what’s actually contained in those columns.
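As a rough sketch of this step, selecting several columns by passing a list of names and then checking their types might look like the following. The small DataFrame is an illustrative stand-in for the lesson's data, which is loaded from a CSV file, and the 'PTS' column with its values is made up purely to show that unselected columns are left out.

```python
import pandas as pd

# Illustrative stand-in for the lesson's DataFrame (the real one comes from a CSV).
data = pd.DataFrame({
    "MP": ["42:41", "42:05", "29:09"],   # minutes played as "<minutes>:<seconds>" strings
    "FG": [7, 7, 4],                     # field goals made
    "FGA": [14, 17, 11],                 # field goals attempted
    "PTS": [23, 26, 14],                 # hypothetical extra column, values made up
})

# A list inside the square brackets selects several columns and returns a new DataFrame.
temp = data[["MP", "FG", "FGA"]]

# Show the type pandas assigned to each selected column.
print(temp.dtypes)
```

Here 'MP' should be reported as a string-like type rather than a number, while 'FG' and 'FGA' should be integers.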
00:40 So as you can see, our minutes played is an object, which is not something that we can do division with, so we need to figure out a way to convert this column of data into something manageable.
00:54 But our field goals attempted and field goals are both integers, which are easy to deal with. So what we’re going to need to do is change this minutes played column into something that’s more useful—maybe seconds might be useful.
01:08 So what we need to do is take this column of data, manipulate it, and then put it back into the dataset. This is called mapping, which we’re going to do in Python. But we first need to define our function.
01:25 We’re going to define a function that converts a string "<minutes>:<seconds>" into total seconds. So we’re going to need to import the time module and the datetime module. Once we have that, we’re going to define the str_to_seconds() (string to seconds) function, and we’re going to pass in minutes, which represents a value from the minutes played column.
01:51 Just to be sure, we’re going to convert minutes to a string once it comes in, and reassign it here. You never know what you might get from different datasets, so you need to be consistent. So we have that, and next we’re going to parse the time from it, so we’re going to go minutes = time.strptime(),
02:18 and we’re going to pass in minutes here, followed by the format string '%M:%S'.
02:30 That’ll give us a struct_time representation of our value, and then we’re going to convert it to seconds. So we’re going to go datetime.timedelta(), and we’re going to pass in minutes=minutes.tm_min,
02:51 and our seconds is going to be minutes.tm_sec.
03:00 And then we’re going to call .total_seconds() on that returned timedelta, and that’ll give us the number of seconds in total.
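Put together, the function described above might look roughly like this. It is a sketch reconstructed from the narration; the local name parsed is an assumption used here to avoid reusing minutes for two different things.

```python
import datetime
import time


def str_to_seconds(minutes):
    """Convert a "<minutes>:<seconds>" string into total seconds."""
    # Coerce to str first, since incoming values may not always be strings.
    minutes = str(minutes)
    # Parse the text into a time.struct_time using the minutes:seconds format.
    parsed = time.strptime(minutes, "%M:%S")
    # Build a timedelta from the parsed parts and ask for its length in seconds.
    return datetime.timedelta(
        minutes=parsed.tm_min, seconds=parsed.tm_sec
    ).total_seconds()


print(str_to_seconds("40:00"))  # 2400.0
```

One thing to keep in mind is that %M only accepts values from 0 to 59, so this approach would reject a string such as "65:00".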
03:10 Let’s quickly test out our function. We’re going to call str_to_seconds(), we’re going to pass it in 40 minutes,
03:21 and it should return to us 2400, if my math is correct. Nope, we have an issue here. minute is not defined. minutes is what we wanted there. There’s another typo here.
03:43 Perfect. So, we know that our function that converts that string to seconds works, so what we’re going to do next is map that function.
03:53 Let’s call this DataFrame temp.
03:58 We’re going to go temp['MP'] = temp['MP'].map(),
04:17 we’re going to pass in our str_to_seconds function. And then we’re going to print out temp.head(), and as you can see, we’ve converted all of our minutes played to seconds.
04:32 So if we were to go .dtypes here, what we should get is float, integer, integer, which is perfect. So now we’re able to work with our dataset to calculate field goals attempted and field goals scored per minute.
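The mapping step can be sketched as follows. The three rows are an illustrative sample rather than the full dataset, and temp is built directly here instead of being sliced from a larger DataFrame.

```python
import datetime
import time

import pandas as pd


def str_to_seconds(minutes):
    """Convert a "<minutes>:<seconds>" string into total seconds."""
    parsed = time.strptime(str(minutes), "%M:%S")
    return datetime.timedelta(
        minutes=parsed.tm_min, seconds=parsed.tm_sec
    ).total_seconds()


# Illustrative sample rows standing in for the lesson's subset.
temp = pd.DataFrame({
    "MP": ["42:41", "42:05", "29:09"],
    "FG": [7, 7, 4],
    "FGA": [14, 17, 11],
})

# Pass the function itself (no parentheses) so .map() can call it per value.
temp["MP"] = temp["MP"].map(str_to_seconds)

print(temp.head())
print(temp.dtypes)
```

After the assignment, 'MP' holds floats, so it can take part in arithmetic with the integer columns.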
04:52 I just realized I glossed over an important concept here. What’s actually happening is that we’re taking the minutes played column, which stores one value per row, and we’re calling .map() on it.
05:04 And what .map() does is take each value from that column, pass it to this function, which calculates the seconds, and then assign the result to that same position in the temp DataFrame that we just created. So it’ll take each value one by one. If we just go temp.head() here for a second, you’ll notice that it’ll take this one particular value, do the calculation, and assign it back to this position in the DataFrame.
05:33 So as you see, when we look at the .head() here, we were doing the operation on these one by one. I need to run this again to do that. So, one by one, it’ll take the value that’s here, run the calculation, assign it back here. Run the calculation, assign it back here. We can do the same kind of thing when we calculate our field goals attempted per minute.
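To make the one-by-one behavior concrete, here is a small sketch comparing .map() with an explicit loop over the same values. The to_seconds() helper is a hypothetical, simplified stand-in for the lesson's str_to_seconds(), and the values are an illustrative sample.

```python
import pandas as pd


def to_seconds(value):
    """Hypothetical stand-in for str_to_seconds(): "MM:SS" -> seconds as a float."""
    minutes, seconds = str(value).split(":")
    return int(minutes) * 60.0 + int(seconds)


mp = pd.Series(["42:41", "42:05", "29:09"])

# .map() calls the function once per element and collects the results.
mapped = mp.map(to_seconds)

# The same thing written as an explicit loop over the values.
looped = pd.Series([to_seconds(value) for value in mp], index=mp.index)

print(mapped.equals(looped))  # True
```

Both versions produce the same Series, which is why assigning the result of .map() back to the column replaces every value with its converted form.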
05:59 Before I show you how to create new columns with new data points, I simply rewrote our previous function to divide its result by 60 so that we have our minutes played as a float, which we can then simply divide by to get our field goals per minute and field goals attempted per minute.
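The rewritten function is not shown in the transcript, so the following is only a guess at its shape. The name str_to_minutes is an assumption, chosen here to keep it distinct from the seconds version.

```python
import datetime
import time


def str_to_minutes(minutes):
    """Convert a "<minutes>:<seconds>" string into minutes as a float."""
    parsed = time.strptime(str(minutes), "%M:%S")
    total_seconds = datetime.timedelta(
        minutes=parsed.tm_min, seconds=parsed.tm_sec
    ).total_seconds()
    # Dividing by 60 turns seconds into fractional minutes.
    return total_seconds / 60


print(str_to_minutes("42:30"))  # 42.5
```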
06:14 So, we’re going to create a new value. We’re going to take our temp and we’re going to assign a new column called 'FGA/M' (field goals attempted per minute).
06:24 In order to calculate values for that, we’re going to go temp['FGA']
06:34 divided by temp['MP']. That’ll give us a value for the number of field goals attempted per minute for Kevin Durant. So as you can see here, we were able to quickly take field goals attempted and divide it by that value in order to give us field goals attempted per minute.
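A minimal sketch of creating the new column, assuming 'MP' has already been converted to minutes as a float. The rows are an illustrative sample.

```python
import pandas as pd

# Illustrative sample with MP already expressed in minutes (seconds / 60).
temp = pd.DataFrame({
    "MP": [2561 / 60, 2525 / 60, 1749 / 60],
    "FG": [7, 7, 4],
    "FGA": [14, 17, 11],
})

# Dividing one column by another works row by row and yields a new Series,
# which is stored under a column name that did not exist before.
temp["FGA/M"] = temp["FGA"] / temp["MP"]

print(temp)
```

Assigning to a column name that is not yet in the DataFrame is what creates the column.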
06:56 So now we know that Kevin Durant attempts 0.44 field goals per minute in this game, which gives us a value to compare against an average. And if we’re really interested, we can go temp.describe() and see how that compares with the average across the whole dataset.
07:18 So for example, for field goals attempted per minute, the mean is 0.45. So he attempts roughly half a field goal every minute in-game, which is interesting considering that we know how many times he takes a shot and how many he scores.
07:32 So what we could do also after that is do something else—we can do field goals per minute scored, and that is simply just a variation on this same calculation,
07:47 simply substituting this out.
07:52 Field goals, and then we can go ahead and .describe() that. And as you can see here, out of 81 he’s scoring about 0.23 field goals per minute over the entire 2012-2013 season, which is some valuable information, especially if you’re interested in how many shots a person takes, how frequently they take shots, and how their attempts versus their actual scoring vary.
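The last two steps can be sketched together as follows. The column name 'FG/M' is an assumption, since the transcript does not say what the new column is called, and the rows are an illustrative sample, so the numbers will not match the season figures quoted above.

```python
import pandas as pd

# Illustrative sample with MP already expressed in minutes.
temp = pd.DataFrame({
    "MP": [2561 / 60, 2525 / 60, 1749 / 60],
    "FG": [7, 7, 4],
    "FGA": [14, 17, 11],
})

# Same calculation as for attempts, with the made field goals substituted in.
temp["FG/M"] = temp["FG"] / temp["MP"]

# .describe() returns count, mean, std, min, quartiles, and max per numeric column.
summary = temp.describe()
print(summary.loc[["count", "mean"], ["FG/M"]])
```

The count row tells you how many rows went into each statistic, and the mean row is the per-minute average being discussed.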
Orlando Viera on Dec. 22, 2019
For Python 3.7 use the following line: temp.MP = temp.loc[:]['MP'].map(str_to_seconds)
Orlando Viera on Dec. 22, 2019
You can also use temp.loc[:]['MP'] = temp.loc[:]['MP'].map(str_to_seconds)
Kevin Walsh on Feb. 26, 2020
I had the same issue. When I ran the dtype command, all of the items came back as object. I tried to go to the CSV file and change to number format, but no luck. With the attempted field goals and made field goals as object, I can’t do the follow-on math or comparison. Is there a way to ensure that when you are importing CSV files into a DataFrame, you can get the information as integers or floats instead of objects?
Ricky White RP Team on Feb. 27, 2020
Hi Kevin. I’m no Pandas expert, so I asked one to help answer your question. You can use the converters parameter to specify the type when loading your data from a CSV. Here are the docs: pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
myPyTeck on March 12, 2020
Thanks for the hint on converters Ricky! I was stuck with some <class 'str'> in the MP column. Once I had this str_to_seconds function defined, adding an option to read_csv finally did the job for me.
data = pd.read_csv('kevin.csv', names=columns, skiprows=1, converters={'MP':str_to_seconds})
I used skiprows because there were column names in the first line in my csv file.
Ranit Pradhan on April 3, 2020
In Python 3 when I’m using
temp.MP = temp.loc[:]['MP'].map(str_to_seconds)
a ValueError occurs:
ValueError: time data 'Inactive' does not match format '%M:%S'
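One possible way around values such as 'Inactive' is to catch the parsing error and return a missing value instead. This is only a sketch, not something shown in the lesson, and the name str_to_seconds_or_nan is hypothetical.

```python
import datetime
import math
import time


def str_to_seconds_or_nan(value):
    """Like str_to_seconds(), but returns NaN when the text is not "MM:SS"."""
    try:
        parsed = time.strptime(str(value), "%M:%S")
    except ValueError:
        # Covers entries such as 'Inactive' that do not match the format.
        return math.nan
    return datetime.timedelta(
        minutes=parsed.tm_min, seconds=parsed.tm_sec
    ).total_seconds()


print(str_to_seconds_or_nan("40:00"))     # 2400.0
print(str_to_seconds_or_nan("Inactive"))  # nan
```

Rows that come back as NaN could then be dropped or inspected before doing any per-minute math.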
zbigniewsztobryn on April 26, 2020
I have a question about this line:
datetime.timedelta(minutes=minutes.tm_min, seconds=minutes.tm_sec).total_seconds()
I completely don’t get how the variable minutes.tm_min gets its data from the previous line. Is it predefined somehow?
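For anyone with the same question, a short sketch may help: time.strptime() returns a time.struct_time object, and tm_min and tm_sec are attributes of that object. The previous line of the function rebinds the name minutes to it, which is why minutes.tm_min works afterwards. The sample string is illustrative.

```python
import time

# strptime() parses the text and returns a struct_time object.
parsed = time.strptime("42:41", "%M:%S")

print(type(parsed).__name__)         # struct_time
# tm_min and tm_sec are attributes filled in from the parsed string.
print(parsed.tm_min, parsed.tm_sec)  # 42 41
```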
piotrjubkepka on April 26, 2020
Hi, my problem is that every column is object. Does anybody know how to fix it?
MP object
FG object
FGA object
dtype: object
glaucorolim on May 11, 2020
I was getting an object type for all variables as well, and I simply deleted the last row of kevin.csv (a lot of strange values…). All is fine now. Best!
bwaspe on May 26, 2020
Agree - delete last row of csv with “inactive” and row 0 if you have column names and you will get int64.
bwaspe on May 26, 2020
I also get this error running line 49:
ValueError: time data '2439.0' does not match format '%M:%S'
brasstrumpetman on April 24, 2022
Referencing around the 4:30 time mark of the tutorial I have typed the following into a jupyter notebook to change the MP values
import time
import datetime
def str_to_seconds(minutes):
    minutes = str(minutes)
    minutes = time.strptime(minutes, '%M:%S')
    return datetime.timedelta(minutes=minutes.tm_min, seconds=minutes.tm_sec).total_seconds()

print(str_to_seconds('40:00'))
temp['MP'] = temp['MP'].map(str_to_seconds)
temp.head()
and see the following output:
2400.0
/var/folders/jp/kkq2q30s39g9m4jnkqk5tgq40000gn/T/ipykernel_32679/3544599389.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  temp['MP'] = temp['MP'].map(str_to_seconds)

       MP  FG  FGA
0  2561.0   7   14
1  2525.0   7   17
2  1749.0   4   11
3  2303.0  11   19
4  2280.0   9   16
End of output
The function works (shown by the first output line) but the dataset does not change.
I am pretty confident I have an exact copy of what Mahdi has shown. The only thing I have noticed is that when Mahdi uses the map function, it appears green in his notebook (shows 'map' is a function). In my Jupyter notebook, the map function is plain black text.
Some help would be appreciated from the Real Python crew.
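For what it's worth, the output above does show 'MP' in seconds, so the assignment appears to have taken effect; the warning is about temp having been sliced from another DataFrame. One common way to avoid it is to take an explicit copy when creating the subset. The sketch below uses an illustrative sample and a hypothetical, simplified to_seconds() helper in place of str_to_seconds().

```python
import pandas as pd


def to_seconds(value):
    """Hypothetical stand-in for str_to_seconds(): "MM:SS" -> seconds as a float."""
    minutes, seconds = str(value).split(":")
    return int(minutes) * 60.0 + int(seconds)


# Illustrative sample standing in for the full dataset.
data = pd.DataFrame({
    "MP": ["42:41", "42:05", "29:09"],
    "FG": [7, 7, 4],
    "FGA": [14, 17, 11],
})

# .copy() makes temp an independent DataFrame rather than a slice of data.
temp = data[["MP", "FG", "FGA"]].copy()
temp["MP"] = temp["MP"].map(to_seconds)

print(temp.head())
```

With the copy in place, changes to temp leave the original data untouched.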
Donna van Wyk on Nov. 16, 2019
All 4 fields are showing as object now? I downloaded it as a CSV. FG and FGA are int64 in your tutorial, though.