Loading Your Dataset

00:00 Let’s load up a dataset. Here’s the URL for a CSV, or comma-separated values, file containing basketball data from the website FiveThirtyEight. You can use another package, requests, to download that file.

00:17 requests is a third-party package built on top of urllib3. It makes networking tasks with HTTP much easier than working with the standard library’s urllib directly.

00:27 In fact, the author calls it “HTTP for Humans.” If you’ve installed Anaconda, requests is included in the default environment, and if not, you can install it with pip.

00:40 First, import requests.

00:44 Then call the get() function and pass it the download_url. Store the response. Check the .status_code of the response. If it is 200, then everything should be good to go.

00:57 Open a file and write the .content of the response to it. Now the contents of the file are stored locally. Excellent! It’s time to load the CSV file into Pandas. Go ahead and import pandas.
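
The download steps described above might look something like the following sketch. The URL is the raw-file address suggested in a reader comment on this lesson, and the file name nba_all_elo.csv matches the one used in the comments. The helper name download_file and the 30-second timeout are assumptions made for illustration; they are not shown in the video.

```python
import requests

# Raw-file URL suggested in a reader comment on this lesson (an assumption here;
# the video takes its URL from the lesson description).
download_url = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/"
    "master/nba-elo/nbaallelo.csv"
)
target_csv_path = "nba_all_elo.csv"


def download_file(url, path):
    """Download url, save the body to path, and return the HTTP status code."""
    response = requests.get(url, timeout=30)
    # Raise an exception for 4xx/5xx responses instead of saving an error page.
    response.raise_for_status()
    # .content holds the raw bytes of the response, so write in binary mode.
    with open(path, "wb") as f:
        f.write(response.content)
    return response.status_code
```

Calling download_file(download_url, target_csv_path) should return 200 when the download succeeds.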

01:13 Notice that the pandas package is aliased as pd. This is not a requirement but it is often how pandas is imported.

01:21 You’ll be making significant use of the pandas package and while shortening the package name by four letters might not seem like a lot right now, over time, it will reduce the amount that you need to type.

01:33 Now the dataset can be loaded from the CSV file. Use the function read_csv() and pass it the path of the CSV file. Look at the type of nba.

01:47 So, what is this DataFrame? You’ll learn more about it later in the course but for now, think of a DataFrame as a way to store tabular data—that is, rows and columns. In fact, you can see how many rows are in the DataFrame by getting its length,

02:05 and you can see there are 126314 rows. The rows and columns can be found in the .shape of nba.

02:16 The .shape attribute is a tuple. The first value is the number of rows and the second value is the number of columns. This means there are 23 columns in the dataset.
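
As a rough illustration of read_csv(), len(), and .shape, the sketch below loads the header and the three sample rows quoted in a reader comment on this lesson from an in-memory string instead of the downloaded file. Reading from io.StringIO is an assumption made so the example runs without the full dataset, which is why it reports 3 rows and not 126314.

```python
import io

import pandas as pd

# Header and first three rows of the dataset, as quoted in a reader comment.
sample_csv = (
    "gameorder,game_id,lg_id,_iscopy,year_id,date_game,seasongame,is_playoffs,"
    "team_id,fran_id,pts,elo_i,elo_n,win_equiv,opp_id,opp_fran,opp_pts,"
    "opp_elo_i,opp_elo_n,game_location,game_result,forecast,notes\n"
    "1,194611010TRH,NBA,0,1947,11/1/1946,1,0,TRH,Huskies,66,1300,1293.2767,"
    "40.29483,NYK,Knicks,68,1300,1306.7233,H,L,0.64006501,\n"
    "1,194611010TRH,NBA,1,1947,11/1/1946,1,0,NYK,Knicks,68,1300,1306.7233,"
    "41.70517,TRH,Huskies,66,1300,1293.2767,A,W,0.35993499,\n"
    "2,194611020CHS,NBA,0,1947,11/2/1946,1,0,CHS,Stags,63,1300,1309.6521,"
    "42.012257,NYK,Knicks,47,1306.7233,1297.0712,H,W,0.63110125,\n"
)

# read_csv() accepts a path or any file-like object.
nba = pd.read_csv(io.StringIO(sample_csv))

print(type(nba))   # <class 'pandas.core.frame.DataFrame'>
print(len(nba))    # 3 rows in this sample
print(nba.shape)   # (3, 23): rows first, then columns
```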

02:27 To see the first five rows, get the head of nba. If you wanted to see 10 rows, you could pass 10 to .head(). The default number is 5.
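
A minimal sketch of .head() with and without an argument. The small DataFrame here is a hypothetical stand-in for nba, built only so the example is self-contained.

```python
import pandas as pd

# Hypothetical stand-in for nba: 20 rows, two columns.
nba = pd.DataFrame({"gameorder": range(1, 21), "pts": range(60, 80)})

first_five = nba.head()    # no argument: the default of 5 rows
first_ten = nba.head(10)   # ask for 10 rows explicitly

print(len(first_five), len(first_ten))  # 5 10
```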

02:38 And you can see one of the benefits of using Jupyter Notebook with Pandas. The Notebook is displayed in a webpage and it takes advantage of rich formatting using HTML, CSS, and in some cases, interactivity with JavaScript.

02:54 The column names are bold and the rows are zebra-striped so they’re easier to distinguish. But where did the column names come from? Go back to the tab with the directory listing. You should see the CSV file.

03:08 Click on it to open it. Notice that the first row of the file contains the column names, also referred to as the header row. By default, the read_csv() function will assume the first row of the CSV file to be the column names.
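
To see the effect of the header row, the sketch below reads the same tiny CSV text twice: once with the default behavior and once with header=None, which tells read_csv() that the file has no header row. The two-column CSV text is a made-up miniature for illustration.

```python
import io

import pandas as pd

# Miniature CSV: one header row followed by two data rows.
csv_text = "team_id,pts\nTRH,66\nNYK,68\n"

# Default behavior: the first row supplies the column names.
with_header = pd.read_csv(io.StringIO(csv_text))
print(list(with_header.columns))  # ['team_id', 'pts']

# header=None: the first row is treated as data and columns are numbered.
without_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(without_header.columns))  # [0, 1]

print(len(with_header), len(without_header))  # 2 3
```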

03:26 Something else interesting about this DataFrame is that not all of the columns are displayed. The columns in the middle have been omitted and an ellipsis is used as a placeholder to save space.

03:38 You can force Pandas to show all of the columns by setting the maximum number of columns. Also, notice that some of the numeric columns are showing up with six decimal places. Fix the number of decimal places to two with this option.
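
The transcript does not spell out the option names. One way to get the behavior described is with pd.set_option() and the display.max_columns and display.precision options, as sketched below. Treat the exact calls as an assumption about what appears on screen in the video. These options only change how values are displayed; the stored data keeps its full precision.

```python
import pandas as pd

# None removes the limit, so every column is displayed.
pd.set_option("display.max_columns", None)

# Display floating-point values with two decimal places.
pd.set_option("display.precision", 2)

# Small made-up frame to check the formatting.
df = pd.DataFrame({"elo_n": [1293.2767, 1306.7233]})
print(df)
```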

03:57 Now get the last five rows of the DataFrame with the .tail() function. You can see Pandas has applied the formatting. Also, you can get a specific number of rows using .tail(), the same as with .head(). To get the last 10 rows, pass the value 10 to the function .tail().
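
A minimal sketch of .tail(), again using a hypothetical stand-in for nba so the example runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for nba: 20 rows, two columns.
nba = pd.DataFrame({"gameorder": range(1, 21), "pts": range(60, 80)})

last_five = nba.tail()    # no argument: the default of 5 rows
last_ten = nba.tail(10)   # ask for 10 rows explicitly

print(last_five["gameorder"].tolist())  # [16, 17, 18, 19, 20]
print(len(last_ten))                    # 10
```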

04:16 In the next lesson, you’ll start to explore your data using the statistics methods supplied by the DataFrame.

markcerv on June 6, 2021

You might want to check that the file you have downloaded is actually a CSV file with data in it. The first lines of the file SHOULD look like:

gameorder,game_id,lg_id,_iscopy,year_id,date_game,seasongame,is_playoffs,team_id,fran_id,pts,elo_i,elo_n,win_equiv,opp_id,opp_fran,opp_pts,opp_elo_i,opp_elo_n,game_location,game_result,forecast,notes
1,194611010TRH,NBA,0,1947,11/1/1946,1,0,TRH,Huskies,66,1300,1293.2767,40.29483,NYK,Knicks,68,1300,1306.7233,H,L,0.64006501,
1,194611010TRH,NBA,1,1947,11/1/1946,1,0,NYK,Knicks,68,1300,1306.7233,41.70517,TRH,Huskies,66,1300,1293.2767,A,W,0.35993499,
2,194611020CHS,NBA,0,1947,11/2/1946,1,0,CHS,Stags,63,1300,1309.6521,42.012257,NYK,Knicks,47,1306.7233,1297.0712,H,W,0.63110125,
...

When I first tried it, I ended up with HTML content from GitHub telling me, “(Sorry about that, but we can’t show files that are this big right now.)”, so when I tried to run the command

nba = pd.read_csv('nba_all_elo.csv')

I ended up getting an error inside Pandas:

ParserError: Error tokenizing data. C error: Expected 1 fields in line 80, saw 2

How did I solve this problem? I went to my browser to view the CSV file, got to viewing the actual data, and then saved that data/file as nba_all_elo.csv
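
The check described in this comment could also be done in code before calling read_csv(). The sketch below is one possible approach, not something shown in the lesson: the helper name looks_like_expected_csv is hypothetical, and it simply tests whether the first line of the file starts with the first column name from the header quoted above.

```python
def looks_like_expected_csv(path, expected_first_column="gameorder"):
    """Return True if the file's first line starts with the expected column name."""
    # errors="replace" keeps the check from failing on unexpected bytes.
    with open(path, encoding="utf-8", errors="replace") as f:
        first_line = f.readline().strip()
    return first_line.startswith(expected_first_column + ",")
```

For example, looks_like_expected_csv("nba_all_elo.csv") should return False if the saved file is an HTML page and not the CSV data.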

Nick on Sept. 21, 2021

Am I looking in the wrong place or did you not provide the URL for the website in the transcript?

Geir Arne Hjelle RP Team on Sept. 21, 2021

The URL is available on the Description tab.

sarwoo on Nov. 30, 2021

The URL in the description for the NBA data may download the HTML page, in which case try the following:

download_url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv'

pnmcdos on April 6, 2022

Yikes. Well, I get this error when trying to verify the pandas version. I’m assuming I need to correct a path or directory, but I’m clueless on how to do it.

AttributeError                            Traceback (most recent call last)
C:\Users\PNMCDO~1\AppData\Local\Temp/ipykernel_10256/116837967.py in <module>
----> 1 print(pd._version_)

~\Anaconda3\lib\site-packages\pandas\__init__.py in __getattr__(name)
    242         return _SparseArray
    243
--> 244     raise AttributeError(f"module 'pandas' has no attribute '{name}'")
    245
    246

AttributeError: module 'pandas' has no attribute '_version_'

pnmcdos on April 6, 2022

OK, I realized my problem! It should be pd.__version__, but I had pd._version_. I only typed one _ on each side when it needed two __. Now on to the next question: if we loaded pandas as pd and verified the version, why is it necessary to load it again?

Douglas Starnes on April 6, 2022

@pnmcdos It’s not necessary to load it again. The duplicate line was made by mistake. You can disregard it.

fuelyou on April 9, 2023

@markcerv Thank you! I just stumbled on the same error and solved it as you said.

nktakumi on July 26, 2023

So I can see the CSV when accessing it in my browser, but with the requests library I’m getting back a JSON response, see image

Does anyone know how to get the CSV with requests from GitHub as it works today?

nktakumi on July 26, 2023

Following up the previous message, it seems to work when I add in the same “accept” header value from my browser in the request. So:

headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8"}

response = requests.get(download_url, headers=headers)

arthur55 on Jan. 7, 2024

Hi there, apologies if the answer is obvious, but why did you need to import pandas twice (in line 1 then line 9)? Did the scope change in between?

arthur55 on Jan. 7, 2024

Oops, I think someone else already asked this. My mistake.

nazamrahi on Feb. 25, 2024

When I try to run:

!dir

I get the following error:

PermissionError: [WinError 5] Access is denied

I’m running the notebook as administrator. Still the same problem. Can you help me out, please?

Evan Davies on April 11, 2024

I’m typing print(pd._version_) but I’m getting the below error.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 print(pd._version_)

AttributeError: module 'pandas' has no attribute '_version_'

Evan Davies on April 11, 2024

I did put the underscores either side of version but for some reason your comment box won’t let me add them and changes version to italics.

Bartosz Zaczyński RP Team on April 12, 2024

@Evan Davies The underscores get interpreted as italics since the comment boxes support the Markdown syntax, which helps embed code snippets. I fixed your comment, but it looks like you need to use double underscores (dunder) instead of single ones:

>>> import pandas as pd

>>> pd.__version__
'2.2.2'

>>> pd._version_
Traceback (most recent call last):
  ...
AttributeError: module 'pandas' has no attribute '_version_'. Did you mean: '__version__'?

Jeff S on May 1, 2024

I had the same error first mentioned by @markcerv. I found that changing “blob” in the original URL to “raw” sorted things out. The url should be github.com/fivethirtyeight/data/raw/master/nba-elo/nbaallelo.csv

ivan.dimitri on Oct. 4, 2024

To solve the problem where “year_id” differs from the year in “date_game”:

nba["date_game"] = pd.to_datetime(nba["date_game"])  
nba["date_game"] = nba["date_game"].astype(str)

def DateRepair(row):
    # date_game is a "YYYY-MM-DD" string here, so the first four characters are the year.
    year_from_date = row["date_game"][0:4]
    if year_from_date != str(row["year_id"]):
        return year_from_date
    else:
        return row["year_id"]
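
The snippet above defines the function but never applies it to the DataFrame. As a hedged alternative, the comparison can be sketched without a row-by-row function by using pandas datetime accessors. The two-row DataFrame is modeled on the sample rows quoted in an earlier comment, and the column name year_repaired is hypothetical. Note that those sample rows pair year_id 1947 with game dates in 1946, so year_id may label a season and not the calendar year. That is an assumption worth checking before overwriting anything.

```python
import pandas as pd

# Two rows modeled on the sample data quoted in an earlier comment.
nba = pd.DataFrame(
    {
        "year_id": [1947, 1947],
        "date_game": ["11/1/1946", "11/2/1946"],
    }
)

# Parse the month/day/year strings and pull out the calendar year.
calendar_year = pd.to_datetime(nba["date_game"], format="%m/%d/%Y").dt.year

# Boolean mask: True where year_id differs from the calendar year of the game.
mismatch = calendar_year != nba["year_id"]
print(mismatch.sum())  # 2

# Keep year_id where it already matches, otherwise use the calendar year.
# The result goes into a new column so the original year_id is preserved.
nba["year_repaired"] = nba["year_id"].where(~mismatch, calendar_year)
```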