Reading CSVs With Pandas
This lesson covers a couple different ways to import CSV data into the third party Pandas library. In this video, you’ll learn how to install pandas using pip and, how to use it to read CSV files.
Here’s the example CSV file you’ll be using (hrdata.csv
):
Name,Hire Date,Salary,Sick Days remaining
Graham Chapman,03/15/14,50000.00,10
John Cleese,06/01/15,65000.00,8
Eric Idle,05/12/14,45000.00,10
Terry Jones,11/01/13,70000.00,3
Terry Gilliam,08/12/14,48000.00,7
Michael Palin,05/23/13,66000.00,8
The following example shows how to read a CSV file and print out its contents using pandas:
import pandas as pd
data_frame = pd.read_csv('hrdata.csv')
print(data_frame)
In addition to learning how to read CSV files and printing their contents, you will see how to use pandas to modify the index on the files you read, parse dates and also how to add headers to CSV files without one.
00:00 One way to work with CSVs in Python is to use the data analysis library Pandas, short for panel data. It’s sometimes referred to as the Excel of Python as it stores data in DataFrames, which can be thought of as Excel spreadsheets.
00:13
Before you can do anything with Pandas, you need to install it, so go to your terminal and put in pip install pandas
,
00:21 and if you don’t already have it, that should go ahead and get it for you.
00:26
So now in your editor, you can import pandas as pd
, and let’s take a look at the data that we’re going to be working with. Over here, I have a file called hrdata.csv
, and it just has the names, hire dates, salary, and sick days remaining for a number of employees. To load this into pandas
, just go back, create a DataFrame
that you can just call df
, set that equal to pd.read_csv()
, pass in the filename, 'hrdata.csv'
, and you can print that out just by calling a print()
on the DataFrame
.
01:05 Just try running that, and there you go! You can see that everything imported. Looks like there was an issue here. Let’s go back to the CSV, and it looks like I put a period instead of a comma there. So we’ll save that. And just to be safe, let’s rerun it.
01:23
There we go. So, pandas
went ahead and looked at the first row so it knew what the header titles were, and it took all of the numerical data and turned those into numbers.
01:32
There are two things that we could improve here. If you’ll notice, the index is just a zero index, it’s pretty arbitrary. And maybe we want to use the Name
column to be the index for this data.
01:44
So, this is pretty straightforward to fix. Just go back here when you read the CSV and add in a parameter called index_col
and just set this equal to 'Name'
, just like that. Alrighty. Let’s rerun that.
02:00
Cool. And now you can see the Name
has kind of been brought over further, and the data actually starts with the Hire Date
, Salary
, and Sick Days Remaining
.
02:08
The other problem is that even though the Salary
and Sick Days Remaining
were converted to numerical data, the Hire Date
is still stored as a string here.
02:17
And you can see this if you do a print(type())
on the df
—we’ll just pull out the 'Hire Date'
column
02:26 and index the first item off of there.
02:30
And I typed 'Data'
instead of 'Date'
. So, I’ll save that. There we go. And you can see that we have a string here for the date.
02:40
Fortunately, pandas
has us covered here as well. We just have to pass in another parameter, we can say parse_dates
—and because you may have multiples in here, we’ll pass in a list and just say ['Hire Date']
. Okay, let’s try to rerun that.
03:00
And you can see the formatting change—now these are dashes instead of the slashes, and it’s year-month-day, so that’s good. And our type here now is a pandas
Timestamp
.
03:10 Now, let’s say the CSV did not have that header row. You can go ahead and add that when you read in the CSV, and you just have to make a couple changes here—so, I’ll actually bring these down
03:22
to make this a little easier to read. Because this one already has header information, you can pass in header=0
to ignore it, and we’ll add our own in. And just say names
and we’ll pass in a list that’ll just be ['Employee', 'Hired', 'Salary', 'Sick Days']
.
03:53
Now, because these will become the new headers, we need to change these to make sure they match up. So 'Name'
is now 'Employee'
, and 'Hire Date'
is now 'Hired'
.
04:03 So, save that—oh, we would probably change this one too. Actually, let’s just get rid of that line. Save that, rerun it. Then you can see you have the new header information here. And that’s it!
04:17
That’s a couple different ways to import CSV data into pandas
. In the next video, we’ll talk about how to write CSVs using pandas
.
jamesbrown68 on July 27, 2020
At 1:59, after “index_col=’Name’” was added, I noticed that in the printed output, ‘Name’ was on a separate line from the rest of the column headers. Obviously this is an effect of the column ‘Name’ being used as the index, but it breaks the output in a small manner.
sweir12525 on Sept. 23, 2020
Can you please tell us what project terminal you are using. It isn’t Thonny or PyCharm. This is the second time I requested this. Cannot follow your lesson as none of those programs can grab a csv file directly. They have to be read off the c drive of my computer.
Bartosz Zaczyński RP Team on Sept. 24, 2020
A .csv
file is just a plain text file formatted according to a few rules. You can use pretty much any text editor, from Notepad to PyCharm, to open it. Many editors and IDEs come with plugins that can display such files with syntax highlighting or even in tabular form.
Ghani on Oct. 31, 2020
Thanks for this nice course!
tonypy on March 20, 2023
Is there any benefit in parsing the data using
df[“Hire Date”] = pd.to_datetime(df[“Hire Date”], format='%m/%d/%y')
as opposed to
df = pd.read_csv(csv, parse_dates=[“Hire Date”])
tonypy on March 20, 2023
Using
df[“Hire Date”] = pd.to_datetime(df[“Hire Date”]).dt.strftime('%d-%b-%Y')
allows the date data to be formatted as desired (i.e. dd-mmm-yyyy) but this means going from a string to datetime64 and back to a string which seems nonsensical.
Is there a different / better approach to achieve a different format in the displayed output e.g. dd-mmm-yyyyy, whilst keeping the internal representation as datetime64?
I looked at the option of
df.style.format({"Hire Date": lambda t: t.strftime("%d-%b-%Y")})
but I ended up with <pandas.io.formats.style.Styler object at 0x000001811CCD5B90> which I couldn’t find a way to display in the console.
Become a Member to join the conversation.
Gaetano Barreca on Jan. 10, 2020
I tried to installed Pandas without result till I found this:
pip3 install pandas