Working With Pandas
Here are resources for more information about working with Pandas:
00:00
In this video, we’re going to talk about pandas
. While pandas
can load and save Excel files on its own, its very structure is to use DataFrame
s for everything so if your end product isn’t a neatly-organized table, it’s probably better to use something like openpyxl
. On the other hand, pandas
allows you to perform many operations on your data, so it may seem like you’re stuck having to choose between which library to use.
00:25
Don’t worry! openpyxl
has support to convert a DataFrame
into a worksheet or to convert a worksheet into a DataFrame
. To try this out, go ahead and install pandas
. So, pip install pandas
…
00:40
and then make a new Python script. I’m just going to call mine excel_df.py
.
00:48
In here, create a small DataFrame
. So first, import pandas as pd
. Then, say data
is equal to a dictionary, which will be something like "Product Name"
is going to equal a list, ["Product 1", "Product 2"]
, and then something like "Sales Months 1"
,
01:13
which will be another list. So, something like [10, 20]
, and then "Sales Month 2"
, which I’ll put in [5, 35]
. And I’ll just get rid of that "s"
.
01:28
Okay. Now you can say df
is going to be a pandas
DataFrame
, and then just pass in data
. If you want to take a look here, you can just print out the df
.
01:45
you can see that you end up with your little DataFrame
right here. So openpyxl
has a dataframe_to_rows()
function that you can use to easily convert this DataFrame
into a format that openpyxl
can use.
01:59
Let’s go up here and add some more imports. So, from openpyxl import Workbook
and then from openpyxl.utils
(utilities) .dataframe import dataframe_to_rows
.
02:16
To use this—we’ll say convert dataframe to worksheet
—you can make a new workbook. So say workbook = Workbook()
, and then sheet
is going to be the active sheet, and then for row
in the dataframe_to_rows()
function, pass in the DataFrame
, and then you can say the index=False
because there isn’t an index that we want to use. There is a header
. And then sheet.append(row)
.
02:49
I’m going to get rid of this print()
statement and then take the workbook
and save it. And I’ll just call this "pandas.xlsx"
.
02:58 Run that, and a new workbook should pop up in your directory, and let’s see what it looks like. Open it up and look at that! Product 1, Product 2, and there’s the sales for months 1 and 2.
03:12
Just note that the first column is the column name, and there’s not an actual index there. If you had set index
to True
, the first column would be the index, which would be the 0
and the 1
from the DataFrame
earlier.
03:26
You just dropped that off by saying False
here. So there you go! That’s how you can take a DataFrame
and turn it into a worksheet.
03:34
Now, let’s go the other way around and take a worksheet and turn it into a DataFrame
. I’m just going to comment out all of this
03:44
and I’ll do some new imports: from openpyxl
import load_workbook
, because this time we’re going to take an existing workbook. So, take your workbook—this will be load_workbook()
, and this time pass in the Amazon data—"sample.xlsx"
—and then grab the active sheet.
04:06
Then, you can take the values, so just say sheet.values
, which will just grab everything that’s there, and you can literally just say df = pd.DataFrame()
and pass in values
. So to take a look at this, say print()
, and then print the df
—because there’s like a hundred rows, we’ll just say, .head()
. Run the script and look at that.
04:30
Now, unfortunately, with the zoom here, we can only see two of the columns—pandas
has hidden everything else—but if you look, you can see that there’s an index here, which is the 0
, 1
, 2
, 3
, 4
, but then the headers are also a little off and it’s just another index of 0
through 14
, while the first row has the actual column names.
04:50 If you want to use the actual headers, you can do that too. We’ll go back to the script and make a couple changes. I don’t remember which column the review ID was in but the mapping file that’s available has the index saved in it, so let’s go ahead and import that.
05:06
Now, set the first row as the headers. cols
is going to equal next()
, and then pass in data
—so it’ll grab that first row—and then data
is going to be a list and then pass in the rest of the data.
05:20
And then if you want to set the index, you can say that the index is going to equal—and then pass in a list comprehension—so it’ll be [row[REVIEW_ID] for row in data
].
05:32
This will loop through the resulting data
, and for that column where the REVIEW_ID
is, it’ll return that as a column index. Now, when you go to make your DataFrame
, you’re going to add a couple more arguments to it.
05:44
Say that index
is equal to the index that you created, and then columns
is going to equal cols
, just like that. Save this and try to run it.
05:55
And you get an error here that 'data' is not defined
,
06:00
which makes sense because it’s actually called values
. So, let’s just change this from values
and try to rerun that.
06:09
And if you change one thing, you’ve got to change everything, so now instead of passing in values
, go ahead and pass in that resulting data
,
06:18
and there we go. So now, if you take a look, you’ll only see two columns, still, but the index column has been replaced with that review ID, and the headers have been corrected, so the actual headers are now showing up as the actual headers to the DataFrame
. All right!
06:35
So now you should feel comfortable switching back and forth between openpyxl
and pandas
, which will let you perform analysis and other operations on the data in the DataFrame
format, but then allow you to style and clean up the resulting worksheet with openpyxl
And with that, you should feel pretty comfortable with openpyxl
to start working with Python to handle Excel files. In the next video, we’re going to take a few minutes to wrap up so you can see all the topics that you’ve learned in this course.
Become a Member to join the conversation.