Here are resources for more information about working with Pandas:
Working With Pandas
In this video, we’re going to talk about
pandas can load and save Excel files on its own, its very structure is to use
DataFrames for everything so if your end product isn’t a neatly-organized table, it’s probably better to use something like
openpyxl. On the other hand,
pandas allows you to perform many operations on your data, so it may seem like you’re stuck having to choose between which library to use.
openpyxl has support to convert a
DataFrame into a worksheet or to convert a worksheet into a
DataFrame. To try this out, go ahead and install
pip install pandas…
and then make a new Python script. I’m just going to call mine
In here, create a small
DataFrame. So first,
import pandas as pd. Then, say
data is equal to a dictionary, which will be something like
"Product Name" is going to equal a list,
["Product 1", "Product 2"], and then something like
"Sales Months 1",
which will be another list. So, something like
[10, 20], and then
"Sales Month 2", which I’ll put in
[5, 35]. And I’ll just get rid of that
Okay. Now you can say
df is going to be a
DataFrame, and then just pass in
data. If you want to take a look here, you can just print out the
you can see that you end up with your little
DataFrame right here. So
openpyxl has a
dataframe_to_rows() function that you can use to easily convert this
DataFrame into a format that
openpyxl can use.
Let’s go up here and add some more imports. So,
from openpyxl import Workbook and then
from openpyxl.utils (utilities)
.dataframe import dataframe_to_rows.
To use this—we’ll say
convert dataframe to worksheet—you can make a new workbook. So say
workbook = Workbook(), and then
sheet is going to be the active sheet, and then
for row in the
dataframe_to_rows() function, pass in the
DataFrame, and then you can say the
index=False because there isn’t an index that we want to use. There is a
header. And then
I’m going to get rid of this
print() statement and then take the
workbook and save it. And I’ll just call this
02:58 Run that, and a new workbook should pop up in your directory, and let’s see what it looks like. Open it up and look at that! Product 1, Product 2, and there’s the sales for months 1 and 2.
Just note that the first column is the column name, and there’s not an actual index there. If you had set
True, the first column would be the index, which would be the
0 and the
1 from the
You just dropped that off by saying
False here. So there you go! That’s how you can take a
DataFrame and turn it into a worksheet.
Now, let’s go the other way around and take a worksheet and turn it into a
DataFrame. I’m just going to comment out all of this
and I’ll do some new imports:
import load_workbook, because this time we’re going to take an existing workbook. So, take your workbook—this will be
load_workbook(), and this time pass in the Amazon data—
"sample.xlsx"—and then grab the active sheet.
Then, you can take the values, so just say
sheet.values, which will just grab everything that’s there, and you can literally just say
df = pd.DataFrame() and pass in
values. So to take a look at this, say
print(), and then print the
df—because there’s like a hundred rows, we’ll just say,
.head(). Run the script and look at that.
Now, unfortunately, with the zoom here, we can only see two of the columns—
pandas has hidden everything else—but if you look, you can see that there’s an index here, which is the
4, but then the headers are also a little off and it’s just another index of
14, while the first row has the actual column names.
04:50 If you want to use the actual headers, you can do that too. We’ll go back to the script and make a couple changes. I don’t remember which column the review ID was in but the mapping file that’s available has the index saved in it, so let’s go ahead and import that.
Now, set the first row as the headers.
cols is going to equal
next(), and then pass in
data—so it’ll grab that first row—and then
data is going to be a list and then pass in the rest of the data.
And then if you want to set the index, you can say that the index is going to equal—and then pass in a list comprehension—so it’ll be
[row[REVIEW_ID] for row in data].
This will loop through the resulting
data, and for that column where the
REVIEW_ID is, it’ll return that as a column index. Now, when you go to make your
DataFrame, you’re going to add a couple more arguments to it.
index is equal to the index that you created, and then
columns is going to equal
cols, just like that. Save this and try to run it.
And you get an error here that
'data' is not defined,
which makes sense because it’s actually called
values. So, let’s just change this from
values and try to rerun that.
And if you change one thing, you’ve got to change everything, so now instead of passing in
values, go ahead and pass in that resulting
and there we go. So now, if you take a look, you’ll only see two columns, still, but the index column has been replaced with that review ID, and the headers have been corrected, so the actual headers are now showing up as the actual headers to the
DataFrame. All right!
So now you should feel comfortable switching back and forth between
pandas, which will let you perform analysis and other operations on the data in the
DataFrame format, but then allow you to style and clean up the resulting worksheet with
openpyxl And with that, you should feel pretty comfortable with
openpyxl to start working with Python to handle Excel files. In the next video, we’re going to take a few minutes to wrap up so you can see all the topics that you’ve learned in this course.
Become a Member to join the conversation.