Working with Pandas
Here are resources for more information about working with Pandas:
In this video, we’re going to talk about
pandas can load and save Excel files on its own, its very structure is to use
DataFrames for everything so if your end product isn’t a neatly-organized table, it’s probably better to use something like
openpyxl. On the other hand,
pandas allows you to perform many operations on your data, so it may seem like you’re stuck having to choose between which library to use.
In here, create a small
DataFrame. So first,
import pandas as pd. Then, say
data is equal to a dictionary, which will be something like
"Product Name" is going to equal a list,
["Product 1", "Product 2"], and then something like
"Sales Months 1",
you can see that you end up with your little
DataFrame right here. So
openpyxl has a
dataframe_to_rows() function that you can use to easily convert this
DataFrame into a format that
openpyxl can use.
To use this—we’ll say
convert dataframe to worksheet—you can make a new workbook. So say
workbook = Workbook(), and then
sheet is going to be the active sheet, and then
for row in the
dataframe_to_rows() function, pass in the
DataFrame, and then you can say the
index=False because there isn’t an index that we want to use. There is a
header. And then
Just note that the first column is the column name, and there’s not an actual index there. If you had set
True, the first column would be the index, which would be the
0 and the
1 from the
and I’ll do some new imports:
import load_workbook, because this time we’re going to take an existing workbook. So, take your workbook—this will be
load_workbook(), and this time pass in the Amazon data—
"sample.xlsx"—and then grab the active sheet.
Then, you can take the values, so just say
sheet.values, which will just grab everything that’s there, and you can literally just say
df = pd.DataFrame() and pass in
values. So to take a look at this, say
print(), and then print the
df—because there’s like a hundred rows, we’ll just say,
.head(). Run the script and look at that.
Now, unfortunately, with the zoom here, we can only see two of the columns—
pandas has hidden everything else—but if you look, you can see that there’s an index here, which is the
4, but then the headers are also a little off and it’s just another index of
14, while the first row has the actual column names.
04:50 If you want to use the actual headers, you can do that too. We’ll go back to the script and make a couple changes. I don’t remember which column the review ID was in but the mapping file that’s available has the index saved in it, so let’s go ahead and import that.
Now, set the first row as the headers.
cols is going to equal
next(), and then pass in
data—so it’ll grab that first row—and then
data is going to be a list and then pass in the rest of the data.
This will loop through the resulting
data, and for that column where the
REVIEW_ID is, it’ll return that as a column index. Now, when you go to make your
DataFrame, you’re going to add a couple more arguments to it.
and there we go. So now, if you take a look, you’ll only see two columns, still, but the index column has been replaced with that review ID, and the headers have been corrected, so the actual headers are now showing up as the actual headers to the
DataFrame. All right!
So now you should feel comfortable switching back and forth between
pandas, which will let you perform analysis and other operations on the data in the
DataFrame format, but then allow you to style and clean up the resulting worksheet with
openpyxl And with that, you should feel pretty comfortable with
openpyxl to start working with Python to handle Excel files. In the next video, we’re going to take a few minutes to wrap up so you can see all the topics that you’ve learned in this course.
Become a Member to join the conversation.