Working With Pandas

Editing Excel Spreadsheets in Python With openpyxl Joe Tatusko 07:05

Here are resources for more information about working with Pandas:

00:25 Don’t worry! openpyxl has support to convert a DataFrame into a worksheet or to convert a worksheet into a DataFrame. To try this out, go ahead and install pandas. So, pip install pandas…

00:40 and then make a new Python script. I’m just going to call mine excel_df.py.

00:48 In here, create a small DataFrame. So first, import pandas as pd. Then, say data is equal to a dictionary, which will be something like "Product Name" is going to equal a list, ["Product 1", "Product 2"], and then something like "Sales Months 1",

01:13 which will be another list. So, something like [10, 20], and then "Sales Month 2", which I’ll put in [5, 35]. And I’ll just get rid of that "s".

01:28 Okay. Now you can say df is going to be a pandas DataFrame, and then just pass in data. If you want to take a look here, you can just print out the df.

01:39 If you ran the script…

01:45 you can see that you end up with your little DataFrame right here. So openpyxl has a dataframe_to_rows() function that you can use to easily convert this DataFrame into a format that openpyxl can use.

01:59 Let’s go up here and add some more imports. So, from openpyxl import Workbook and then from openpyxl.utils (utilities) .dataframe import dataframe_to_rows.

02:16 To use this—we’ll say convert dataframe to worksheet—you can make a new workbook. So say workbook = Workbook(), and then sheet is going to be the active sheet, and then for row in the dataframe_to_rows() function, pass in the DataFrame, and then you can say the index=False because there isn’t an index that we want to use. There is a header. And then sheet.append(row).

02:49 I’m going to get rid of this print() statement and then take the workbook and save it. And I’ll just call this "pandas.xlsx".

02:58 Run that, and a new workbook should pop up in your directory, and let’s see what it looks like. Open it up and look at that! Product 1, Product 2, and there’s the sales for months 1 and 2.

03:12 Just note that the first column is the column name, and there’s not an actual index there. If you had set index to True, the first column would be the index, which would be the 0 and the 1 from the DataFrame earlier.

03:26 You just dropped that off by saying False here. So there you go! That’s how you can take a DataFrame and turn it into a worksheet.

03:34 Now, let’s go the other way around and take a worksheet and turn it into a DataFrame. I’m just going to comment out all of this

03:44 and I’ll do some new imports: from openpyxl import load_workbook, because this time we’re going to take an existing workbook. So, take your workbook—this will be load_workbook(), and this time pass in the Amazon data—"sample.xlsx"—and then grab the active sheet.

04:06 Then, you can take the values, so just say sheet.values, which will just grab everything that’s there, and you can literally just say df = pd.DataFrame() and pass in values. So to take a look at this, say print(), and then print the df—because there’s like a hundred rows, we’ll just say, .head(). Run the script and look at that.

04:30 Now, unfortunately, with the zoom here, we can only see two of the columns—pandas has hidden everything else—but if you look, you can see that there’s an index here, which is the 0, 1, 2, 3, 4, but then the headers are also a little off and it’s just another index of 0 through 14, while the first row has the actual column names.

04:50 If you want to use the actual headers, you can do that too. We’ll go back to the script and make a couple changes. I don’t remember which column the review ID was in but the mapping file that’s available has the index saved in it, so let’s go ahead and import that.

05:06 Now, set the first row as the headers. cols is going to equal next(), and then pass in data—so it’ll grab that first row—and then data is going to be a list and then pass in the rest of the data.

05:20 And then if you want to set the index, you can say that the index is going to equal—and then pass in a list comprehension—so it’ll be [row[REVIEW_ID] for row in data].

05:32 This will loop through the resulting data, and for that column where the REVIEW_ID is, it’ll return that as a column index. Now, when you go to make your DataFrame, you’re going to add a couple more arguments to it.

05:44 Say that index is equal to the index that you created, and then columns is going to equal cols, just like that. Save this and try to run it.

05:55 And you get an error here that 'data' is not defined,

06:00 which makes sense because it’s actually called values. So, let’s just change this from values and try to rerun that.

06:09 And if you change one thing, you’ve got to change everything, so now instead of passing in values, go ahead and pass in that resulting data,

06:18 and there we go. So now, if you take a look, you’ll only see two columns, still, but the index column has been replaced with that review ID, and the headers have been corrected, so the actual headers are now showing up as the actual headers to the DataFrame. All right!

Become a Member to join the conversation.