Manipulating the Data
00:00 Now that you can pull all the data out of an Excel workbook, let’s take a look at how you can convert that data into useful Python data structures. All the data you’ve returned so far has been in the form of tuples, which can be thought of as just immutable lists.
00:14 This should be relatively straightforward to work with. So, let’s try out an example where you need to extract product info from the spreadsheet and put it into a dictionary where the key is a product ID.
00:25 I’m going to open up the terminal, activate the environment, and start bpython.
00:33
And, like before, from openpyxl import load_workbook
. Let’s say that workbook = load_workbook()
, with that sample spreadsheet, and then sheet
will be workbook.active
.
00:46 So, one way to make this dictionary is to iterate through all the rows, grabbing the product ID, and then collecting the columns that are related to the product to put into another dictionary.
00:56
This dictionary of dictionaries will let you index our product ID and then return all the pertinent info, which is pretty cool. To get started, go ahead and grab all the header values so that you know which columns you are interested in. So, for value in sheet.iter_rows()
—make sure the capitalization’s correct—and you’ll want min_row
equal to 1
, and max_row
is also equal to 1
. This will just grab that first row.
01:22
And, because you’re interested in the values, say values_only=True
. And from here, go ahead and print(value)
. All right. So take a look at all the header titles—so, 'marketplace'
, 'customer_id'
—and you’ll notice that you have 'product_id'
, 'product_parent'
, 'product_title'
, and 'product_category'
. These are conveniently next to each other, so you can quickly pull all of these columns using the min and max column arguments that you used earlier. Let’s take a look. This will be 1
, 2
, 3
, 4
, 5
, 6
, 7
, as the max column.
01:56
So, to test this out, for value in sheet.iter_rows()
—and this time, you’ll say min_row=2
, because you don’t want that header row. And then, the min_col
is going to be equal to 4
, and the max_col
is going to be equal to 7
, and like before, you’ll want values_only
set to True
.
02:20
And from here, go ahead and print the value. Okay! So, that printed out a bunch of stuff here—I’ll scroll to the top—and you should see that you end up with the product ID, the product’s parent, product title—which is a pretty long one—and then the product category, which should be 'Watches'
, for all of these.
02:39 So, this is looking pretty good. Now, it’s just a matter of taking this data and mapping it into a dictionary. I’m going to clear this.
02:49
Let’s put this into a new script so it’s a little easier to follow. I’m going to make a new file. I’ll call this something like parse_product.py
.
03:02
And to get more fancy, let’s go ahead and import json
, and then from openpyxl import load_workbook
. And also like before, let’s say, workbook = load_workbook()
and pass in the filename
of "samples.xlsx"
. sheet
is still the workbook.active
.
03:23
So, the end result: you want to have a dictionary called products
—so, set this up as an empty dictionary, for now—and then, like you did in the interpreter, make a for
loop.
03:32
So, for row in sheet.iter_rows()
, and you’ll pass in the min_row=2
,
03:39
and then the min_col=4
and the max_col=7
. Then, make sure you have the values_only
and set this equal to True
. And now, instead of printing out that row
—and I should probably have an in
up here—you’re going to make a product_id
, which is going to equal the first item in the row,
04:00
and then make the product dictionary. So, just say something like product
and then start a new dictionary, and you want to pass in the "parent"
, which is row[1]
, and then the "title"
, which is row[2]
, and then the "category"
, which is that last item, so at index 3
.
04:22
And now that you have this, you can say that products
and then, at that product_id
, is going to equal product
. So, this line right here has a lot going on with it, very similar-sounding variables—and I’m not helping by not having an s
here.
04:37
So, let’s break this down. products
here is what you defined up here as an empty dictionary. And then now, you’re filling this dictionary with a key that’s equal to a product_id
, and then the value at that key is going to be the product
, which is this dictionary here and has more information on the product.
04:59 So, this is a pretty cool way to take your data and set it up. Like, if you had a product ID of something you pulled out of a database. Now, you could say the product ID, and then that would pull out all the product info in a dictionary format.
05:12 This could be very useful depending on what you’re doing.
05:16
So, after this runs through every row in the worksheet, let’s take this products
dictionary and use json
to get this into a string format. We’ll just print that out, so, print
, json.dumps()
(dump S)—to dump the dictionary into a string—and then pass in the products
dictionary. Save this. We can actually quit out of the interpreter session. And then, let’s try to run it. So say, python parse_product.py
. I got an error, so let’s see. No [...] file or directory
, so if I go up here—so, I named mine sample
. And that’s wrong. All right, let’s save that.
05:57 And this looks pretty good. Let’s go up to the top…
06:03 and here you go! You’ve got a dictionary here,
06:06
and then the first key is this ID, and then that ID represents this value, which is another dictionary, here. It has all the product information, with the "parent"
, "title"
, and "category"
down here. So, this is great! This might be a bit tricky to read printed out in the terminal here, but a JSON output is very useful in a lot of cases, and the dictionary that produced it would also be very useful in further Python scripts.
06:34 In the next video, you’re going to learn how to append data to a spreadsheet, before you learn how to write new spreadsheets.
Pavneet Ghai on April 9, 2022
I am using googlecolab to practice, when I use
for value in SHEET.iter_rows(min_row=2, min_col=4,max_col=7, values_only=True):
print(value)
Note not using max_row, I am getting none values too Sharing googlecolab NB: colab.research.google.com/drive/1ng2YXhxQYOr_AaJMvtbd1XwEi-edRxZg?usp=sharing
Dawn0fTime on Jan. 21, 2024
You can make the json output much easier to read with indentation. Edit line 22 as follows:
print(json.dumps(products, indent=4))
Become a Member to join the conversation.
Oren Wolfe on Sept. 2, 2020
Your On-Screen ‘parse_product.py’ does not match the file in ‘openpyxl_sample_code.zip’, and the sample_code version does not work. Who ya gonna call?