Scraping the HTML
00:00 Here I’m using VS Code to solve this task, but you can use any editor that you want. And I’ve also created a new empty folder inside of my Real Python folder that I called web-scraping.
00:12
And in here I'm going to get started writing the code. So first of all, I will start a new Python script. So I'll right-click, New File, and because I'm going to get the name and color, I'll just name it get_name_color.py. As a first step, I'm going to copy-paste the URL that was given to me on that slide.
00:33
So here I have this URL variable that points to the page that I just looked at in the previous lesson. So this is the page I want to scrape. I need the urllib library from the Python standard library.
00:47
And specifically I need the urlopen function. So I'm going to import that by saying from urllib.request import urlopen.
00:59
So this is the function that I want to use to scrape the HTML from the URL here in line three. I can do that by calling urlopen and passing it the URL.
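A minimal sketch of these two steps, the import and the call, might look like the following. The lesson's actual URL is not reproduced here, so a data: URL built from a made-up HTML snippet stands in for it. That stand-in and the sample_html name are assumptions made only so the example runs without network access.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical stand-in for the lesson's URL: a data: URL carrying a
# made-up HTML snippet, so the sketch runs offline.
sample_html = "<html><body><h1>Hello</h1></body></html>"
url = "data:text/html;charset=utf-8," + quote(sample_html)

# urlopen() returns a response object that can be read, not a string.
html_page = urlopen(url)
print(type(html_page))
```

With a real web address assigned to url, the import and the urlopen(url) call would stay the same.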
01:11
Then I need to save this to a variable, and I'm going to call that html_page. So that will be the whole page. And then I want to work with this as a text string, because then I want to use some string methods to actually find the information that I'm interested in.
01:28
So I will decode this into a string and I'll call it html_text.
01:35
To do this, I have to call read on the resulting HTML page.
01:42
So I'll say html_page.read(). And also I will decode it. It's good practice to make sure that it is decoded using "utf-8". I will also give you some links later on, in case you're curious about why you need to do this, along with a couple of deep-dive rabbit holes around decoding and encoding.
02:03 For now, I’m just going to decode it like this and take a look.
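As a rough sketch, the read-and-decode step could be written as below. Since the lesson's URL is not shown in this transcript, a data: URL with a made-up HTML snippet is assumed in its place, and raw_bytes is a hypothetical intermediate name added only to make the two stages visible.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical stand-in for the lesson's URL (keeps the sketch offline).
sample_html = "<html><body><h1>Hello</h1></body></html>"
url = "data:text/html;charset=utf-8," + quote(sample_html)

html_page = urlopen(url)

# read() returns the response body as bytes.
raw_bytes = html_page.read()

# decode("utf-8") turns those bytes into a regular Python string.
html_text = raw_bytes.decode("utf-8")
print(html_text)
```

The two calls can also be chained as html_page.read().decode("utf-8"), which matches the single-line form described in the lesson.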
02:10
If you remember, what I want to get in html_text is the HTML that we inspected in a previous lesson, and I hope that this is what I'm going to get if I print it out.
02:18 Let’s go ahead and run that.
02:24 And this looks pretty good. All right, there it is. So now I have that whole HTML that you saw earlier in the browser. I got it through my script. I scraped it and have access to it, and I'm also going to confirm whether it's actually a string
02:42 because that’s what I want to work with.
02:45 So I run this again and it tells me that, yes, I am actually working with class str, a string, which is great because that means I can then use string methods to find the pieces of information I'm interested in.
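The type check could be sketched like this. The HTML string here is made up for illustration rather than taken from the lesson's page, and .find() is shown only as one example of a string method that might be used on the text.

```python
# Made-up HTML standing in for the decoded page text.
html_text = "<html><body><h1>Hello</h1></body></html>"

# Confirm that the decoded text is a regular Python string.
print(type(html_text))  # <class 'str'>

# Because it is a str, string methods such as .find() are available.
# .find() returns the index of the first match, or -1 if there is none.
heading_index = html_text.find("<h1>")
print(heading_index)
```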