
Scraping the HTML

00:00 Here I’m using VS Code to solve this task, but you can use any editor that you want. And I’ve also created a new empty folder inside of my Real Python folder that I called web-scraping.

00:12 And in here I’m going to get started writing the code. First of all, I’ll start a new Python script. I’ll right-click, choose New File, and since I’m going to get the name and color, I’ll just name it get_name_color.py. As a first step, I’m going to copy-paste the URL that was given to me on that slide.

00:33 So here I have this URL variable that points to the page that I just looked at in the previous lesson. So this is the page I want to scrape. I need the urllib library from the Python standard library.

00:47 And specifically I need the urlopen function. So I’m going to import that by saying from urllib.request import urlopen.

00:59 So this is the function that I want to use to scrape the HTML from the URL here in line three. I can do that by calling urlopen and passing it the URL.
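
As a rough sketch of this step: the URL from the slide isn't reproduced in this transcript, so the example below assumes a hypothetical stand-in, a `data:` URL that embeds a tiny made-up HTML page. That keeps the sketch runnable offline; in the lesson, `url` would hold the real page address instead.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical stand-in for the URL on the slide (assumption): a data: URL
# that embeds a tiny HTML page, so this sketch runs without network access.
sample_html = "<html><head><title>Profile</title></head><body>Name: Zoë</body></html>"
url = "data:text/html;charset=utf-8," + quote(sample_html)

# urlopen() returns a response object; the HTML is not a string yet.
html_page = urlopen(url)
print(type(html_page).__name__)
```

The response object behaves like a file opened in binary mode, which is why the following steps call a read method on it before working with text.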

01:11 Then I need to save this to a variable, and I’m going to call that html_page. So that will be the whole page. And then I want to work with this as a text string, because then I want to use some string methods to actually find the information that I’m interested in.

01:28 So I will decode this into a string and I’ll call it html_text.

01:35 To do this, I have to call read on the resulting HTML page.

01:42 So I’ll say html_page.read(). And I’ll also decode it. For good measure, I’ll make sure that it’s decoded using “utf-8”. Later on, I’ll also give you some links in case you’re curious about why you need to do this, along with a couple of deep-dive rabbit holes around decoding and encoding.

02:03 For now, I’m just going to decode it like this and take a look.
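
Put together, the read-and-decode step might look like the sketch below. The `data:` URL and the sample page contents are assumptions made up for illustration so the code runs offline; only `urlopen`, `.read()`, and `.decode("utf-8")` come from the lesson.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical stand-in for the lesson's URL (assumption): a data: URL
# carrying a small made-up page, so no network connection is needed.
sample_html = "<html><body><h1>Name: Zoë</h1><p>Favorite Color: Blue</p></body></html>"
url = "data:text/html;charset=utf-8," + quote(sample_html)

html_page = urlopen(url)

# read() returns the raw bytes of the page; decode("utf-8") turns those
# bytes into a regular Python string.
html_text = html_page.read().decode("utf-8")
print(html_text)
```

The non-ASCII character in the sample name is there on purpose: it only comes back intact when the bytes are decoded with the encoding the page was written in.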

02:10 If you remember, what I want to get is the HTML text that we inspected in a previous lesson, and I hope that this is what I’m going to get if I print out html_text.

02:18 Let’s go ahead and run that.

02:24 And this looks pretty good. All right, there it is. So now I have that whole HTML that you saw earlier in the browser. I got it through my script. I scraped it and have access to it, and I’m also going to confirm whether it’s actually a string

02:42 because that’s what I want to work with.

02:45 So I run this again and it tells me that, yes, I am actually working with class str, which is great because it means I can use string methods to find the pieces of information I’m interested in.
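
The type check described here can be sketched as follows. As before, the `data:` URL and sample page are hypothetical stand-ins so the example runs offline, and the intermediate `html_bytes` variable is added only to show the difference between the two types.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical stand-in for the lesson's URL (assumption).
sample_html = "<html><body><h1>Name: Zoë</h1></body></html>"
url = "data:text/html;charset=utf-8," + quote(sample_html)

html_page = urlopen(url)
html_bytes = html_page.read()           # raw bytes from the response
html_text = html_bytes.decode("utf-8")  # decoded text

print(type(html_bytes))  # <class 'bytes'>
print(type(html_text))   # <class 'str'>
```

Skipping the decode step would leave a bytes object, so the check confirms that the decoded value is a true string before any string methods are applied to it.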

02:58 Alright, so let’s get started on the next lesson.
