Checking Out the Website in Your Browser

Martin Breuss

Exercises Course: Introduction to Web Scraping With Python (02:44)


00:00 The first thing that I always want to do before I start any sort of web scraping project is to check out the website using my browser. I want to understand what the structure is like, and also just look at it and see what I can find on there.

00:13 So I just opened that URL inside of my browser, and you can see we’ve got this skewed picture of Dionysus and then his name. We’ve got his hometown, favorite animal, and favorite color, right?

00:26 So what we’re interested in is Dionysus, which I can find after Name and a colon. That’s the first piece of information. The second piece I’m interested in is wine, which I can find after Favorite Color, and then a colon and a space.
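To make the "find it after the label" idea concrete, here is a minimal sketch. It assumes the visible text has already been copied into a string; `visible_text` and the helper `text_after` are hypothetical names introduced only for illustration, and the sample contains just the two lines mentioned above.

```python
def text_after(label: str, text: str) -> str:
    """Return whatever follows `label` on the same line, stripped of whitespace."""
    start = text.find(label)
    if start == -1:
        raise ValueError(f"label not found: {label!r}")
    start += len(label)
    # Stop at the end of the line, or at the end of the text if there is no newline.
    end = text.find("\n", start)
    if end == -1:
        end = len(text)
    return text[start:end].strip()


# Hypothetical stand-in for the text a normal user sees on the page.
visible_text = "Name: Dionysus\nFavorite Color: Wine"

name = text_after("Name:", visible_text)
favorite_color = text_after("Favorite Color:", visible_text)
print(name, favorite_color)
```

This only works on clean text. The page source is messier, so the real task will likely need more than this.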

00:46 Okay, so that’s me just viewing the page as a normal user, but as a developer, I’m more interested in seeing the markup of the page. And here in my browser, I can do this by right-clicking and going to View Page Source.

01:01 And then I get the raw HTML that builds up this site. You can see this isn’t super pretty HTML, which is also part of the point of this website: to show you how to deal with somewhat malformed HTML, right?

01:13 But you’ll also encounter this a lot if you scrape pages from the Internet. The quality of HTML out there varies widely. Okay, in this case, we can see that the name that you’re looking for is inside of an H2, a level-two header.

01:29 We have an image here, we have some line breaks, and finally you can find Favorite Color: Wine. Okay? So this is the structure of the website that you have in here.
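As a rough sketch of what working with that structure could look like, the snippet below pulls the level-two header and the favorite color out of a small piece of markup with regular expressions. The `sample_html` string is a hypothetical stand-in written for this example, not a copy of the real page source, and the helper names are made up. Regular expressions are just one possible approach here.

```python
import re
from typing import Optional

# Hypothetical stand-in for the page source (uppercase tags, loose layout).
sample_html = """<CENTER>
<H2>Name: Dionysus</H2>
<IMG SRC="placeholder.jpg">
<BR><BR>
Favorite Color: Wine
</CENTER>"""


def first_h2_text(html: str) -> Optional[str]:
    """Return the text inside the first h2 element, ignoring tag case."""
    match = re.search(r"<h2[^>]*>(.*?)</h2\s*>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def favorite_color(html: str) -> Optional[str]:
    """Return the text after 'Favorite Color:' up to the next tag or line end."""
    match = re.search(r"Favorite Color:\s*([^<\n]*)", html, re.IGNORECASE)
    return match.group(1).strip() if match else None


header = first_h2_text(sample_html)       # the whole header, label included
name = header.split(":", 1)[1].strip()    # drop the "Name:" label
color = favorite_color(sample_html)
print(name, color)
```

Ignoring case matters here because the tags may be written in uppercase, as they are in this stand-in.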

01:39 Both pieces of information that we’re interested in are inside of a center tag. And there are some images, there’s a title. I don’t like to see it all in uppercase,

01:49 but anyway, that’s just a feeling you get when you look at HTML that isn’t all that pretty.

01:55 Anyway, that’s not the point here. So we want to get this information out, and again, it should be this Dionysus and it should be this wine. For the first task here, you’re supposed to just scrape the whole page and get the HTML as one chunk of text.

02:10 So what I want to end up with is essentially this HTML wrapped in triple quotes, if you want, right? This is the first result that I should get by scraping this page.
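A minimal sketch of that first task, using only the standard library, might look like this. The URL is an assumption, since it is never spelled out here, so swap in the one from the exercise. `fetch_html` is a hypothetical helper name, and UTF-8 is assumed as the page encoding.

```python
from urllib.request import urlopen

# Assumed URL: replace it with the one given in the exercise.
URL = "http://olympus.realpython.org/profiles/dionysus"


def fetch_html(url: str, encoding: str = "utf-8") -> str:
    """Download `url` and return the response body as one string."""
    with urlopen(url) as response:
        raw_bytes = response.read()
    # The encoding is an assumption; adjust it if the page uses another one.
    return raw_bytes.decode(encoding)


# Uncomment to run against the live page (needs network access):
# html = fetch_html(URL)
# print(html)
```

The returned string is the "one chunk of text" that the later extraction steps would work on.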

02:21 And then after that, my next task is going to be to pick out the two pieces of information, Dionysus and wine, clean them up, and make sure that I get just that text.

02:31 Now I have an understanding of the task that I’m up against, so I’m going to head back over to my editor and start writing the code in the next lesson.
