Understanding Basic HTML Structure
00:00 Let’s take a look at this example HTML snippet to revisit the basic pieces of an HTML document that are relevant for web scraping. First of all, you have tags, which define what elements exist in the document. In this part of an HTML document, you can see there are <div> tags, there’s an <h2> tag, there’s a <p> tag, and then an <a> tag.
00:22 These are the ones that exist in this little snippet, and they’re called HTML tags. Then you have attributes, which give you additional information about an HTML element, and they sit inside of the opening tag.
00:34 So here you can see the opening <div> tag goes from the opening angle bracket to the closing one, and in between you can see there are two HTML attributes.
00:45 One is the class attribute, which in this case has the value product, and then there’s also the id attribute. You can see another class attribute here in the <p> tag, and then there’s yet another attribute called href, which holds a link.
00:59 So the <a> tag defines a link element, and the href attribute inside of that <a> tag points to a certain URL.
01:08 So if you click this element, your browser is going to navigate to that URL.
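To make these terms concrete, here is a minimal sketch that lists the tags and attributes in a snippet like the one described above. The snippet is a reconstruction: the id value, product name, price, and URL are made-up placeholders, not the ones shown on screen. It uses Python’s built-in html.parser module instead of Beautiful Soup so that it runs without installing anything.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the snippet on screen; the id, text, and URL are invented.
SNIPPET = """
<div class="product" id="item-1">
  <h2>Example Product</h2>
  <p class="price">$9.99</p>
  <a href="https://example.com/item-1">View details</a>
</div>
"""


class TagLister(HTMLParser):
    """Collect every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.found.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(SNIPPET)

for tag, attrs in lister.found:
    print(tag, attrs)
```

Each printed line pairs a tag name with a dictionary of its attributes, so the <div> shows both class and id, while the <h2> shows an empty dictionary because it has no attributes.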
01:13 Alright, and finally, I’ve already mentioned them, but an HTML element is everything from a start tag to the end tag. So for this <h2> here, which defines the title of this product, what I’ve got highlighted now is the whole <h2> HTML element, which consists of an opening tag, a closing tag, and in this case some content.
01:35 The next one, for example, is a <p> element that consists, again, of an opening <p> tag and a closing </p> tag.
01:41 And then in this case it also has an HTML attribute called class and some content, which is the price of the product. These are the basic HTML terms that are good to remember, because you can use tags and attributes to navigate a document with Beautiful Soup and to find specific elements in it.
02:00 You can then also extract the information in there, for example the content. But sometimes you also want to know the URL, so you want to be able to extract an HTML attribute as well, and that’s all possible with Beautiful Soup.
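As a rough sketch of what that can look like, the example below finds elements by tag name and by attribute, then pulls out both text content and an attribute value. It assumes the beautifulsoup4 package is installed, and the HTML is an invented stand-in for the product snippet (the id, name, price, and URL are placeholders).

```python
from bs4 import BeautifulSoup

# Hypothetical product snippet; the id, text, and URL are invented for illustration.
html = """
<div class="product" id="item-1">
  <h2>Example Product</h2>
  <p class="price">$9.99</p>
  <a href="https://example.com/item-1">View details</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find an element by its tag name
title = soup.find("h2")

# Find an element by tag name plus class attribute
# (the keyword is class_ because class is reserved in Python)
price = soup.find("p", class_="price")

# Find an element by its id attribute alone
product = soup.find(id="item-1")

# Extract the content of an element
print(title.text)
print(price.text)

# Extract an attribute value with dictionary-style access
link = soup.find("a")
print(link["href"])

# class can hold several values, so Beautiful Soup returns it as a list
print(product["class"])
```

Text content comes from the .text attribute, while attribute values such as href are read with square brackets, much like a dictionary lookup.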
02:15 One thing you also want to do before actually working with the HTML that you scraped, or maybe even before scraping it, is to understand the structure of the site that you want to scrape.
02:25 In this case, the URL that I’ve pasted points to this profile page of Dionysus. You want to use the developer tools to inspect the page and get a feeling for the structure that you’re working with.
02:38 So I want to show you this as well. I’ll head over to this URL and then if you right-click in most browsers, you’ll be able to find an option that says Inspect.
02:48 And when you click that, then it opens up your developer tools from the browser. It’s going to look a little bit different depending on which browser you’re working with, but the concept is always the same.
02:58 It gives you a chance to see the actual HTML that makes up the site that you’re viewing.
03:05 Note that this is not necessarily exactly the HTML that you’re going to get when you scrape it, because this is what the Document Object Model looks like once your browser has gotten the HTML and rendered it.
03:18 So it might be a little bit different from what you’re getting when you scrape it, but it’s usually quite similar. The developer tools also have features such as this little arrow up there.
03:27 If you click it, then you can navigate to an element and click that and your browser is going to show you in the developer tools what that element is or also vice versa.
03:36 If I click on “Hometown: Mount Olympus”, then you can see it also highlights it over on the site. So this gives you a great way to further inspect the page and get a feeling for where the information that you want to extract is located on the page.
03:52 For example, if I wanted to get, let’s say, the name of this person’s profile, I can see that it’s nested inside of an <h2> tag that’s, again, nested inside of a <center> element on that page.
04:10 So that gives me a way to know how to address that specific information later on, when I’m working with Beautiful Soup. Alright, these are your browser dev tools.
04:20 They’re very helpful when you do web scraping, and I always suggest that you get a feel for the website before you start to scrape it in code, because it’ll make it much easier to find the information that you need.
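The nesting spotted in the developer tools, an <h2> inside of a <center> element, can be turned into a lookup along these lines. This is only a sketch: it assumes the beautifulsoup4 package is installed, and the HTML is a simplified, hypothetical stand-in for the profile page rather than its actual source (the exact text inside the <h2> is an assumption).

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for the profile page; the real markup may differ.
html = """
<html>
<body>
<center>
<h2>Dionysus</h2>
Hometown: Mount Olympus
</center>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Option 1: step down the tree one tag at a time
name_by_steps = soup.find("center").find("h2").text

# Option 2: describe the same nesting with a CSS selector
name_by_selector = soup.select_one("center h2").text

print(name_by_steps)
print(name_by_selector)
```

Both options follow the same path that the developer tools revealed. Stepping through find() calls mirrors the nesting explicitly, while the selector is shorter once the structure is known.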
04:32 With that said, let’s go back over to the code that you saw before, now that you have a better idea of how to interpret what you’ve seen there. And then let’s get ready to traverse that HTML parse tree and find specific information in it.