Extract Attributes From HTML Elements

Web Scraping With Beautiful Soup and Python Martin Breuss 08:16

00:00 And the final piece of information that you will often want to extract from your HTML soup—we’re going to talk about the attributes on an HTML element. A very frequent one is the URL that is part of a link. The link elements have attributes, and the href attribute is what contains the URL, which is actually what helps you to navigate to a different page.

00:22 Let’s look at this on our website. So, heading back over here, we can see that here’s the <h2> that contains the title, right? And we’re looking for this link. I’m able to click on here, and this takes me to—let’s check it out.

00:39 It takes me to a detail page related to that specific job, so I’ve got some more information on the job on here. So, having this link might be something that’s useful from your scrape, right?

00:50 Because the jobs that interest you—maybe you want to follow that link and find out some more information about it. So we want to collect this link, but before, we already got the title from out here—from that specific link element—and by calling just .text on it, we only got the text, right? So, not the actual URL that we need.

01:12 And that is because it is nested inside of the opening tag as an HTML attribute,

01:18 specifically the href one here. So, you can see this value…

01:24 this one, let me see if I can mark it…without going there! Okay, we got it marked now. So, this value is a part of the URL that allows your browser to navigate to that specific page. You might see that this doesn’t really look like a full URL.

01:39 Some pages might have the full URL in there, sometimes it’s just relative URLs and the rest of the page knows how to complete this so that then it ends up with an actual valid URL that it can navigate to.

01:52 Which is a step that you might have to do—or not, depending on your site. As I said already earlier, the sites are very individual, and you always have to specifically look at your page that you’re currently working with. In this case, we get a relative URL and it is nested inside of the <h2> "title".

02:10 We got the link that we’re looking for. Okay, so let’s look at this in code. We still have access to this title_link element, which is what we got further up here, by finding inside of the title the 'a' element—the first 'a' element.

02:27 So I will take another look at it here. Okay, so this is equivalent to what we just looked at over here, right? This is that element, which on our page relates to this title containing both the text as well as the link. So, how can we get the link out of there?

02:44 And you can access specific HTML attributes of an element by using the square-bracket notation, and then the name of the attribute. In this case, it would be 'href'.

02:55 Let’s look at another one first because there’s more of them in there. You can see href is here, but let’s look first at—for example—the id of this element.

03:05 I could run title_link['id']. So, this is my Beautiful Soup object that refers to this link element, and I can access its id using the square-bracket notation. Or something else—what else did we have in here? onclick.

03:22 Anything that shows up as an HTML attribute on that element, you’re going to access with this square-bracket notation. Now, as I mentioned earlier, usually the one that you’re looking for is href, because this will be the URL that you can use to navigate forward to a different page. And, as I also mentioned when we explored it over there, this is not an absolute URL, so this thing by itself, if I put this in my browser, it’s not going to take me anywhere.
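As a rough sketch of the square-bracket access described here, the snippet below parses a small piece of made-up markup. The HTML string, the id value, the onclick value, and the href value are all hypothetical stand-ins for what the real page contains; only the title_link name and the bracket notation come from the lesson.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the title link from the lesson.
html = """
<h2 class="title">
  <a id="job_123" href="/rc/clk?jk=abc123" onclick="track(this)">Data Engineer</a>
</h2>
"""

soup = BeautifulSoup(html, "html.parser")
title_link = soup.find("h2", class_="title").find("a")

print(title_link.text)      # only the visible text of the link
print(title_link["id"])     # value of the id attribute
print(title_link["href"])   # value of the href attribute (relative here)
print(title_link.attrs)     # every attribute as a dictionary

# Square brackets raise KeyError when the attribute is missing;
# .get() returns None instead, which is handy for optional attributes.
print(title_link.get("target"))
```

Using .get() instead of square brackets is a judgment call: it avoids an exception on links that lack the attribute, at the cost of having to handle None afterwards.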

03:53 The reason being that this is a relative link, so we need to assemble it to be an absolute URL. Again, you know, what can help a lot in this case is that you go ahead and explore a bit and see what happens when you click on this actually.

04:09 You see that it shows you indeed.com/ and then viewjob and then there’s some query parameters up here.

04:19 This doesn’t seem to directly translate to this, so you might have to experiment a little bit to figure out what the right URL is. I’ve done that before. So it’s indeed.com: if I type in “indeed.com” and then paste the remainder that we got from scraping, then it knows how to redirect me to this page, which is the details page of that specific job. Okay, cool!

04:44 So, all you need to do is add this base_url up front and just join it together, which is a simple string concatenation operation here. I’m taking this base_url that’s defined up there and adding this slice of the URL to it, and then we have a working job URL and you can see in here, in Jupyter Notebooks, it even makes it clickable so we can try it out and—there you go! For some reason—

05:12 I’m not entirely sure why—it looks a little different than what we clicked on before, but it does contain the same information.
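Joining the base URL and the scraped relative link could look roughly like the sketch below. The value of base_url and the relative href are assumptions, since the transcript does not show them. The lesson uses plain string concatenation; urljoin from the standard library is shown next to it as an alternative, not as what the lesson does.

```python
from urllib.parse import urljoin

base_url = "https://www.indeed.com"    # assumed value, not shown in the transcript
relative_href = "/rc/clk?jk=abc123"    # hypothetical href scraped from the title link

# Plain string concatenation, as in the lesson. This works as long as
# exactly one slash ends up between the two parts.
job_url = base_url + relative_href

# Alternative: urljoin resolves the relative part against the base and
# leaves an href that is already absolute untouched.
joined_url = urljoin(base_url, relative_href)
already_absolute = urljoin(base_url, "https://example.com/job/1")

print(job_url)
print(joined_url)
print(already_absolute)
```

On a site that mixes relative and absolute links, urljoin saves you from checking each href yourself before concatenating.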

05:21 All right. So, here we go—a data engineer summer job. This is how we can move forward. Okay, great. So with this, we’re able to access the specific job posting, and then I could do this in the way that I just did here by playing around with the URL, putting it in there, or by just following the link directly.

05:44 And this is the same as just doing it with requests. So you could use requests again and get the information that you just looked at by a simple scrape exercise here, just putting in this job_url after you’ve created this absolute URL from it.

06:00 And then again, you could make another soup with another BeautifulSoup object. This is the same process that we went over before: scraping, creating the soup, and then you’re ready to parse this new page.

06:14 job_soup.text, for example, gives us the whole content of this page…

06:21 just in one go, because we’re calling the .text attribute on the highest level, on the whole HTML that you’re getting back.
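The request-then-soup step for a detail page might be sketched as follows. The function names fetch_job_soup and page_text are hypothetical helpers, and the timeout value is an arbitrary choice; the lesson itself calls requests and BeautifulSoup directly in notebook cells.

```python
import requests
from bs4 import BeautifulSoup


def fetch_job_soup(job_url):
    """Download a job detail page and return it as a BeautifulSoup object."""
    response = requests.get(job_url, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")


def page_text(soup):
    """Return all text of the parsed page as a single string."""
    # soup.text gives the same text without the separator and stripping.
    return soup.get_text(separator=" ", strip=True)


# Usage, assuming job_url holds the absolute URL built earlier:
# job_soup = fetch_job_soup(job_url)
# print(page_text(job_soup))
```

Passing a separator keeps words from neighboring tags from running together, which makes the text easier to search later.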

06:28 Of course, you could also dig deeper, inspect here again.

06:33 You see it becomes a process that you’re already familiar with. We could see here, for example, how can I get this title? There’s a <p> tag and it has a <strong> tag in there, and all of this sits in the id="content".

06:49 You could address this "content" ID and then maybe drill down. You don’t really see many class attributes in here, so it might be a bit more tricky to get to that specific information that you want, but Beautiful Soup also gives you the option to address the fourth <p> element, or the second <strong> element, et cetera. But in this case, you’re interested in the information about the job, so it might just make sense to collect this whole information as we did here and just store the text somewhere so you can read up on it, or maybe search for specific keywords to figure out whether you’re actually interested in a job or not. Okay. So, these are just options, right?
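One way to drill down by position when there are no useful class attributes is to collect all matching elements and index into the list. The markup below is invented for illustration and only mirrors the structure mentioned in the lesson: a container with id="content" holding <p> and <strong> tags.

```python
from bs4 import BeautifulSoup

# Hypothetical detail-page markup; the real page will differ.
html = """
<div id="content">
  <p>Posted today</p>
  <p><strong>Data Engineer</strong></p>
  <p>Location: <strong>Remote</strong></p>
  <p>We offer a growth opportunity.</p>
</div>
"""

job_soup = BeautifulSoup(html, "html.parser")
content = job_soup.find(id="content")

# find_all returns the matches in document order, so list indexing
# picks the nth element (indexes start at 0).
fourth_p = content.find_all("p")[3]
second_strong = content.find_all("strong")[1]

print(fourth_p.text)
print(second_strong.text)
```

Positional lookups like these break easily when the page layout changes, which is one reason storing the whole text can be the more robust option.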

07:29 And I want to give you an outlook into what you can do to dig deeper. You could, for example, filter for what I just mentioned. You could set up a pipeline that scrapes the URLs of all the jobs, then does another request for each one to get the more detailed content, and then checks whether that content has specific keywords in it that you would really be interested in for a job. Maybe you want 'growth opportunity' in there, or something like that.
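The keyword-filtering step of such a pipeline could be sketched like this. The function names, the URLs, and the sample texts are all hypothetical; in a real run, the dictionary would be filled with text scraped from each job's detail page.

```python
def find_keywords(text, keywords):
    """Return the keywords that occur in text, ignoring case."""
    lowered = text.lower()
    return [keyword for keyword in keywords if keyword.lower() in lowered]


def filter_jobs(job_texts, keywords):
    """Keep only the jobs whose detail text mentions at least one keyword.

    job_texts maps a job URL to the text scraped from its detail page.
    The result maps each matching URL to the keywords found there.
    """
    matches = {}
    for url, text in job_texts.items():
        hits = find_keywords(text, keywords)
        if hits:
            matches[url] = hits
    return matches


# Hypothetical scraped data, standing in for real detail pages.
job_texts = {
    "https://www.indeed.com/job/1": "Great team and a real Growth Opportunity.",
    "https://www.indeed.com/job/2": "Short summer contract, on site.",
}

interesting = filter_jobs(job_texts, ["growth opportunity"])
print(interesting)
```

Keeping the filter separate from the scraping code means you can test it on saved text without sending any requests.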

07:54 So you see, there’s a lot of things that you can do, and we are going to talk a bit more about options and a little task to work on in the next part, which is going to come right after a little wrap up video where I’m just going to walk you again over the different ways of parsing that you’ve seen in this Jupyter Notebook. See you in the next lesson.
