Extract Attributes From HTML Elements
The final piece of information that you will often want to extract from your HTML soup is the attributes on an HTML element. A very frequent one is the URL that is part of a link. Link elements have attributes, and the href attribute is what contains the URL, which is what lets you navigate to a different page.
Let’s look at this on our website. So, heading back over here, we can see that here’s the
<h2> that contains the title, right? And we’re looking for this link. I’m able to click on here, and this takes me to—let’s check it out.
For the jobs that interest you, maybe you want to follow that link and find out some more information. So we want to collect this link. Before, we already got the title from that specific link element, and by calling just .text on it, we only got the text, right? So, not the actual URL that we need.
01:24 this one, let me see if I can mark it…without going there! Okay, we got it marked now. So, this value is a part of the URL that allows your browser to navigate to that specific page. You might see that this doesn’t really look like a full URL.
01:39 Some pages have the full URL in there. Sometimes it’s just a relative URL, and the browser completes it against the address of the current page so that it ends up with an actual valid URL that it can navigate to.
Completing the URL yourself is a step that you might have to do, or not, depending on your site. As I said earlier, sites are very individual, and you always have to look specifically at the page that you’re currently working with. In this case, we get a relative URL and it is nested inside of the
We got the link that we’re looking for. Okay, so let’s look at this in code. We still have access to this
title_link element, which is what we got further up here, by finding inside of the
'a' element—the first
02:27 So I will take another look at it here. Okay, so this is equivalent to what we just looked at over here, right? This is that element, which on our page relates to this title containing both the text as well as the link. So, how can we get the link out of there?
I could run
title_link['id']. So, this is my Beautiful Soup object that refers to this link element, and I can access its
id using the square-bracket notation. Or something else—what else did we have in here?
Anything that shows up as an HTML attribute on that element, you’re going to access with this square-bracket notation. Now, as I mentioned earlier, usually the one that you’re looking for is href, because this will be the URL that you can use to navigate forward to a different page. And, as I also mentioned when we explored it over there, this is not an absolute URL, so if I put this by itself in my browser, it’s not going to take me anywhere.
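To make the square-bracket access concrete, here is a minimal sketch. The HTML snippet, the id value, and the href value are made up for illustration and will not match the real page; only the name title_link comes from the lesson.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one job title; the real page differs.
html = """
<h2 class="title">
  <a id="job_123" href="/rc/clk?jk=123">Data Engineer</a>
</h2>
"""

soup = BeautifulSoup(html, "html.parser")
title_link = soup.find("h2").find("a")

# .text gives only the visible text, not the URL.
print(title_link.text)

# Square brackets read an attribute; a missing one raises KeyError.
print(title_link["id"])
print(title_link["href"])

# .get() returns None instead of raising when the attribute is absent.
print(title_link.get("target"))

# .attrs holds all attributes of the element as a dictionary.
print(title_link.attrs)
```

If you are not sure an attribute exists on every element you scrape, .get() is the safer choice of the two.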
03:53 The reason being that this is a relative link, so we need to assemble it to be an absolute URL. Again, you know, what can help a lot in this case is that you go ahead and explore a bit and see what happens when you click on this actually.
This doesn’t seem to directly translate to this, so you might have to try around a little bit to figure out what is the right URL. I’ve done that before. So it’s
indeed.com—if I type that in, “indeed.com”—and then paste the remainder that we got from scraping, then it knows how to redirect me to this page, which is the details page of that specific job. Okay, cool!
So, all you need to do is add this
base_url up front and just join it together, which is a simple string concatenation operation here. I’m taking this
base_url that’s defined up there and adding this slice of the URL to it, and then we have a working job URL and you can see in here, in Jupyter Notebooks, it even makes it clickable so we can try it out and—there you go! For some reason—
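The concatenation step could look roughly like this. The exact value of base_url is an assumption (the lesson only names indeed.com), and relative_href is a hypothetical stand-in for the scraped href. The sketch also shows urljoin from the standard library, which is not what the lesson uses but does the same job here.

```python
from urllib.parse import urljoin

# Assumed value; the lesson only says the site is indeed.com.
base_url = "https://www.indeed.com"

# Hypothetical relative link, standing in for title_link["href"].
relative_href = "/rc/clk?jk=123"

# Plain string concatenation, as described in the lesson.
job_url = base_url + relative_href
print(job_url)

# Alternative: urljoin also copes with hrefs that are already absolute.
print(urljoin(base_url, relative_href))
print(urljoin(base_url, "https://example.com/other"))
```

Plain concatenation is fine as long as every href on the page is relative and starts with a slash. If a site mixes relative and absolute links, urljoin avoids building broken URLs.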
05:21 All right. So, here we go—a data engineer summer job. This is how we can move forward. Okay, great. So with this, we’re able to access the specific job posting, and then I could do this in the way that I just did here by playing around with the URL, putting it in there, or by just following the link directly.
And this is the same as just doing it with requests. So you could use requests again and get the information that you just looked at with a simple scrape, just passing in this job_url after you’ve created the absolute URL.
And then again, you could make another soup with another BeautifulSoup object. This is the same process that we went over before: scraping, creating the soup, and then you’re ready to parse this new page.
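As a sketch of that repeat step, the two small helpers below wrap the request and the soup creation. The function names make_soup and fetch_soup, the timeout value, and the choice of "html.parser" are assumptions for illustration, not the lesson's own code.

```python
import requests
from bs4 import BeautifulSoup


def make_soup(html):
    """Turn an HTML string into a BeautifulSoup object."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return make_soup(response.text)
```

With the absolute job_url from before, the call would be job_soup = fetch_soup(job_url), and job_soup can then be parsed just like the first soup.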
You see it becomes a process that you’re already familiar with. We could see here, for example, how can I get this title? There’s a
<p> tag and it has a
<strong> tag in there, and all of this sits in the
You could address this "content" ID and then maybe drill down. You don’t really see many class attributes in here, so it might be a bit more tricky to get to that specific information that you want, but Beautiful Soup also gives you the option to address the fourth
<p> element, or the second
<strong> element, et cetera. But in this case, you’re interested in the information about the job, so it might just make sense to collect this whole information as we did here and just store the text somewhere so you can read up on it, or maybe search for specific keywords to figure out whether you’re actually interested in a job or not. Okay. So, these are just options, right?
And I want to give you an outlook into what you can do to dig deeper. You could, for example, filter what I just mentioned. You could set up a pipeline that scrapes all of the URLs of all the jobs, then does another request to get the more detailed content, and then figure out does it have specific keywords in there that you would really be interested in for a job. Maybe you want
'growth opportunity' in there, or something like that.
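The options mentioned here, drilling down from the "content" ID, picking elements by position, and filtering the collected text by keyword, can be sketched as follows. The detail pages, their URLs, and the helper names extract_job_text and has_keyword are all hypothetical; in a real pipeline the HTML would come from a request per job URL.

```python
from bs4 import BeautifulSoup

# Hypothetical detail pages keyed by URL; real pages will look different.
detail_pages = {
    "https://example.com/jobs/1": """
        <div id="content">
          <p><strong>Data Engineer (Summer)</strong></p>
          <p>Build and maintain data pipelines.</p>
          <p>Work with <strong>Python</strong> and SQL.</p>
          <p>Great growth opportunity for students.</p>
        </div>
    """,
    "https://example.com/jobs/2": """
        <div id="content">
          <p><strong>Office Assistant</strong></p>
          <p>Answer phones and sort mail.</p>
        </div>
    """,
}


def extract_job_text(html):
    """Return all text inside the element with id="content"."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="content")
    if content is None:
        return ""
    return content.get_text(separator=" ", strip=True)


def has_keyword(text, keywords):
    """Case-insensitive check whether any keyword occurs in the text."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


keywords = ["growth opportunity"]
interesting = [
    url
    for url, html in detail_pages.items()
    if has_keyword(extract_job_text(html), keywords)
]
print(interesting)

# Positional access when there are no useful class attributes.
first_soup = BeautifulSoup(detail_pages["https://example.com/jobs/1"], "html.parser")
content = first_soup.find(id="content")
fourth_p = content.find_all("p")[3]            # fourth <p> element
second_strong = content.find_all("strong")[1]  # second <strong> element
print(fourth_p.get_text())
print(second_strong.get_text())
```

Positional lookups like find_all("p")[3] break as soon as the page layout changes, so collecting the whole text and searching it is often the more robust choice for this kind of filtering.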
07:54 So you see, there are a lot of things that you can do, and we are going to talk a bit more about options and a little task to work on in the next part, which is going to come right after a little wrap-up video where I’m just going to walk you again through the different ways of parsing that you’ve seen in this Jupyter Notebook. See you in the next lesson.