Extract Text From HTML Elements
Again, let’s start off by exploring a bit. Now we have access to one of these cards, so let’s see if I can find the title. The title seems to be nested inside of an <h2> element, so a second-level heading, and then there’s a link in here, and the link has some content (there it is), which is the actual text that makes up the heading. Okay. So inside of the card, there is an <h2> element. Inside of the <h2> element, there’s a link element, and the link element contains the text.
I could also save that to a variable, but we’re just exploring here, so I’m saying, “Give me the first Beautiful Soup object that got returned from before, and in there, find an <h2> element.” Okay!
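That exploration step can be sketched roughly as follows. The markup, the class name `card`, and the variable names `results` and `first_heading` are assumptions made for illustration, since the transcript does not show the real page's HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one search-result card.
html = """
<div class="card">
  <h2><a href="/jobs/1" data-id="1">
    Python Developer
  </a></h2>
  <p>Remote</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Assumed earlier step: collect every card on the page.
results = soup.find_all("div", class_="card")

# Take the first card and look for the <h2> element inside it.
first_heading = results[0].find("h2")
print(first_heading)
```

The printed value is still a full element, including the nested link and its attributes.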
So this slims it down quite a bit, but as we saw before, <h2> still contains a link and a bunch of other attributes on that link. So this is still not quite what we’re looking for, but because the thing that gets returned from a call like that is always another Beautiful Soup element, I can just keep calling .find() and it’s going to dig deeper. So now on the title, I’m going to say .find() the link element. I’m going to print it out, which obviously is still way too much. But now comes the helpful attribute on every Beautiful Soup object, which is just .text, which gives you the content, so anything that’s in between the tags.
But very often all you want is the text, so if you access .text on an element, you get the text! And this already looks much more similar to the title that we’re looking for, and you can clean it up a bit with just a normal Python string method here.
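Here is a small sketch of drilling down and then pulling out the text. The HTML is made up for illustration, and `.strip()` is an assumption about which string method does the cleanup, since the transcript only says "a normal Python string method":

```python
from bs4 import BeautifulSoup

# Hypothetical card markup with stray whitespace around the title.
html = '<div class="card"><h2><a href="/jobs/1">\n   Python Developer\n  </a></h2></div>'
soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="card")

# Each .find() returns another element, so the calls can be chained.
link = card.find("h2").find("a")

raw_title = link.text      # everything between the tags, whitespace included
title = raw_title.strip()  # assumed cleanup step: trim surrounding whitespace

print(repr(raw_title))
print(title)
```

Printing the `repr()` of the raw value makes the leading and trailing whitespace visible, which is why a cleanup step is worth adding.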
So, what I did afterwards is just write a list comprehension for doing all of these steps: first finding <h2>, then finding the link, then getting the .text of it, and then also cleaning it up for each of the jobs inside of the
03:30 And like this, I could run the thing to get all of the job titles from that specific page. Run this, and here’s the output. You can see, these are all of the jobs that are currently listed on this one search result page. Nice!
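A list comprehension along those lines might look like the sketch below. The sample HTML, the `card` class, and the names `results` and `titles` are hypothetical, and `.strip()` is again an assumed choice of cleanup method:

```python
from bs4 import BeautifulSoup

# Hypothetical search-result page with two job cards.
html = """
<div class="card"><h2><a href="/jobs/1">  Python Developer </a></h2></div>
<div class="card"><h2><a href="/jobs/2">
  Data Engineer
</a></h2></div>
"""
soup = BeautifulSoup(html, "html.parser")
results = soup.find_all("div", class_="card")

# For each card: find the <h2>, find the link, take its text, and clean it up.
titles = [card.find("h2").find("a").text.strip() for card in results]
print(titles)
```

One caveat: `.find()` returns `None` when nothing matches, so on a real page where some cards lack a heading or a link, this one-liner would raise an `AttributeError` and would need a guard.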
03:48 So, the takeaway from this one is that, first of all, you can keep drilling down because Beautiful Soup keeps returning Beautiful Soup objects. So, all of the methods that you’re going to learn that work on one of them are going to work on the next level down as well.
So, that’s very helpful. You can also search for an HTML element by just passing in the type. That’s similar to what you did up here, except that up there we specified it a bit more, so that it matched only the elements of that type with a specific class. Here, I want to find all of the <h2> elements in there because there is only one inside of each of those cards.
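The difference between searching by type alone and by type plus class can be sketched like this. The markup and the class names `card` and `sidebar` are invented for the example:

```python
from bs4 import BeautifulSoup

# Hypothetical page: <h2> elements appear both inside and outside the cards.
html = """
<h2>Site heading</h2>
<div class="card"><h2>Python Developer</h2></div>
<div class="sidebar"><h2>Filters</h2></div>
<div class="card"><h2>Data Engineer</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Type only: every <h2> on the page, wherever it sits.
all_headings = soup.find_all("h2")

# Type plus class: only the <div> elements that carry the "card" class.
cards = soup.find_all("div", class_="card")

# Inside a single card, searching by type alone is enough,
# because each card holds exactly one <h2>.
card_headings = [card.find("h2").text for card in cards]

print(len(all_headings))
print(card_headings)
```

Narrowing to the cards first is what makes the plain `find("h2")` safe: the class filter does the selecting, and the type search only has one candidate left.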