Extract Text From HTML Elements

00:00 In this lesson, you want to dig deeper into the HTML that you got returned from the previous lessons and extract just a specific piece of text from it.

00:11 Again, let’s start off by exploring a bit. Now we have access to one of these cards, and now let’s see if I can find the title. The title seems to be nested inside of an <h2> element, so a second-level heading, and then there’s a link in here and it seems like the link has some content—

00:32 there it is—which is the actual text that makes up the heading. Okay. So inside of the card, there is an <h2> element. Inside of the <h2> element, there’s a link element, and the link element contains the text.
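
The nesting described here (card, then <h2>, then link, then text) can be sketched with a small stand-in for one card. The markup below is hypothetical: the class names, URL, and company name are invented for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for a single job card (all names and values invented).
card_html = """
<div class="card">
  <h2 class="title">
    <a href="/jobs/123" title="Data Engineer">
Data Engineer</a>
  </h2>
  <span class="company">Example Corp</span>
</div>
"""

soup = BeautifulSoup(card_html, "html.parser")
card = soup.find("div", class_="card")

# Walk the nesting: card -> <h2> -> <a>
heading = card.find("h2")
link = heading.find("a")
print(card.name, ">", heading.name, ">", link.name)  # div > h2 > a
```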

00:51 With this understanding, I’m heading back to the code and let’s just go for the first one, the one we inspected before. Remember the jobs from up here.

01:00 Because we used .find_all(), jobs is a list. So if I want to access one single Beautiful Soup element, I can access it via the index on that list.

01:09 I could also save that to a variable, but we’re just exploring here, so I’m saying, “Give me the first Beautiful Soup object that got returned from before, and in there, find an <h2> element.” Okay!
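
As a rough sketch of this step, assuming a tiny hypothetical results page (the `card` class and the URLs are made up), indexing into the list from `.find_all()` and calling `.find()` on the first element could look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical search results with two cards (markup invented for illustration).
html = """
<div id="results">
  <div class="card"><h2 class="title"><a href="/jobs/1">
Data Engineer</a></h2></div>
  <div class="card"><h2 class="title"><a href="/jobs/2">
Python Developer</a></h2></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
jobs = soup.find_all("div", class_="card")

# .find_all() returns a list-like ResultSet, so indexing works as on a list.
print(len(jobs))  # 2

# Take the first card and look for an <h2> element inside it.
first_heading = jobs[0].find("h2")
print(first_heading)
```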

01:21 So this slims it down quite a bit, but as we saw before, <h2> still contains a link and a bunch of other attributes on that link. So this is still not quite what we’re looking for, but because the thing that gets returned from a call like that is always another Beautiful Soup element, I can just keep calling .find() and it’s going to dig deeper. So now on the title, I’m going to say .find() the link element. I’m going to print it out.
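
A minimal sketch of chaining `.find()` calls, again on hypothetical markup (the attribute values are invented). Each call returns another `Tag` object, which is why the next `.find()` can be called on it directly:

```python
from bs4 import BeautifulSoup

# Hypothetical card markup; the link carries a few extra attributes.
html = """
<div class="card">
  <h2 class="title">
    <a href="/jobs/1" target="_blank" title="Data Engineer">
Data Engineer</a>
  </h2>
  <p class="summary">Build data pipelines.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
jobs = soup.find_all("div", class_="card")

title = jobs[0].find("h2")   # a Tag
link = title.find("a")       # another Tag, one level deeper

print(type(title).__name__)  # Tag
print(type(link).__name__)   # Tag
print(link)                  # only the <a> element, nothing after it
```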

01:52 You see it cuts off these parts and anything that happens after the link, and returns to me only the link element here…

02:01 which obviously is still way too much. But now comes the helpful attribute on every Beautiful Soup object which is just .text, which gives you the content—so anything that’s in between the tags.

02:16 So, it cuts off all of these attributes in here—and you’re going to learn later how to specifically pick something out of the attributes, if that’s the information you want.

02:25 But very often all you want is the text, so if you run .text on an element, you get the text! And this looks already much more similar to the title that we’re looking for, and you can clean it up a bit with just a normal Python string method here.
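
To illustrate what `.text` keeps and what it drops, here is a small sketch on a hypothetical link element (the attributes are invented). The attributes disappear, but the leading newline inside the tag is still part of the text:

```python
from bs4 import BeautifulSoup

# Hypothetical link element with a newline before the visible text.
link_html = '<a href="/jobs/1" target="_blank" title="Data Engineer">\nData Engineer</a>'
link = BeautifulSoup(link_html, "html.parser").find("a")

# .text returns only what sits between the opening and closing tags.
raw_title = link.text
print(repr(raw_title))  # '\nData Engineer'
```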

02:40 I’m calling .strip() on it, which takes off the newline character here. And I think that’s all, yeah. But if there were something at the end, it would also take that off. And here we are!
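
The cleanup itself is plain Python and does not depend on Beautiful Soup. A short sketch with made-up strings shows that `.strip()` removes whitespace at both ends and leaves the inside of the string alone:

```python
# Made-up example strings with whitespace at the start, the end, or both.
leading_only = "\nData Engineer"
both_ends = "\n  Data Engineer \n"

print(repr(leading_only.strip()))  # 'Data Engineer'
print(repr(both_ends.strip()))     # 'Data Engineer'
```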

02:50 We got the string, the actual title of the job posting. Let’s see which one it is: 'Data Engineer'.

03:00 So, by just searching for it, I can say “data”—Engineer Summer Internship, and there it is. So, that’s the element that we’re currently looking at…

03:10 and we’re correctly getting its title.

03:13 So, what I did afterwards is just write a list comprehension for doing all of these steps. So, first finding <h2>, finding the link, and then getting the .text of it, and then also cleaning it up for each of the jobs inside of the job list.

03:30 And like this, I could run the thing to get all of the job titles from that specific page. Run this, and here’s the output. You can see, these are all of the jobs that are currently listed on this one search result page. Nice!
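
The list comprehension described above might look roughly like the following. This is a sketch on hypothetical markup (class names, URLs, and job titles are invented), not the exact code from the video:

```python
from bs4 import BeautifulSoup

# Hypothetical results page with three cards (markup invented for illustration).
html = """
<div id="results">
  <div class="card"><h2 class="title"><a href="/jobs/1">
Data Engineer</a></h2></div>
  <div class="card"><h2 class="title"><a href="/jobs/2">
Python Developer </a></h2></div>
  <div class="card"><h2 class="title"><a href="/jobs/3">
Backend Engineer</a></h2></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
jobs = soup.find_all("div", class_="card")

# For each card: find the <h2>, find the link in it, take its text, clean it up.
job_titles = [job.find("h2").find("a").text.strip() for job in jobs]
print(job_titles)  # ['Data Engineer', 'Python Developer', 'Backend Engineer']
```

One caveat with this pattern: `.find()` returns `None` when nothing matches, so a card without an <h2> or a link would raise an `AttributeError` on the next call in the chain.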

03:48 So, the takeaway from this one is that, first of all, you can keep drilling down because Beautiful Soup keeps returning Beautiful Soup objects. So, all of those methods that you’re going to learn that work on one of them are going to work on the next level down as well.

04:04 So, that’s very helpful, and you can search for an HTML element by just passing in the type, which is similar to what you did up here, but here we specified it a bit more, so that it’s only the ones of that type with a specific class. Here, I want to find the <h2> element in there because there is only one inside of each of those cards.

04:25 And then I just keep drilling down. Next, I want to only get that link element, and then I want to call this useful attribute .text on it that just gives me the text output.
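
To make the recap concrete, here is a sketch contrasting a search by element type alone with a search by type plus class. The markup and the `card` and `footer` class names are assumptions made up for this example:

```python
from bs4 import BeautifulSoup

# Hypothetical page: a wrapper <div>, two job cards, and an unrelated <div>.
html = """
<div id="results">
  <div class="card"><h2><a href="/jobs/1">Data Engineer</a></h2></div>
  <div class="card"><h2><a href="/jobs/2">Python Developer</a></h2></div>
  <div class="footer">Page 1 of 5</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# By type only: every <div>, including the wrapper and the footer.
all_divs = soup.find_all("div")
print(len(all_divs))  # 4

# By type and class: only the job cards.
cards = soup.find_all("div", class_="card")
print(len(cards))  # 2

# Inside each card, searching by type alone is enough: there is one <h2>.
titles = [card.find("h2").find("a").text for card in cards]
print(titles)  # ['Data Engineer', 'Python Developer']
```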
