Navigating the Parse Tree
00:00
Let’s start off by looking at how you can navigate the parse tree using Beautiful Soup. As you’ve seen before, I ran my program using the -i flag, which just drops you into an interactive session where you can explore your objects better.
00:15 And specifically I’m using an alternative Python REPL. If you’re wondering why some of the code is going to be colored for me, and it’s probably not going to be for you, this is because I’m using ptpython, which is an alternative shell.
00:28 And I’m only using this to make it easier for you to read what’s going on here, but the functionality is going to be exactly the same. So if you’re interested in an alternative Python REPL, check out the resources we have on the site on ptpython. Otherwise you can just do the exact same things using the default REPL.
00:46 It’ll just look a little less pretty. Okay, so we want to navigate the parse tree that we got from creating this Beautiful Soup object from the HTML content that you fetched from the internet.
00:59
And the strength of Beautiful Soup is really in its accessible interface. So if I, for example, want to get the title of that page: at the very top, it says <title>, and the title of the page is Profile: Dionysus, right?
01:15
I want to get access to this. You can do it by just saying soup.title, and that gives you back the complete <title> HTML element. And that’s a very intuitive syntax, right?
01:29 It’s like accessing an attribute on an object in Python. That’s exactly what you do with this Beautiful Soup object.
01:36
Most of these commands again return a Beautiful Soup object, so this soup.title is not just a string, but it is again a Beautiful Soup object.
01:46
And specifically a bs4.element.Tag object, right? And all of these Beautiful Soup objects have methods on them. One that is convenient, for example, is the .get_text() method.
02:00
So I can say soup.title.get_text() and call this method. And then I actually just get back the string that is the content of this HTML tag.
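As a minimal sketch of the steps above, the following uses a small made-up HTML string as a stand-in for the profile page (the real page content is not reproduced here, so the markup is an assumption) and the built-in "html.parser" parser:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the profile page; the real HTML is fetched in the course.
html = """
<html>
<head><title>Profile: Dionysus</title></head>
<body><center><img src="/static/dionysus.jpg"></center></body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

title = soup.title             # dot syntax: the tag name used like an attribute
print(title)                   # <title>Profile: Dionysus</title>
print(type(title).__name__)    # Tag (that is, bs4.element.Tag)

title_text = title.get_text()  # plain string content of the tag
print(title_text)              # Profile: Dionysus
```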
02:13
So that’s a way that you can navigate by using . and the name of the tag. I can also find another one. Let’s say I want to find an image tag. I can say soup.img.
02:23 This gives me the first image tag in my HTML. Note that it only gives you the first one. So if you remember, there are two image tags in this specific HTML.
02:36 Here’s one, and here’s another one. One points to the image of Dionysus and one is the image of some grapes. And using that dot syntax, you can always just get the first one.
02:47
So soup.img is always going to give me the first instance of an image that it can find. If you want to search the parse tree and find, for example, the second image, then you’ll have to do that differently.
02:59 But this dot syntax is very helpful for elements where you want to get the first one or you know there’s only one such element in the whole document. For example, the title element should only exist once.
03:11 Then this is a convenient way of accessing it.
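To illustrate the "first one only" behavior, here is a small sketch with two image tags. The markup, including the grapes file name, is a hypothetical stand-in for the course page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <img> tags, loosely modeled on the profile page.
html = """
<center>
<img src="/static/dionysus.jpg">
<h2>Name: Dionysus</h2>
<img src="/static/grapes.png">
</center>
"""
soup = BeautifulSoup(html, "html.parser")

first_img = soup.img  # dot syntax stops at the first match in the document
print(first_img)      # shows the Dionysus image tag, never the grapes one
```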
03:15
You can also find objects relative to the one that you’re currently looking at by using .parent or .children. So for example, I can get the parent element of <title> by saying soup.title.parent.
03:29 And that gives me back the <head> element, which contains the <title> element. So this whole thing that I’ve highlighted now, that it returns, is the <head> element.
03:40
Or you can get the children. Let’s look at one that’s a little more interesting: soup.center. The <center> element contains a lot of other tags.
03:51
So if I say soup.center.children,
03:57
I get an iterator object, which means that if you actually want to see them, you need to consume the iterator. You can do that, for example, by passing soup.center.children to the list() constructor,
04:10 and here you can see there’s a couple of elements that you get back, some break tags and then here’s the first image element, and then here’s another image element, etc.
04:22 So this is a way that you can get the children of an HTML element in your parse tree.
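The relative navigation described above can be sketched like this. The HTML is again a made-up stand-in for the course page. Note that the children of an element can also include text nodes such as newlines, so this sketch keeps only Tag objects when listing names:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical stand-in for the profile page.
html = """
<html>
<head><title>Profile: Dionysus</title></head>
<body>
<center>
<br><br>
<img src="/static/dionysus.jpg">
<h2>Name: Dionysus</h2>
<img src="/static/grapes.png">
<br><br>
</center>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# .parent goes one level up: <title> lives inside <head>.
head = soup.title.parent
print(head.name)  # head

# .children is an iterator, so consume it (here with list()) to see the items.
children = list(soup.center.children)

# Keep only real tags and skip the whitespace strings between them.
child_tags = [child.name for child in children if isinstance(child, Tag)]
print(child_tags)  # ['br', 'br', 'img', 'h2', 'img', 'br', 'br']
```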
04:28
And finally, another way of navigating is maybe you want to get access to the value of a certain HTML attribute. Let’s say you want to get this string, "/static/dionysus.jpg".
04:41
That’s the path to Dionysus’ profile picture. And you know it’s nested inside of the first image tag as the value of the src attribute.
04:52
Then using Beautiful Soup, you can say soup.img, which will give you the first image, and then you can use dictionary syntax, which means square brackets with quotes inside them, where you give the name of the attribute that you want to get the value of.
05:07
So I’m saying soup.img["src"], close the square brackets. There we go. And this will give me as output the value of the src attribute. So that’s, for example, a way that you can get the URLs of links that are on a page.
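A short sketch of the dictionary-style attribute access, using a hypothetical one-image snippet in place of the course page:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the profile page.
html = '<center><img src="/static/dionysus.jpg"></center>'
soup = BeautifulSoup(html, "html.parser")

# Square brackets on a tag look up the value of an HTML attribute by name.
src = soup.img["src"]
print(src)  # /static/dionysus.jpg
```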
05:29
You could use soup.a if there’s a link element. I don’t think we have one in here. Let’s take a look. Yeah, so there’s no link element here, but often on a page you would find <a> elements, which contain links.
05:41
And then you can access the URL of the link by saying soup.a, square brackets, and then put in the "href" attribute name: soup.a["href"].
05:52
Here we’re getting an error because we don’t have an <a> element, so soup.a is None and can’t be indexed, but you saw it working before with the src of the image. Okay, so these are ways that you can navigate the parse tree.
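The error can be reproduced in a small sketch. Both HTML snippets and the example.com URL are made up for illustration. When a tag doesn’t exist, dot syntax returns None, so one cautious option is to check for None before indexing:

```python
from bs4 import BeautifulSoup

# Hypothetical page without any <a> element, like the profile page.
no_links = BeautifulSoup('<center><img src="/static/dionysus.jpg"></center>', "html.parser")

link = no_links.a
print(link)  # None

try:
    no_links.a["href"]  # indexing None raises a TypeError
except TypeError as error:
    print(f"TypeError: {error}")

# Hypothetical page that does contain a link.
with_link = BeautifulSoup('<a href="https://example.com/">Example</a>', "html.parser")

href = with_link.a["href"] if with_link.a is not None else None
print(href)  # https://example.com/
```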
06:04 The three main things that we discussed here are: you can use dot syntax to get to the first element of a certain type. You can navigate relative to an element by using, for example, .parent or .children.
06:20 And finally, that you can also access attributes
06:25 by using dictionary syntax
06:28 and giving the name of the attribute that you want to access.
06:31 Next, let’s take a look at what you can do if you want to find a specific element. For example, how do you get to that second image element?