Constructing the Scraping Script
00:00
So I want to keep track of the base URL, but I do not want to duplicate “profiles”. So what I am going to do is call this one base_url, without “profiles”, and then when I call urlopen(), I’m going to just concatenate “profiles” onto it up here.
00:18
Alright, so this is the base URL, and I will also remove the forward slash here and add it here. The reason is that the href value I get back includes the forward slash, and ultimately I want to stick it onto this base URL without duplicating the slash, so I should do that.
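As a rough sketch of the change described so far, the base URL is stored without “profiles” and without a trailing slash, and the path is added back only at the point where the page is fetched. The helper name fetch_html and the variable profiles_url are assumptions made for illustration, and the host is the one that shows up in the output later in the lesson.

```python
from urllib.request import urlopen

# Base URL with no trailing slash and no "profiles" part.
base_url = "http://olympus.realpython.org"


def fetch_html(url):
    """Download a page and return its HTML as text (hypothetical helper)."""
    with urlopen(url) as page:
        return page.read().decode("utf-8")


# The slash now lives in front of "profiles" instead of at the end of base_url.
profiles_url = base_url + "/profiles"
print(profiles_url)

# Fetching needs network access, so the call is left commented out here:
# html = fetch_html(profiles_url)
```

Keeping the slash on the path side means every piece that gets attached to base_url later can start with a slash without producing a double slash.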
00:39
And I also need to change URL to base_url here so that fetching the HTML still works. And then I should be good to go to use the nice for loop that I tried out before to find all the a tags and extract the href values.
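The transcript does not show the loop itself, so here is one possible way to find the a tags and collect their href values, using only the standard library's html.parser. The class name LinkCollector and the sample HTML are assumptions standing in for the real page; the lesson's own script may well use a different parsing library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Made-up stand-in for the fetched page, with three profile links.
sample_html = """
<html><body>
<h2><a href="/profiles/aphrodite">Aphrodite</a></h2>
<h2><a href="/profiles/poseidon">Poseidon</a></h2>
<h2><a href="/profiles/dionysus">Dionysus</a></h2>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)

for link_href in collector.hrefs:
    print(link_href)
```

Note that each href starts with a forward slash, which is why the slash was removed from the end of the base URL.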
00:59 So we’ll copy this and just paste it here,
01:04 make sure that the indentation fits. Let me see, these are four spaces, so I should be fine, but let’s make sure.
01:12
Okay, so in this case I found the link tags and extracted the href values, and then I still need to combine them, but for now I will see if it works as it did before in my playground.
01:26
So I will say python get_links, and I’m not running it interactively now. I’ll actually get rid of this print call
01:36
and as an output I again get just the href values of the three links. So that’s great. Now I only need to concatenate them with the base URL, so I can say print(base_url + link_href) and that’s going to combine the two.
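To make the concatenation step concrete, here is a small sketch with the href values hard-coded instead of scraped. The three paths are assumed stand-ins for what the loop printed. It also shows what would have happened if the trailing slash had stayed on the base URL, and, as an alternative that the lesson does not use, the standard library's urljoin().

```python
from urllib.parse import urljoin

base_url = "http://olympus.realpython.org"

# Assumed stand-ins for the href values extracted from the page.
link_hrefs = ["/profiles/aphrodite", "/profiles/poseidon", "/profiles/dionysus"]

# Plain string concatenation, as in print(base_url + link_href).
full_urls = [base_url + link_href for link_href in link_hrefs]
for full_url in full_urls:
    print(full_url)

# With a trailing slash on the base URL, the slash would be duplicated.
doubled = (base_url + "/") + link_hrefs[0]

# Alternative: urljoin() resolves a path that starts with "/" against the host.
joined = urljoin(base_url + "/profiles", link_hrefs[0])
```

Plain concatenation is enough here because every href has the same shape; urljoin() is worth considering when links may be relative or already absolute.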
01:58
So I will save the file and run it again, and I think I should have my solution: http://olympus.realpython.org, slash, only one slash, profiles, only one time profiles, slash, and then aphrodite.
02:12 Okay, great, that’s the solution. Let’s clean it up and double check that it points to the right websites.