Constructing the Scraping Script

Exercises Course: Introduction to Web Scraping With Python Martin Breuss 02:20

00:00 So I want to keep track of the base URL, but I don’t want to duplicate “profiles”. What I’m going to do is call this one base_url, without “profiles”, and then when I call urlopen(), I’ll just concatenate “profiles” onto it up here.

00:18 Alright, so this is the base URL. I’ll also remove the forward slash here and add it here instead. The reason is that the href value I get back already includes the forward slash. Ultimately, I want to stick it onto this base URL, and I don’t want to duplicate the slash, so I should do that.
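The slash handling described here can be sketched with plain string concatenation. This is a minimal illustration, not the lesson's actual file: the href value is an assumed sample, and the names profiles_url, doubled, and clean are made up for the example.

```python
# Base URL without a trailing slash and without "/profiles".
base_url = "http://olympus.realpython.org"

# The page to fetch: "/profiles" is appended only at this point.
profiles_url = base_url + "/profiles"

# Assumed sample href; per the lesson, it already starts with a slash.
link_href = "/profiles/aphrodite"

# If the base URL kept its trailing slash, the slash would be doubled.
doubled = (base_url + "/") + link_href

# Without the trailing slash, the pieces fit together cleanly.
clean = base_url + link_href

print(profiles_url)
print(doubled)
print(clean)
```

Printing the three values side by side makes the doubled slash easy to spot.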

00:39 And I also need to change URL to base_url here so that fetching the HTML still works. And then I should be good to go to use the nice for loop that I tried out before to find all the a tags and extract the href values.

00:59 So we’ll copy this and just paste it here,

01:04 make sure that the indentation fits. Let me see, these are four spaces, so I should be fine, but let’s make sure.

01:12 Okay, so in this case I found the link tags and extracted the href values and then I still need to combine them, but for now I will see if it works as it did before in my playground.
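As a rough sketch of this step, the loop below collects the href value of every a tag. It deliberately uses the standard library's html.parser so that it runs without extra installs, which may differ from the parsing library used in the lesson. The LinkCollector class and the sample_html string are hypothetical stand-ins for the script's parser and the fetched page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Hypothetical stand-in for the HTML that the script fetches.
sample_html = """
<html><body>
<h2><a href="/profiles/aphrodite">Aphrodite</a></h2>
<h2><a href="/profiles/poseidon">Poseidon</a></h2>
<h2><a href="/profiles/dionysus">Dionysus</a></h2>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)

# At this stage only the raw href values are printed, not full URLs.
for link_href in collector.hrefs:
    print(link_href)
```

Working against an inline string like this is one way to check the extraction logic before pointing the script at the live page.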

01:26 So I will say python get_links, I’m not running it interactively now. I actually get rid of this print call

01:36 and as an output I get again just the href values of the three links. So that’s great. Now I only need to concatenate them with the base URL, so I can say print(base_url + link_href) and that’s going to combine the two.
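The concatenation step might look roughly like this. The list of hrefs is an assumed sample standing in for the values extracted earlier, and the urljoin() comparison at the end is an addition for illustration, not something the lesson uses.

```python
from urllib.parse import urljoin

base_url = "http://olympus.realpython.org"

# Assumed sample hrefs, standing in for the values extracted from the page.
link_hrefs = ["/profiles/aphrodite", "/profiles/poseidon", "/profiles/dionysus"]

# The approach from the lesson: plain string concatenation.
full_urls = [base_url + link_href for link_href in link_hrefs]
for full_url in full_urls:
    print(full_url)

# Alternative (not from the lesson): urljoin() resolves an href that starts
# with a slash against the host of the page URL, so no manual slash
# bookkeeping is needed.
joined_urls = [urljoin(base_url + "/profiles", href) for href in link_hrefs]
```

Both approaches should yield the same URLs for hrefs that start with a slash. The concatenation version stays closest to what the lesson builds, while urljoin() is an option if the hrefs on a page are less uniform.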

01:58 So I will save the file and run it again, and I think I should have my solution: http://olympus.realpython.org, slash, only one slash, profiles, only one time profiles, slash, and then aphrodite.

02:12 Okay, great, that’s the solution. Let’s clean it up and double check that it points to the right websites.
