Exploring the Soup in Interactive Mode

Exercises Course: Introduction to Web Scraping With Python Martin Breuss 03:36

00:00 Just as before, I want to open up a Python interpreter again to play around with this a little bit before deciding which approach I am going to take to tackle the little tasks that I have.

00:10 And something that I find quite useful is that I do not start a completely fresh interpreter, but instead run my script in interactive mode, which means that Python is going to execute the script and then put me into a REPL at the end of it.

00:25 And I can do that by saying python -i and then get links.

00:32 So this -i runs your script and then puts you into a REPL that still has access to the variables that you defined in the script.

00:42 Okay. When I run this, you can see that I get the printout of the type of the object, but now I also have access to the Beautiful Soup object and I can continue working with it, just using my REPL here.
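As a rough illustration of what interactive mode gives you, here is a minimal stand-in script. The file name get_links.py and the inline HTML are assumptions made for this sketch; the script in the lesson works on a page it has downloaded.

```python
# Save as, for example, get_links.py (hypothetical file name), then run:
#     python -i get_links.py
from bs4 import BeautifulSoup

# Placeholder HTML (an assumption for this sketch, not the lesson's page)
html = '<a href="/profiles/aphrodite">Aphrodite</a>'

soup = BeautifulSoup(html, "html.parser")

# This is the only output the script itself produces
print(type(soup))

# With -i, Python shows the >>> prompt once the script has finished,
# and the name `soup` is still defined there, ready to be explored.
```

Without the -i flag, the interpreter would exit after the print call and soup would be gone.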

00:54 Okay, so for example, what I want to do for this task up here is find the link tags. With Beautiful Soup, I can say soup.find and then pass in the type of tag that I want to find.

01:07 I know that the links are a tags, so the tag name is "a". And if I call soup.find("a"), I get the first link object returned, just like that.
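A small sketch of that call, assuming a simplified stand-in page with three profile links (the HTML below is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page that the script downloads
html = """
<html><body>
<h1>All Profiles</h1>
<a href="/profiles/aphrodite">Aphrodite</a>
<a href="/profiles/poseidon">Poseidon</a>
<a href="/profiles/dionysus">Dionysus</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag (or None if nothing matches)
first_link = soup.find("a")
print(first_link)
# <a href="/profiles/aphrodite">Aphrodite</a>
```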

01:18 You know, I didn’t have to do anything with indices like I had before, so that’s a lot easier, right? Okay. But it only gives me the first one. But of course there’s also another method here that’s called find_all().

01:30 And then I can pass in "a" as an argument here again, and this will give me a list of all of the a tags that are inside of this HTML.
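For comparison with find(), a minimal sketch of find_all() on the same kind of made-up stand-in page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page that the script downloads
html = """
<a href="/profiles/aphrodite">Aphrodite</a>
<a href="/profiles/poseidon">Poseidon</a>
<a href="/profiles/dionysus">Dionysus</a>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like collection with every matching tag
all_links = soup.find_all("a")
print(len(all_links))

# Each element is a tag object, so it can be inspected further
link_texts = [link.get_text() for link in all_links]
print(link_texts)
```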

01:44 Alright. That brings me pretty close to the solution already, right? With just this one line of code I have access to all of the different a tags.

01:53 Now I do need to get the value of the href attribute out of that. And you can do this by just using square bracket notation, actually. So I need to do this for all three links that are in the list here.

02:06 So I can loop over that list and say for link in soup.find_all("a"), so all link tags.

02:19 I want to get the link URL, and that means extracting the value of the href attribute from this Beautiful Soup object. Beautiful Soup returns Beautiful Soup objects, and find_all() returns a list of those.

02:39 So I can use this square bracket notation again. Let’s just print those out for now and see what we get there: print(link["href"]). And just like that, as you can see, I’m halfway there. I got /profiles/aphrodite, /profiles/poseidon, /profiles/dionysus. Okay.
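The loop described here could look roughly like this. The HTML is again a made-up stand-in, and the hrefs list is an extra added for this sketch:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page that the script downloads
html = """
<a href="/profiles/aphrodite">Aphrodite</a>
<a href="/profiles/poseidon">Poseidon</a>
<a href="/profiles/dionysus">Dionysus</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Square brackets on a tag look up one of its HTML attributes
for link in soup.find_all("a"):
    print(link["href"])

# The same values collected into a list instead of printed
hrefs = [link["href"] for link in soup.find_all("a")]
```

Square brackets raise a KeyError if a tag has no href attribute. If that can happen on a page, link.get("href") returns None instead.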

03:00 And this also points me to something that I might run into if I don’t think about this more, because currently you can see that my URL includes profiles already, but then the link URLs here that point forward also include the /profiles.

03:14 So I might duplicate profiles if I just stick those together. So I will have to edit something around the base URL, and then extract the href of the three links using this for loop.
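To make the duplication concrete, here is a sketch with a hypothetical base URL (http://example.com/profiles is a placeholder, not the address used in the course). It uses urljoin() from the standard library, which is one possible way to avoid the problem and not necessarily the approach the course takes:

```python
from urllib.parse import urljoin

# Placeholder base URL that already ends in /profiles (an assumption)
base_url = "http://example.com/profiles"

# href values as printed by the loop over the link tags
hrefs = ["/profiles/aphrodite", "/profiles/poseidon", "/profiles/dionysus"]

# Naive concatenation repeats the /profiles segment
naive = base_url + hrefs[0]
print(naive)
# http://example.com/profiles/profiles/aphrodite

# urljoin() treats an href starting with "/" as a path from the site root,
# so the path of the base URL is replaced instead of extended
full_urls = [urljoin(base_url, href) for href in hrefs]
for url in full_urls:
    print(url)
```

Trimming /profiles off the base URL before concatenating would give the same result for these three links.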

03:27 And yeah, stick it together and then I think it should be done. So that sounds like a plan. Let’s move on to the next lesson and tackle it.
