Make a Soup

Web Scraping With Beautiful Soup and Python Martin Breuss 03:03


00:00 Let’s get set up in the code and do a bit of orientation and then find the first element by ID.

00:09 Over here in Part 3, the third notebook, called 03_parse, we cover the parsing step in our web scraping process. I’ll click this away so we have more screen space.

00:21 Here’s another overview of the different topics that we’re going to talk about, and we need to start off by, again, scraping the site. So this is what you learned in Part 2, which is just using requests to get that specific query result and save it to the response object. I’m going to execute this first cell here—this was our scraping step in this case—and now we’re going to start to parse the results that we got back from there.

00:50 So for this, we’re using a library called Beautiful Soup, which is a standard tool for doing web scraping with Python. It’s very powerful and pretty intuitive, so it’s definitely a good library to know. There are some other ones out there as well, but Beautiful Soup is the de facto standard for web scraping.

01:09 So, I’ll go ahead and import this. I also have this installed in the virtual environment. And then you’re ready to create a soup, which is Beautiful Soup’s way of parsing the HTML content so that it’s then accessible through intuitive methods and attributes on that object.

01:28 We’re going to look at this more, but for now, the first step is always that you want to pass in the content from your scraping into this constructor and create a BeautifulSoup object.

01:39 And then you’re saving it into some variable name, and by convention this is just going to be soup. So I do this, and now we’ve parsed the content and it’s accessible here.
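
As a minimal sketch of the step just described, the snippet below passes HTML content to the BeautifulSoup constructor and stores the result in a variable named soup. The inline html bytes are a hypothetical stand-in for the content of the response object from the scraping step, and the "html.parser" argument is the parser used in the discussion further down; the notebook itself may be set up differently.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the content of the response object from Part 2.
html = (
    b"<html><head><title>Job Search</title></head>"
    b"<body><div id='resultsCol'><h2>Python Developer</h2></div></body></html>"
)

# Pass the scraped content into the constructor to create the soup.
soup = BeautifulSoup(html, "html.parser")

# The parsed document is now accessible through attributes on the object.
print(type(soup).__name__)
print(soup.title.string)
```

Naming the parser explicitly keeps the result consistent across machines, since Beautiful Soup otherwise picks whichever parser happens to be installed.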

01:52 You already see that that’s going to be pretty long. I’m going to show you the content of this.

01:57 And you see here that this—it’s a bit better formatted than the stuff that we saw before, but there’s still a lot going on, right? So, this is all of the page content.

02:09 Another way that you can see this exact response is if you head over to the site and then say View Page Source.

02:20 So, this is going to show you exactly the same code. This is what requests scrapes from the web, and Beautiful Soup, once you parse it, can represent it in a more nicely formatted way. But otherwise, here, you’re just looking at the same content that requests scraped earlier. However, the soup object that you have now has a bunch of very useful methods and ways of interacting with it to pick out the information. We’re going to look at those next.
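
To illustrate the point that the soup holds the same content in a more readable form, here is a small sketch using the prettify() method, which returns the parsed document as an indented string. The raw_html string is a made-up example, and the transcript does not say whether the notebook calls prettify() or just displays the object, so treat this as one possible way to inspect the content.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line HTML, similar in spirit to an unformatted page source.
raw_html = "<html><body><div id='resultsCol'><h2>Python Developer</h2></div></body></html>"

soup = BeautifulSoup(raw_html, "html.parser")

# prettify() returns a string with one tag per line and nested indentation.
pretty = soup.prettify()
print(pretty)

# The content is unchanged; only the layout differs.
print(len(raw_html.splitlines()), "line in the raw HTML")
print(len(pretty.splitlines()), "lines in the prettified version")
```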

02:52 So, let’s stop this video here and we’re going to look at how to actually address a specific element by ID in the next lesson.

KatMac on May 5, 2021

I saved the first HTML page of the Indeed search to my computer so that I can practice locally rather than continually scraping the site. I saved the HTML file to the same folder as my code. I wrote the following code. Is there a better way to do this?

# my code
url  = "local_test.html"

with open(url, "r", encoding="utf-8") as f:
    html    = f.read()
    print(html.find("resultsCol")) # found at 4516
    tmp = html[4400:4590]

Martin Breuss RP Team on May 5, 2021

Great idea @KatMac to be respectful towards the site and not scrape it unnecessarily! 👏

You can make your code a little easier by reading in the content of your HTML file and passing the string to Beautiful Soup just like you would with the response object from making a request.

So, this should work:

from bs4 import BeautifulSoup
url  = "local_test.html"

with open(url, "r", encoding="utf-8") as f:
    html = f.read()
    soup = BeautifulSoup(html, "html.parser")

In essence, you’re replacing the step of fetching the HTML from the web by fetching it from the file that you saved it in. All the rest of your parsing should work the same. : )
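
As a rough end-to-end sketch of this idea, the snippet below writes a small sample page to a temporary local_test.html file, reads it back, and parses it. The sample markup is invented for illustration (only the resultsCol name comes from the question above), and the find(id=...) lookup is shown as one possible replacement for searching the raw string by character position.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical page content; a real saved search page would be much longer.
sample_html = (
    '<html><body><div id="resultsCol">'
    "<h2>Python Developer</h2></div></body></html>"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save the page once, then read it back instead of scraping again.
    path = Path(tmp_dir) / "local_test.html"
    path.write_text(sample_html, encoding="utf-8")
    html = path.read_text(encoding="utf-8")

soup = BeautifulSoup(html, "html.parser")

# Look the element up by its ID rather than slicing the string by index.
results = soup.find(id="resultsCol")
print(results.h2.get_text())
```

Looking the element up through the soup keeps working if the page changes length, whereas fixed slice positions such as html[4400:4590] would need to be recalculated.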
