Creating a BeautifulSoup Object

Name: Creating a BeautifulSoup Object
Uploaded: 2024-11-05T14:00:00+00:00
Duration: 4 min 6 s

Introduction to Web Scraping With Python Martin Breuss 04:06

Transcript

00:00 Since this is meant to be a practical introduction, let’s dive right in and then talk about what we’re doing in just a moment. I’ll start by creating a script, which is just going to get us to the point of having a Beautiful Soup object and that’s what we’re going to work with throughout the course.

00:15 So I’ll create a new file. I’ll just use vim for the sake of speed here and I’ll call it soupy.py.

00:25 And in here I’m going to import the BeautifulSoup class from bs4.

00:34 And then I’ll also import the urlopen function from urllib.request. This is important because, like I mentioned previously, Beautiful Soup doesn’t actually scrape the data from the web.

00:46 You’ll need to use something else for that. Beautiful Soup is just there for giving you an interface for parsing and navigating it. So from urllib.request, I’m going to import urlopen,

01:01 and then I need a URL. So we’re going to be scraping http://olympus.realpython.org/profiles/dionysus, a site that’s specifically set up for being scraped. We set that up at Real Python, so feel free to scrape this one without worrying about anything.

01:23 Next, I will use the urlopen() function and pass it the URL. This part is actually what fetches the content from the internet and saves it to the page variable, which I then also need to read and decode.

01:40 html = page.read().decode("utf-8")

01:50 I’m not going to go into more detail here about why I’m decoding it, but the idea is just to make sure that there are no issues with certain characters that could be on the page.

01:59 And if you want to go down that rabbit hole about character encoding, we do have resources on the site. It’s just a lot to talk about for not a lot of code: you just add .decode("utf-8") to your page.read().
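To make the decoding step concrete, here is a minimal sketch that needs no network access. The bytes below are a made-up stand-in for what page.read() returns, and the title text is an assumption chosen only because it contains a non-ASCII character.

```python
# page.read() returns bytes, not str. These bytes are a made-up
# stand-in for a downloaded page (not the real profile page).
raw = "<title>Café Olympus</title>".encode("utf-8")

# Decoding with the encoding the page was written in gives readable text.
html = raw.decode("utf-8")
print(type(raw).__name__)   # bytes
print(type(html).__name__)  # str
print(html)                 # <title>Café Olympus</title>

# Decoding the same bytes with a different codec garbles the "é".
garbled = raw.decode("latin-1")
print(garbled)              # <title>CafÃ© Olympus</title>
```

The second half shows the kind of character issue the lesson is guarding against: the bytes are fine, but reading them with the wrong encoding produces the wrong text.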

02:10 Okay, next, we’re at the final step. Here, I’m going to create the Beautiful Soup object by saying BeautifulSoup( and I need to pass it the html that we just fetched from the internet and decoded.

02:26 And then I also need to pass it "html.parser". I’m going to use html.parser, which comes included with Python as well. So this is built in, and there are other ones that you may want to use, such as lxml

02:44 or another one is html5lib. But both of those are external parsers that you would need to install, because they’re third-party libraries. So we are just going to stick with html.parser

02:59 and that’s the whole file. I’m going to save it and exit. Now I can go ahead and run this file in interactive mode with python -i soupy.py. Adding this -i flag to my Python command is going to run the file and then, instead of finishing after it’s come to the last line of code in the script, it’ll drop me into an interactive REPL session where I can interact with the code.
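The steps described so far can be put together roughly as follows. This is a sketch rather than the exact file from the video: the lesson writes the steps at the top level of soupy.py, while here they are wrapped in a function so that nothing is downloaded until you call it. The names URL and make_soup are chosen for this sketch, and bs4 is assumed to be installed.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

# The practice site named in the lesson.
URL = "http://olympus.realpython.org/profiles/dionysus"


def make_soup(url):
    """Fetch url, decode the response as UTF-8, and parse it."""
    # urlopen() does the actual fetching; Beautiful Soup only parses.
    with urlopen(url) as page:
        html = page.read().decode("utf-8")
    # "html.parser" is the parser that ships with Python.
    return BeautifulSoup(html, "html.parser")
```

With this saved as soupy.py and started with python -i soupy.py, typing soup = make_soup(URL) at the prompt should produce the same kind of object the lesson works with.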

03:31 Okay, you can see I’m in a Python REPL and now I have access to soup, which is a BeautifulSoup object. And you can see that this contains some HTML that the small script we just wrote fetched from the internet.
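If you want to see what such an object looks like without touching the network, a small sketch like this one works too. The HTML string is a made-up stand-in, not the content of the real profile page, and bs4 is assumed to be installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the fetched page.
sample_html = (
    "<html><head><title>Sample profile</title></head>"
    "<body><h2>Name: Sample</h2></body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

print(type(soup))  # <class 'bs4.BeautifulSoup'>
print(soup)        # the parsed document, printed back as HTML
```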

03:46 And if you are confused about the HTML that you’re seeing here, just in case you haven’t worked with HTML much before, we’re going to take a step back and take a look at the basic HTML structure so you have an idea of what you’re looking at before working more with the Beautiful Soup object that you just created.

dngrant on Nov. 21, 2024

Hi Martin et al.,

Is there a particular reason why you chose urlopen from the urllib library as opposed to the requests library to access the web page?

I suppose it doesn’t matter how you get the URL, so long as you get the URL, but I thought I would ask anyway.

Bartosz Zaczyński RP Team on Nov. 21, 2024

@dngrant I can’t speak for Martin, but one advantage of using urllib over requests is that the former ships with Python’s standard library. My guess would be that it was probably easier to skip the installation of a third-party library in this particular course.

Martin Breuss RP Team on Nov. 21, 2024

@dngrant like you say, you could use the third-party Requests library instead of urllib to fetch the HTML content from the web.

The reason that I’m not using it here is just to reduce the number of external dependencies. Requests is a great library and there’s nothing wrong with using it, but you’ll have to install it separately. urllib ships in Python’s standard library, so using it means one less dependency for your project.

Hope that makes sense :)

dngrant on Nov. 21, 2024

@Bartosz Zaczyński, @Martin Breuss. Makes complete sense. I am enjoying the course and am learning a lot. Thank you for all that you do!
