Creating a BeautifulSoup Object
00:00 Since this is meant to be a practical introduction, let’s dive right in and then talk about what we’re doing in just a moment. I’ll start by creating a script that gets us to the point of having a Beautiful Soup object, which is what we’re going to work with throughout the course.
00:15 So I’ll create a new file. I’ll just use vim for the sake of speed here, and I’ll call it soupy.py.
00:25 And in here I’m going to import the BeautifulSoup class from bs4.
00:34 And then I’ll also import the urlopen function from urllib.request. This is important because, like I mentioned previously, Beautiful Soup doesn’t actually fetch the data from the web.
00:46 You’ll need to use something else for that. Beautiful Soup is just there to give you an interface for parsing and navigating that data. So from urllib.request, I’m going to import urlopen,
01:01 and then I need a URL. So we’re going to be scraping http://olympus.realpython.org/profiles/dionysus, a site that’s specifically set up for being scraped. We set that up at Real Python, so feel free to scrape this one without worrying about anything.
01:23 Next, I will use the urlopen() function and pass it the URL. This part is actually what fetches the content from the internet and saves it to the page variable, which I then also need to read and decode.
01:40 html = page.read().decode("utf-8")
01:50 I’m not going to go into more detail here about why I’m decoding it, but the idea is just to make sure that there are no issues with certain characters that could be on the page.
01:59 And if you want to go down that rabbit hole about character encoding, we do have resources on the site. It’s a lot to talk about for very little code: just a .decode() call added to your page.read().
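To make the decoding step a little more concrete, here is a minimal sketch that runs without any network access. The raw bytes are a made-up stand-in for what page.read() returns, not the actual content of the profile page, and the Latin-1 decode is only there to show what picking the wrong encoding can look like.

```python
# Stand-in for the bytes that page.read() would return (an assumption for
# illustration; the real page content is different).
raw = "<h2>Name: Zoë Café</h2>".encode("utf-8")

# read() gives you bytes, not text, so it has to be decoded into a str.
html = raw.decode("utf-8")

# Decoding the same bytes with a different codec does not raise an error
# here, but the non-ASCII characters come out garbled.
wrong = raw.decode("latin-1")

print(type(raw).__name__, type(html).__name__)
print(html)
print(wrong)
```

The accented characters survive the round trip only when the codec passed to decode() matches the one the bytes were written in, which is the kind of issue the explicit "utf-8" is meant to avoid.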
02:10 Okay, next, we’re at the final step. Here, I’m going to create the Beautiful Soup object by saying BeautifulSoup( and I need to pass it the html that we just fetched from the internet and decoded.
02:26 And then I also need to pass it the name of a parser, "html.parser". I’m going to use html.parser, which comes included with Python as well. So this is built in, and there are other ones that you may want to use, such as lxml
02:44 or another one is html5lib. But both of those are external parsers that you would need to install, because they’re third-party libraries. So we are just going to stick with html.parser
02:59 and that’s the whole file. I’m going to save it and exit. Now I can go ahead and run this file in interactive mode with python -i. Adding this -i flag to my Python command is going to run the file and then, instead of finishing after it’s come to the last line of code in the script, it’ll drop me into an interactive REPL session where I can interact with the code.
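Since the code on screen isn’t part of the transcript, here is a rough sketch of the same three steps (fetch, decode, parse). It is not the lesson’s exact file: the steps are wrapped in a hypothetical helper called make_soup so nothing touches the network until you call it, and the parser parameter is an addition for illustration.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

# The practice URL named in the lesson.
URL = "http://olympus.realpython.org/profiles/dionysus"


def make_soup(url, parser="html.parser"):
    """Fetch url, decode the body as UTF-8, and parse it.

    make_soup is a hypothetical helper, not part of the lesson's script.
    """
    # urlopen() does the actual network request; Beautiful Soup does not.
    with urlopen(url) as page:
        html = page.read().decode("utf-8")
    # "html.parser" ships with Python. "lxml" and "html5lib" are also
    # accepted here, but only after installing those third-party packages.
    return BeautifulSoup(html, parser)
```

With this saved as soupy.py and started with python -i soupy.py, typing soup = make_soup(URL) at the prompt should give the same kind of soup object the lesson works with. The lesson’s own script runs these steps at the top level of the file, so there soup already exists when the REPL starts.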
03:31 Okay, you can see I’m in a Python REPL and now I have access to soup, which is a BeautifulSoup object. And you can see that this contains some HTML that the small script we just wrote fetched from the internet.
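If you want to poke at a soup object without fetching anything, a sketch like the following should behave the same way in the REPL. The HTML string is a made-up stand-in, not the real profile page, so the tag contents here are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Stand-in markup (an assumption for illustration, not the real page).
html = (
    "<html><head><title>Demo Profile</title></head>"
    "<body><h2>Name: Demo</h2></body></html>"
)

# Same call as in the script: the decoded HTML plus the built-in parser.
soup = BeautifulSoup(html, "html.parser")

print(type(soup))         # the class of the object you get back
print(soup.title)         # the first <title> tag, including its markup
print(soup.title.string)  # only the text inside that tag
print(soup.prettify())    # the parsed document, re-indented for reading
```

Typing just soup at the prompt echoes the parsed HTML, which is what appears on screen at this point in the lesson.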
03:46 And if you are confused about the HTML that you’re seeing here, just in case you haven’t worked with HTML much before, we’re going to take a step back and take a look at the basic HTML structure so you have an idea of what you’re looking at before working more with the Beautiful Soup object that you just created.