Static Websites

Web Scraping With Beautiful Soup and Python Martin Breuss 07:07

00:00 Let’s scrape the website that you looked at before using Python and requests.

00:06 For this, I’m heading over to the second Jupyter Notebook that’s part of the contents in here. Keep in mind that you don’t need to use Jupyter Notebooks—it’s just for convenience. You can use any other text editor, and also we have courses and tutorials about using Jupyter Notebooks or other text editors on the Real Python site.

00:26 Okay, I’m going to make some space here, and we don’t actually need this page anymore, so I will close it out for now so that we can focus on this Notebook here.

00:37 All right. We’re getting ready to scrape the HTML content from a page, and we’re looking at static websites. A static website essentially just means that the server sends you back the information as HTML, and no further processing necessarily needs to happen for that information to appear in your browser.

00:57 What you need is going to be the requests library. That’s an external library, so it’s not in the Python standard library, which means that you need to install it with pip install requests.

01:07 You can do this with a terminal.

01:11 I already did the process. I created a virtual environment and I installed Python, but if you haven’t done that yet, make sure to look up the resources we have on that. And then go ahead and say python -m pip install requests inside of your virtual environment.

01:30 Since I already have that installed, I’m not going to run this now, and just focus on the code that we need to actually scrape the content from the page. All right.

01:41 So, at first you will have to import the requests library after you’ve installed it, and then you can see here is the URL. That’s the exact URL that we were working with before, so if I pop it over here in the browser bar, you will see it brings up the search results that we were looking at.

02:00 So, this URL essentially encodes the information that we want.

02:05 I’m defining that this is the URL that requests is going to look at, and then I’m just going to say .get() the information from this url.

02:14 You can see requests has a really nice and intuitive API, the way that you write the code. And I’m going to save the HTML that we get back from here into this response object. So, I execute this, and you can see it takes a little while. Once it’s finished, I can run response.content to take a peek at the content that we got returned from that page.
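As a rough sketch of the step just described, the code might look something like the following. The URL is a placeholder, not the one used in the lesson, and the fetch_html helper with its http_get parameter is a hypothetical name added for illustration. The sketch also assumes you want response.text (a string) rather than response.content (bytes), so that ordinary string methods work on the result.

```python
# Placeholder URL: an assumption, not the search URL used in the lesson.
URL = "https://example.com/jobs?q=python"


def fetch_html(url, http_get=None, timeout=10):
    """Return the HTML of a static page as text.

    http_get is a hypothetical hook: any callable that behaves like
    requests.get. It defaults to requests.get itself.
    """
    if http_get is None:
        import requests  # third-party: python -m pip install requests
        http_get = requests.get
    response = http_get(url, timeout=timeout)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text         # str; response.content would be bytes


# Typical use (performs a real network request):
# html = fetch_html(URL)
# print(html[:200])  # peek at the start of the document
```

The timeout and the raise_for_status() call are not part of the lesson. They are a common precaution so that a slow or failing server shows up as an error instead of a silent bad result.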

02:37 So you see there’s some HTML here. This is just the start, starting off with the !DOCTYPE declaration, and then some parts of the <head>.

02:46 You can also inspect this in a different way. This is going to be really long, so that’s why I’m not showing the full thing, but let’s say I’m going to go from 1000 to 2000. What’s in there? And see, a bunch of JavaScript in here because this site doesn’t only contain HTML, but also some JavaScript that gets executed.

03:06 So, somewhere in here—in all of this code that we got returned—is going to be the information that you’re interested in. But how are we going to find that? That’s a different story. But for now, just keep in mind—this worked. You have access to the data from over here.

03:22 You got the content returned and now, how are you going to find the specific information? You could do a string search, right? So, you know you have access to all of this, so you could just use Python’s string method and say, “Let’s .find('python') in here.”

03:38 And what you get returned is a location. I can inspect what is at this location, and you can see that we’re currently looking at the query parameters. So somewhere in here, probably at the beginning. Well, we know what the location is, 463, so I can say, “Give me 400 to 500,”

03:59 and we have a link in there—a link to the RSS feed, actually—that has that exact search query. This is not really what you’re looking for, okay? What you want is the information of the specific jobs with the job titles.

04:13 So yes, you can search for specific keywords like this using a string method, but it’s not very straightforward. You get an index out there, you have to look for that index. Also, this is the first instance of it that it finds, which is not actually what you’re looking for.
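To illustrate the string search described above, here is a minimal sketch. The html value is a made-up stand-in for the page text, not the real response from the lesson, and it is assumed to be a str (for example response.text). With the bytes from response.content, the search term would have to be bytes as well, such as b'python'.

```python
# Made-up stand-in for the HTML text returned by the request.
html = (
    '<!DOCTYPE html><html><head>'
    '<link rel="alternate" href="/rss?q=python" title="RSS feed">'
    '</head><body>'
    '<h2 class="title">Senior Python Developer</h2>'
    '</body></html>'
)

# .find() is case-sensitive and returns the index of the FIRST match,
# or -1 when there is no match at all.
first = html.find("python")

# Slice a small window around that index to see what was actually found.
window = html[max(first - 20, 0):first + 20]
print(first, window)
```

In this sample the first hit sits inside the RSS link and not in the job title, which mirrors the problem shown in the lesson.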

04:29 So, it’s far from ideal to extract information. I’m just showing you all of this so that you also understand the power of the parsing that we’re going to do in the next part of this course, which is going to make getting out this information much easier. Yeah, so another way to do this would of course be regular expressions.

04:47 You could run something like that and make a search, find all of the 'pythons' that are on this page, and there were a lot of them, you see. But again, this doesn’t really give you the content that you’re looking for in any logical way.
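The exact pattern used in the video is not shown in the transcript, so the following is only a sketch of what such a regular expression search could look like. The html value is again a made-up sample, and the case-insensitive flag is an assumption.

```python
import re

# Made-up stand-in for the HTML text returned by the request.
html = (
    '<!DOCTYPE html><html><head>'
    '<link rel="alternate" href="/rss?q=python" title="RSS feed">'
    '</head><body>'
    '<h2 class="title">Senior Python Developer</h2>'
    '</body></html>'
)

# Every occurrence of the word, ignoring case.
matches = re.findall(r"python", html, flags=re.IGNORECASE)

# The start index of each occurrence.
positions = [m.start() for m in re.finditer(r"python", html, flags=re.IGNORECASE)]

print(matches)
print(positions)
```

This finds every occurrence, but it still says nothing about which one is a job title and which one is part of a link.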

05:01 Okay. So, this is why after getting the information, you will later need to parse it, and that’s Part 3 of the web scraping pipeline, which is inspect, scrape, and parse. Parsing is going to make this extraction of the information that you’re actually interested in a lot easier.

05:20 We’re going to be using Beautiful Soup for this in the upcoming Part 3. But before we head there, there’s still a bit of information I want to walk you through, and let’s wrap up this video with a quick recap, because the actual thing that we’re doing here, scraping the HTML from the content page, took up, what was it?

05:38 Maybe two seconds of the time in this video? That is, you need to import requests (make sure that it is installed first), pop in the URL that you’re interested in, and say requests.get(url). That’s the scraping, essentially.

06:19 So, requests is a very, very useful library for web scraping. The important part here is that you need to know the URL. You understand what this means, so you could also play around with this and put in something else, right?

06:32 This is why it’s helpful to understand the connection between the site content and the URL. And then you just run requests.get() and save the output to an object, then you can work forward with it.
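To make the connection between the URL and the site content a bit more concrete, here is a small sketch that builds a search URL from its parts. The base URL and the parameter names q and where are assumptions for illustration, not the ones from the site in the lesson, and build_search_url is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder base URL: an assumption, not the site from the lesson.
BASE_URL = "https://example.com/jobs"


def build_search_url(query, location):
    """Encode the search terms as query parameters (hypothetical names)."""
    params = {"q": query, "where": location}
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("python developer", "new york")
print(url)

# Going the other way: read the parameters back out of a URL.
print(parse_qs(urlsplit(url).query))
```

Changing the arguments changes the URL, and passing that URL to requests.get() would request a different set of results.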

Lokalwerk on Oct. 14, 2022

Cloudflare … :-) … Error 1020: Access denied
