Hidden Websites

Web Scraping With Beautiful Soup and Python Martin Breuss 02:38

00:00 Let’s talk, at a high level, about the issue that you can run into when the information that you want is password-protected, so you need a user account to access it.

00:09 In this course, I’m calling these hidden websites.

00:13 So, back over in the Jupyter Notebook, we have a little section about that, which just exposes the problem here. You want to extract some information, but it’s not actually accessible as it was with the Indeed page that you looked at before. You need to log in in order to access it.

00:28 So, there are ways of doing this using requests and they’re actually pretty straightforward. You just need to provide password authentication, and requests has great ways of doing that. We also have a tutorial on how to do that.

00:40 So, if the information that you’re looking to scrape is behind some password protection, then make sure to check these ones out. And here, I just want to show you the problems that you might run into.

00:51 So, GitHub has an API where you can get information about the different repositories that are on a user’s account. For example, here is my account and I could get the information of which repositories are on there, but you need to authenticate as the user in order to be able to get this.

01:09 So, if I run a request to this API, it looks all fine, right? There’s no error or anything. But now, when I look at this, I got a 401, which means that I’m not authorized.

01:21 The response still has a .content, just as the one above had, but that .content is a message telling me that the request requires authentication. It also gives a helpful link that explains how you can do that. So make sure to check out the requests guide, which has some information about authentication, if you’re interested in scraping this.
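To make the 401 case concrete, here is a minimal sketch of how the status code and the body of such a response could be inspected. The helper name explain_response and the sample body are illustrative assumptions, not the exact payload shown in the lesson. The "message" and "documentation_url" keys follow the description above: a message plus a helpful link.

```python
import json


def explain_response(status_code, content):
    """Summarize an API response from its status code and raw body (hypothetical helper)."""
    if status_code == 401:
        # The body of a 401 is expected to be JSON, but fall back gracefully if it is not.
        try:
            body = json.loads(content)
        except ValueError:
            body = {}
        if not isinstance(body, dict):
            body = {}
        message = body.get("message", "no message")
        docs = body.get("documentation_url", "no link")
        return f"Not authorized: {message} (see {docs})"
    if 200 <= status_code < 300:
        return "OK"
    return f"Unexpected status {status_code}"


# Made-up stand-in for the bytes that response.content might hold after a 401.
sample = b'{"message": "Requires authentication", "documentation_url": "https://docs.github.com/rest"}'
print(explain_response(401, sample))
```

With a real requests response, the same helper could be called as explain_response(response.status_code, response.content).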

01:38 But essentially, all that requests can do is simulate the request that your browser would otherwise send. So if I tried to put this URL into a browser’s address bar, I’d get the exact same result.

01:50 This also requires authentication because all that requests can do is get this exact message, basically, that your browser would receive.

02:00 However, it is possible to use requests for authentication and, as I mentioned, it’s actually pretty easy. Make sure to head over to this guide, which explains in much more detail how you can solve problems like that and still be able to scrape the information that you’re interested in from the web.
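As a rough illustration of what authenticating with requests can look like, the sketch below attaches HTTP Basic credentials to a request. Several things here are assumptions: the lesson does not spell out the URL it calls, so API_URL is a guess at a GitHub endpoint that lists the authenticated user's repositories. The function names and the placeholder credentials are hypothetical too. The request is only prepared, not sent, so the header can be inspected without network access.

```python
import base64

import requests
from requests.auth import HTTPBasicAuth

# Assumed endpoint: the lesson does not state the URL it requests.
API_URL = "https://api.github.com/user/repos"


def build_authenticated_request(username, token):
    """Prepare a GET request that carries HTTP Basic credentials."""
    request = requests.Request("GET", API_URL, auth=HTTPBasicAuth(username, token))
    return request.prepare()


def fetch_repos(username, token):
    """Send the prepared request and return the response (needs network access)."""
    with requests.Session() as session:
        return session.send(build_authenticated_request(username, token), timeout=10)


# Placeholder credentials, used only to inspect the generated header offline.
prepared = build_authenticated_request("example-user", "example-token")
scheme, encoded = prepared.headers["Authorization"].split(" ", 1)
decoded = base64.b64decode(encoded).decode("utf-8")
print(scheme)
```

In everyday use, the shorter requests.get(url, auth=(username, token)) does the same thing in one call. The two-step version is used here only so that the credentials header is visible before anything is sent.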

02:17 So, while scraping content that’s hidden behind passwords is still relatively easy to do, scraping dynamically generated content is a different story and more complex.

02:28 We’re going to talk about what that means and give you an overview of possible solutions, if this is what you’re looking for, in the next lesson.

benjaminkikirov on Oct. 26, 2021

For some reason when I try to pull up the sample code, it says:

File Load Error for build-a-web-scraper(2).zip C:\Users\AZ BEST\Downloads\build-a-web-scraper(2).zip is not UTF-8 encoded

Martin Breuss RP Team on Oct. 27, 2021

Hi @benjaminkikirov what are you doing when you encounter this error? Are you running into it when you’re executing a Python script, or does it happen when you’re unzipping the ZIP archive that you downloaded?

It looks like you’re bumping into this while interacting with the ZIP file, but please describe in more detail what you’re doing. That’ll make it easier to help you resolve the error.

If you haven’t unzipped the archive, please make sure to do that. The code files are inside the ZIP archive.

Bartosz Zaczyński RP Team on Oct. 27, 2021

@benjaminkikirov Are you trying to open the ZIP archive directly in JupyterLab? Try extracting the archive’s contents first, and then open the individual .ipynb notebook files. Hope this helps.
