Hidden Websites

00:00 Let’s talk, on a high level, about the issue that you can run into when the information that you want is password-protected, so you need a user account to actually access it.

00:09 I’m calling this hidden websites here in this course.

00:13 So, back over in the Jupyter Notebook, we have a little section about that, which just exposes the problem here. You want to extract some information, but it’s not actually accessible as it was with the Indeed page that you looked at before. You need to log in in order to access it.

00:28 So, there are ways of doing this using requests and they’re actually pretty straightforward. You just need to provide password authentication, and requests has great ways of doing that. We also have a tutorial on how to do that.

00:40 So, if the information that you’re looking to scrape is behind some password protection, then make sure to check those resources out. And here, I just want to show you the problems that you might run into.

00:51 So, GitHub has an API where you can get information about the different repositories that are on a user’s account. For example, here is my account and I could get the information of which repositories are on there, but you need to authenticate as the user in order to be able to get this.

01:09 So, if I run a request to this API, it looks all fine, right? There’s no error or anything. But now, when I look at this, I got a 401, which means that I’m not authorized.

01:21 The response still has a .content, as the one above had, but that .content is a message that tells me that the request requires authentication. It also gives a helpful link that explains how you can do that, so make sure to check out the requests guide that has some information about doing that, if you’re interested in scraping this.
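Here is a minimal sketch of that check. The transcript does not name the exact endpoint, so `API_URL` is an assumption, and `explain()` and `fetch_and_explain()` are hypothetical helper names made up for this illustration. The point is only that `requests.get()` returns normally on a 401, so you have to look at `.status_code` and `.content` yourself.

```python
import json

import requests

# Assumed endpoint: an API URL that lists the authenticated user's repositories.
API_URL = "https://api.github.com/user/repos"


def explain(status_code: int, content: bytes) -> str:
    """Summarize a response from its status code and raw body."""
    if status_code == 401:
        # Even an error response has a body, and it is worth reading.
        try:
            message = json.loads(content).get("message", "")
        except (ValueError, AttributeError):
            # Not JSON, or not a JSON object: fall back to the raw text.
            message = content.decode("utf-8", errors="replace")
        return f"401 Unauthorized: {message}"
    if 200 <= status_code < 300:
        return f"{status_code} OK: received {len(content)} bytes"
    return f"Unexpected status {status_code}"


def fetch_and_explain(url: str = API_URL) -> str:
    """Send an unauthenticated GET request and describe the result."""
    # requests.get() does not raise for a 401, so check the status yourself.
    response = requests.get(url, timeout=10)
    return explain(response.status_code, response.content)
```

Calling `fetch_and_explain()` needs network access. If the endpoint behaves as described in the lesson, the result should be a line starting with `401 Unauthorized`.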

01:38 But essentially, all that requests can do is simulate the request that the browser otherwise sends. So if I put this URL into the browser’s address bar, I get the exact same result.

01:50 This also requires authentication because all that requests can do is get this exact message, basically, that your browser would receive.

02:00 However, it is possible to use requests for authentication and, as I mentioned, it’s actually pretty easy. Make sure to head over to this guide, which explains in much more detail how you can solve problems like that and still be able to scrape the information that you’re interested in from the web.
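As a rough sketch of what that can look like, requests accepts an `auth=(username, secret)` tuple and turns it into an HTTP Basic `Authorization` header. The endpoint and the helper names below are assumptions for illustration, not something shown in the lesson, and whether a service wants a password or a token as the secret depends on the service.

```python
import requests

# Assumed endpoint, for illustration only.
API_URL = "https://api.github.com/user/repos"


def build_authenticated_request(url, username, secret):
    """Prepare (but do not send) a GET request that uses HTTP Basic auth."""
    request = requests.Request("GET", url, auth=(username, secret))
    # prepare() applies the auth tuple and fills in the Authorization header.
    return request.prepare()


def fetch_with_auth(url, username, secret):
    """Send the authenticated request and return the response."""
    response = requests.get(url, auth=(username, secret), timeout=10)
    # Turn a 401 (or any other 4xx/5xx status) into an exception.
    response.raise_for_status()
    return response
```

Preparing the request without sending it lets you inspect the header offline, for example with `build_authenticated_request(API_URL, "user", "secret").headers["Authorization"]`. Sending it with `fetch_with_auth()` requires real credentials and network access.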

02:17 So, while scraping content that’s hidden behind passwords is still relatively easy to do, scraping dynamically-generated content is a bit of a different story and more complex.

02:28 In the next lesson, we’re going to talk about what that means and give you an overview of possible solutions, if this is what you’re looking for.
