Hidden Websites
00:00 Let’s—on a high level—talk about the issue that you can run into when the information that you want is password-protected so you need a user account to actually access it.
00:09 I’m calling this hidden websites here in this course.
00:13 So, back over in the Jupyter Notebook, we have a little section about that, which just exposes the problem here. You want to extract some information, but it’s not actually accessible as it was with the Indeed page that you looked at before. You need to log in in order to access it.
00:28 So, there are ways of doing this using requests, and they’re actually pretty straightforward. You just need to provide password authentication, and requests has great ways of doing that. We also have a tutorial on how to do that.
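As a minimal sketch of what password authentication with requests can look like, the snippet below builds a request that carries HTTP Basic credentials and inspects the header it would send, without contacting any server. The URL, the username, the password, and the helper name prepare_basic_auth_request are placeholders invented for illustration. Many sites use a different login scheme (tokens, login forms, OAuth), so treat this as one option rather than the method the course uses.

```python
import requests


def prepare_basic_auth_request(url, username, password):
    """Build a GET request with HTTP Basic credentials without sending it."""
    request = requests.Request("GET", url, auth=(username, password))
    return request.prepare()


# Hypothetical URL and credentials, used only for illustration.
prepared = prepare_basic_auth_request(
    "https://example.com/protected", "alice", "s3cret"
)

# requests encodes "username:password" into the Authorization header.
print(prepared.headers["Authorization"])

# To actually send such a request, you could call:
# requests.get(url, auth=(username, password), timeout=10)
```

Preparing the request first makes it possible to see exactly what would go over the wire before any credentials leave your machine.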
00:40 So, if the information that you’re looking to scrape is behind some password protection, then make sure to check these ones out. And here, I just want to show you the problems that you might run into.
00:51 So, GitHub has an API where you can get information about the different repositories that are on a user’s account. For example, here is my account and I could get the information of which repositories are on there, but you need to authenticate as the user in order to be able to get this.
01:09 So, if I run a request to this API, it all looks fine, right? There’s no error or anything. But now, when I look at the response, I get a 401, which means that I’m not authorized.
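Because requests does not raise an exception for a 401 on its own, the status code has to be checked explicitly. The sketch below shows one way to do that. The endpoint URL is an assumption (the transcript does not spell out which GitHub API URL is used), and fetch_repos is a hypothetical helper name.

```python
import requests

# Assumed endpoint that lists repositories for the authenticated user.
GITHUB_REPOS_URL = "https://api.github.com/user/repos"


def fetch_repos(url=GITHUB_REPOS_URL):
    """Return the decoded JSON body, or None when the server answers 401."""
    response = requests.get(url, timeout=10)
    if response.status_code == 401:
        # No exception was raised; the 401 is only visible on the response.
        return None
    # Turn any other 4xx or 5xx status into a requests.HTTPError.
    response.raise_for_status()
    return response.json()
```

Calling fetch_repos() without credentials would be expected to return None here, which mirrors the "no error, but not authorized" situation described above.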
01:21 The response still has a .content, just like the one above had, but that .content is a message that tells me that it requires authentication. It also gives a helpful link that explains how you can do that, so make sure to check out the requests guide that has some information about doing that, if you’re interested in scraping this.
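The message in the response body can also be read programmatically. The sketch below assumes the error body is JSON with "message" and "documentation_url" keys; those key names and the explain_failure helper are assumptions for illustration, so the code falls back to the raw text when the body looks different.

```python
import requests


def explain_failure(response):
    """Summarize an error response, preferring a JSON "message" if present."""
    try:
        payload = response.json()
    except ValueError:
        # The body was not valid JSON, so fall back to the raw text.
        return f"{response.status_code}: {response.text}"
    if not isinstance(payload, dict):
        return f"{response.status_code}: {response.text}"
    # Assumed key names; adjust them to whatever the API actually returns.
    message = payload.get("message", "no message provided")
    docs_url = payload.get("documentation_url")
    if docs_url:
        return f"{response.status_code}: {message} (see {docs_url})"
    return f"{response.status_code}: {message}"
```

You would pass in the response object you got back from requests.get() and print the returned summary.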
01:38 But essentially, all that requests can do is simulate the request that your browser would otherwise send. So if I put this URL into the browser’s address bar, I get the exact same result.
01:50 This also requires authentication because all that requests can do is get this exact message, basically, that your browser would receive.
02:00 However, it is possible to use requests for authentication, and as I mentioned, it’s actually pretty easy. Make sure to head over to this guide, which explains in much more detail how you can solve problems like that and still be able to scrape the information that you’re interested in from the web.
02:17 So, while scraping content that’s hidden behind passwords is still relatively easy to do, scraping dynamically generated content is a different story and more complex.
02:28 We’re going to talk about what that means and give you an overview about possible solutions—if this is what you’re looking for—in the next lesson.
Martin Breuss RP Team on Oct. 27, 2021
Hi @benjaminkikirov what are you doing when you encounter this error? Are you running into it when you’re executing a Python script, or does it happen when you’re unzipping the ZIP archive that you downloaded?
It looks like you’re bumping into this while interacting with the ZIP file, but please describe in more detail what you’re doing. That’ll make it easier to help you resolve the error.
If you haven’t unzipped the archive, please make sure to do that. The code files are inside the ZIP archive.
Bartosz Zaczyński RP Team on Oct. 27, 2021
@benjaminkikirov Are you trying to open the ZIP archive directly in JupyterLab? Try extracting the archive’s contents first, and then open the individual .ipynb notebook files. Hope this helps.
benjaminkikirov on Oct. 26, 2021
For some reason when I try to pull up the sample code, it says:
File Load Error for build-a-web-scraper(2).zip C:\Users\AZ BEST\Downloads\build-a-web-scraper(2).zip is not UTF-8 encoded