Scrape HTML Content From a Page (Recap)
00:00
All right! So, with this, you’re already wrapping up Part 2 of the web scraping pipeline, where we did the actual scraping and got the information from the web. We focused on static websites, so you learned how to get content using Python and requests, and you learned that it doesn’t take a lot of code. In fact, as long as you have the requests library installed, you can just import it, define the URL that you want to scrape the information from, and then call requests.get() with that specific URL.
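As a quick reminder of how little code that is, here is a minimal sketch. The function name fetch_html, the timeout value, and the placeholder URL are assumptions for illustration and not the exact code from the course.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text that the server sends back for `url`."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of
    # silently returning the body of an error page.
    response.raise_for_status()
    return response.text


# Example call with a placeholder URL (not the one used in the course):
# html = fetch_html("https://example.com")
# print(html[:200])
```

The call to raise_for_status() is optional, but it makes failed requests visible right away instead of handing you an error page as if it were content.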
00:32
It’s a very intuitive API that requests provides. It makes it easy to get the information and scrape it from the web. Then, after this, we looked into other situations that are a bit more tricky than the pretty straightforward one with this indeed.com website.
00:50
What if your information is hidden behind a password login? So, maybe it’s your specific user information that you want to scrape, and you’re sending the request with requests, but what you get as a result is some error message that tells you that you cannot access this information without a password.
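One way to notice this situation in code is to look at the status code of the response. The sketch below is only a rough check, and the helper name requires_login is made up for illustration. It assumes the site answers with 401 or 403, while many sites instead return a normal 200 response containing a login page, which this check would miss.

```python
import requests


def requires_login(response):
    """Return True if the status code suggests missing or rejected credentials."""
    # 401 Unauthorized and 403 Forbidden are the usual HTTP signals
    # that the server wants credentials or refuses access.
    return response.status_code in (401, 403)


# Example call with a placeholder URL:
# response = requests.get("https://example.com/account", timeout=10)
# if requires_login(response):
#     print("This page needs a login.")
```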
01:08
So, while we’re not going over this in this course, the good news is that requests can handle this. Just go ahead and look at the resources, and you can learn how to log in using requests and then still use it to scrape the information from the website.
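To give a rough idea of what logging in with requests can look like, here is a sketch of two common approaches: HTTP Basic authentication and a form login with a session. Which one applies depends entirely on the site. The function names, URLs, and form field names are assumptions for illustration, so check the linked resources and the site itself for the real details.

```python
import requests


def fetch_with_basic_auth(url, username, password, timeout=10):
    """Fetch a page protected by HTTP Basic authentication."""
    # Passing a (user, password) tuple as `auth` makes requests
    # send HTTP Basic credentials with the request.
    response = requests.get(url, auth=(username, password), timeout=timeout)
    response.raise_for_status()
    return response.text


def fetch_after_form_login(login_url, target_url, form_data, timeout=10):
    """Submit a login form, then fetch a page using the same session."""
    # A Session keeps cookies between requests, so the login cookie
    # set by the first response is sent along with the second request.
    with requests.Session() as session:
        login = session.post(login_url, data=form_data, timeout=timeout)
        login.raise_for_status()
        page = session.get(target_url, timeout=timeout)
        page.raise_for_status()
        return page.text


# Example call; the URLs and field names below are hypothetical:
# html = fetch_after_form_login(
#     "https://example.com/login",
#     "https://example.com/profile",
#     {"username": "me", "password": "secret"},
# )
```

Real login forms often need extra fields, such as a hidden token, so treat the second function as a starting point only.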
01:24 Then, we looked at a slightly more tricky situation, when what the server sends back is not actually the HTML with the information that you want, but instead a whole lot of code without the information that you’re looking for.
01:40
And you look at it for a while, and then you realize: it’s JavaScript! It’s not that bad, but what you have to understand for scraping dynamically generated content is that requests alone can’t do it.
01:53 So with a request, you always just get the information that the server sends you back, and in the case of dynamically generated sites, this is just going to be JavaScript code, not the information you want.
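A small illustration of that point: the HTML string below is a made-up example of what a server might send for a dynamically generated page, and the helper name and the job title are assumptions too. The text you would see in the browser is simply not in the response body.

```python
def missing_from_static_html(html, expected_text):
    """Return True if `expected_text` is absent from the HTML the server sent."""
    return expected_text not in html


# Hypothetical response body of a dynamically generated page:
# an empty container plus a script that would fill it in the browser.
static_html = """
<html>
  <body>
    <div id="app"></div>
    <script src="/static/bundle.js"></script>
  </body>
</html>
"""

# The job title is visible in the browser, but not in the raw HTML.
print(missing_from_static_html(static_html, "Python Developer"))  # prints True
```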
02:05
So, requests alone can’t do it, but Selenium can, and there are some links and some resources on scraping dynamic websites. I would consider this an advanced topic, so if you’re running into the situation that you want to scrape a site that doesn’t send you back the information directly but generates it dynamically, then I would put this a bit farther down the line.
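Just so you have seen the shape of it, here is a minimal sketch of getting rendered HTML with Selenium. It is not covered in this course, the function names are made up for illustration, and it assumes that Selenium, Chrome, and a matching driver are available on your machine.

```python
def get_rendered_html(driver, url):
    """Load `url` in the browser controlled by `driver` and return its HTML."""
    driver.get(url)
    # page_source is the HTML of the page as the browser currently sees it,
    # including changes made by JavaScript that has already run.
    return driver.page_source


def fetch_with_chrome(url):
    """Start Chrome through Selenium, grab the rendered HTML, close the browser."""
    # Imported here so the helper above also works without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes Chrome and its driver can be found
    try:
        return get_rendered_html(driver, url)
    finally:
        # Always close the browser, even if loading the page failed.
        driver.quit()


# Example call with a placeholder URL:
# html = fetch_with_chrome("https://example.com")
```

Content that loads with a delay may not be in page_source yet when this returns, which is one reason scraping dynamic sites counts as the more advanced topic.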
02:27 First, learn how to do the static scraping, and then move on to scraping with Selenium.
02:33 And that about wraps up Part 2, where you learned how to scrape content from the web, focusing on how to do it for static websites, and got some outlook, tools, and explanations on where you can look if you have some more challenging examples, for example password-protected websites or dynamically generated content.
02:54 See you in Part 3, where you will learn about how to parse the information that you just scraped and actually pick out the pieces of information that are interesting to you. See you in Part 3!