Scrape HTML Content From a Page (Recap)
All right! With this, you’re wrapping up Part 2 of the web scraping pipeline, where you did the actual scraping and got the information from the web. We focused on static websites, so you learned how to get content using Python and requests, and you learned that it doesn’t take a lot of code. In fact, as long as you have the requests library installed, you can just import it, define the URL that you want to scrape the information from, and then call requests.get() on that URL.
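As a reminder of how little code that is, here is a minimal sketch of the import, URL, and requests.get() steps. The URL is a placeholder and fetch_html() is a hypothetical helper name used only for illustration; the timeout and the status check are additions that the recap itself doesn’t mention.

```python
import requests

# Placeholder URL for illustration; swap in the page you want to scrape.
URL = "https://www.example.com/"


def fetch_html(url, timeout=10):
    """Return the HTML of a static page as a string."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    return response.text


# Example call (performs a real network request):
# html = fetch_html(URL)
# print(html[:200])
```

The returned string is the raw HTML that you would hand over to a parser in the next step.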
It’s a very intuitive API that requests provides, and it makes it easy to scrape information from the web. After this, we looked at situations that are a bit trickier than the pretty straightforward one with the indeed.com website.
What if your information is hidden behind a password login? Maybe it’s your specific user information that you want to scrape, and you’re sending the request with requests, but what you get back is an error message telling you that you can’t access this information without a password.
So, while we’re not going over this in this course, the good news is that requests can handle this. Just go ahead and look at the resources, and you can learn how to log in using requests and then still use it to scrape the information from the website.
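To give a rough idea of what logging in can look like, here is a sketch that assumes the site uses a plain HTML login form. That is only one of several possible login mechanisms, and this is not the method from the linked resources. The function name fetch_behind_login() and the form field names "username" and "password" are hypothetical; a real site will have its own field names and may require extra steps such as CSRF tokens.

```python
import requests


def fetch_behind_login(login_url, protected_url, username, password, timeout=10):
    """Log in with a form POST, then fetch a page using the same session."""
    # A Session keeps cookies between requests, so the login carries over.
    with requests.Session() as session:
        # "username" and "password" are hypothetical form field names;
        # inspect the site's login form to find the real ones.
        login_response = session.post(
            login_url,
            data={"username": username, "password": password},
            timeout=timeout,
        )
        login_response.raise_for_status()

        # The session now sends the login cookies along with this request.
        page_response = session.get(protected_url, timeout=timeout)
        page_response.raise_for_status()
        return page_response.text
```

If a site uses HTTP Basic Authentication instead of a form, passing auth=(username, password) to requests.get() may be all that’s needed.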
01:24 Then, we looked at a slightly more tricky situation, when what the server sends back is not actually the HTML with the information that you want, but instead a lot of code without the information that you’re looking for.
requests alone can’t do it, but Selenium can, and there are some links and resources on scraping dynamic websites. I would consider this an advanced topic, so if you want to scrape a site that doesn’t send you back the information directly but generates it dynamically, then I would put this a bit farther down the line.
02:27 First, learn how to do the static scraping, and then move on to scraping with Selenium.
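For when you do get to that point, here is a rough sketch of the idea, assuming Selenium is installed along with a browser it can drive (Firefox here, chosen only for illustration). The helper name fetch_rendered_html() and the URL are hypothetical. Selenium loads the page in a real browser, so the page’s JavaScript runs before you read the HTML.

```python
def fetch_rendered_html(driver, url):
    """Load url in the browser controlled by driver and return its HTML."""
    driver.get(url)
    # page_source is the browser's current view of the page, which usually
    # includes content that JavaScript has added after the initial load.
    return driver.page_source


def main():
    # Imported here so the helper above can be reused with any driver object.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumes Firefox is installed
    try:
        html = fetch_rendered_html(driver, "https://www.example.com/")
        print(html[:200])
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


# Call main() to try it with a real browser.
```

Content that loads some time after the page opens may still be missing at this point, and the Selenium resources cover waiting for it.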
02:33 And that about wraps up Part 2, where you learned how to scrape content from the web, focusing on how to do it for static websites and giving you some outlook, tools, and explanations on where you can look if you have some more challenging examples, such as password-protected websites or dynamically generated content.
02:54 See you in Part 3, where you’ll learn how to parse the information that you just scraped and actually pick out the pieces of information that are interesting to you. See you in Part 3!