Dynamic Websites

Web Scraping With Beautiful Soup and Python Martin Breuss 05:21

00:00 Next, let’s look at scraping dynamic websites and some of the trouble that you might run into, which may be unexpected if you’ve seen how easy it is to scrape the content of a static website.

00:13 So, what does it mean that a website’s content is dynamically generated? If I tried to sum it up, essentially, it means that the server, instead of sending you back the actual information that you want, sends you back some JavaScript code.

00:27 And the reason for that is that it’s quicker to send a bit of code that later generates the data inside of your browser than to send all the information of the page. This essentially offloads some of the computing work onto the browser of the user who wants to access the data, and thus takes it away from the server of the web application that you’re interacting with. This is important if you want to scale to millions and millions of users: it can keep your infrastructure smaller and make the people who want access to your page, essentially, do the work in order to get the data.

01:05 So, this is a smart move from the company’s perspective because it puts the work that needs to be done onto your browser rather than their server, but it makes scraping the page more difficult. And why is that?

01:21 Think about Twitter, for example, as a page that dynamically generates information.

01:29 If I search for “realpython” on Twitter, I get all this information. All the tweets are here.

01:35 So, it looks like this could be similar to how it was with indeed.com, but it is not. We do have the query parameter up here that tells us we’re searching for realpython on the Twitter domain, but now if I run this in requests,

01:51 it all seems fine. There is no problem, and I can check the status code. Here we have a 200, which means it’s a success, so there’s no authorization problem here. We’re getting everything, so you might expect that it’s fine and that you have all the data, all the tweets that you’re interested in. However, when you run this, you can see there’s a lot of code. I’m printing out all of it right now.

02:14 Feel free to look into this more, but I can also tell you that all of this stuff that’s in here is just JavaScript code and some HTML and some CSS, but not the information that you’re looking for, because what it sends you back is this code stuff, here, that then your browser executes to actually pull the specific tweets and the information that you’re looking for. So, let’s see—regex.

02:38 If I go in here and I search for “regex”—

02:43 ha, that’s just part of our Notebook. But in all of this information here, I can’t find this word, even though it clearly is part of one of those tweets.

02:55 So the problem is that this information simply isn’t there yet. All I’m getting with this long, long, long string of code is the instructions on how to fetch that information from the server and how to generate it using JavaScript and the computing power of your own browser. So, I hope with this, you can see that there are some troubles that you run into when the content is dynamically generated. You can’t simply requests.get() the information, because the response that you get, even though the request itself succeeded, is not going to contain the content and information that you’re looking for.
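The check described here can be sketched in a few lines. This is only an illustration under stated assumptions: the two HTML strings are made-up stand-ins for a static page and for the kind of JavaScript shell a dynamic site returns, and the helper names `visible_text` and `contains_term` are hypothetical. It uses only the standard library, and it ignores text inside script and style tags so that a term appearing only in code does not count as page content.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that is not inside <script> or <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text outside of script and style tags, whitespace-normalized."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def contains_term(html, term):
    """Case-insensitive check for a term in the visible text of the HTML."""
    return term.lower() in visible_text(html).lower()


# Made-up sample documents (assumptions, not real responses).
STATIC_PAGE = "<html><body><p>Learn regex with Real Python</p></body></html>"
DYNAMIC_SHELL = (
    '<html><body><div id="root"></div>'
    '<script>fetch("/api/tweets?q=realpython")</script></body></html>'
)

print(contains_term(STATIC_PAGE, "regex"))    # True
print(contains_term(DYNAMIC_SHELL, "regex"))  # False
```

With a live page, you would pass `requests.get(url).text` as the `html` argument. A 200 status code combined with a `False` result for text that you can see in the browser is a strong hint that the content is generated by JavaScript.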

03:33 So, scraping dynamic websites is a bit more advanced, but there are obviously ways of doing this and I’ve added some links here. You can check out requests-html, which is from the same team that created the requests library but also allows you to do scraping of dynamic websites and parsing right away.
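As a rough idea of what that looks like, here is a minimal sketch with requests-html. It is not part of the course material. The function name `fetch_rendered_html` is hypothetical, the package has to be installed separately, and the first call to `render()` downloads a copy of Chromium, so treat it as a starting point rather than a finished solution.

```python
def fetch_rendered_html(url):
    """Return the HTML of `url` after its JavaScript has run (sketch)."""
    # Imported inside the function so that defining it does not
    # require requests-html to be installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        # Runs the page's JavaScript in a headless browser and
        # replaces response.html with the rendered result.
        response.html.render()
        return response.html.html
    finally:
        session.close()
```

The returned string can then be handed to Beautiful Soup in the same way as the text of a normal requests response.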

03:51 And then a very commonly used tool for scraping dynamic websites is Selenium. There’s also a tutorial that you can check out on Real Python about working with Selenium for scraping dynamic content, but we are not going to go into this in this course.
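For orientation only, a Selenium version might look like the sketch below. It assumes that Selenium and a Firefox driver are installed, and both the function name `page_source_with_selenium` and the `css_selector` parameter are hypothetical. The selector stands for some element that only exists once the dynamic content has loaded, which you would find by inspecting the page.

```python
def page_source_with_selenium(url, css_selector, timeout=10):
    """Load `url` in a real browser and return the HTML after rendering (sketch)."""
    # Imported inside the function so that defining it does not
    # require Selenium to be installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Wait until an element matching the selector is in the DOM,
        # which signals that the JavaScript has produced the content.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Because the browser executes the JavaScript, `page_source` contains the generated content, and the inspect, scrape, and parse steps from the static case apply to it unchanged.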

04:07 So that’s out of the scope of this course, but I wanted to make you aware of what the problem here is, and that you might run into it if you go off on your own and you have this specific website that you want to scrape and you try applying the techniques that you’re learning here, but then you run into a problem like this—that the response that you’re getting actually does not contain the information that you want. The reason is very likely that this is a dynamic website.

04:30 You’re just getting JavaScript code and not the information that you want. Okay, and my suggestion, if you’re getting into web scraping, is to start off with static websites because they’re easier, and they’re going to help you get through this process of inspect, scrape, and parse, which is going to be the same for dynamic websites.

04:50 You want to get very used to this process, understanding the HTML in there and how you can interact with it. Once you are familiar and comfortable with scraping static websites, you can move on to exploring some of these tools and the links that I have here in the Notebook for you, which show how to also scrape dynamic content.

05:09 Okay! And that wraps up the short overview and intro to dynamic websites, and also brings us to the recap for scraping that you’re going to see in the next lesson.

samsku1986 on Nov. 2, 2020

from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://l42-harmony-01.video54.local/release_planners/969/results')

Getting this error while I try:

>>> r.html.render()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/users/home/ssugatha/.local/lib/python3.5/site-packages/requests_html.py", line 341, in render
    content, result = loop.run_until_complete(_async_render(url=self.url, script=script, sleep=sleep, scrolldown=scrolldown))
  File "/usr/local/lib/python3.5/asyncio/base_events.py", line 466, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.5/asyncio/futures.py", line 293, in result
    raise self._exception
  File "/usr/local/lib/python3.5/asyncio/tasks.py", line 239, in _step
    result = coro.send(None)
  File "/users/home/ssugatha/.local/lib/python3.5/site-packages/requests_html.py", line 310, in _async_render
    page = await browser.newPage()
AttributeError: 'coroutine' object has no attribute 'newPag

Martin Breuss RP Team on Nov. 2, 2020

Hi @samsku1986. This is probably because the content you are trying to load needs to be loaded asynchronously.

You could try again using an AsyncHTMLSession object instead:

requests.readthedocs.io/projects/requests-html/en/latest/#tutorial-usage

However, the page you are requesting doesn’t even load in my browser when I try to access it, so it might be related to that instead.
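A minimal sketch of the AsyncHTMLSession approach, based on the usage shown in the requests-html documentation. The function name `fetch_rendered_async` is hypothetical, and whether this resolves the traceback above is an assumption that depends on the local setup:

```python
def fetch_rendered_async(url):
    """Render `url` with AsyncHTMLSession and return the resulting HTML (sketch)."""
    # Imported inside the function so that defining it does not
    # require requests-html to be installed.
    from requests_html import AsyncHTMLSession

    asession = AsyncHTMLSession()

    async def get_page():
        response = await asession.get(url)
        # arender() is the awaitable counterpart of render().
        await response.html.arender()
        html = response.html.html
        await asession.close()
        return html

    # run() takes coroutine functions and returns a list of their results.
    results = asession.run(get_page)
    return results[0]
```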

samsku1986 on Nov. 3, 2020

That page is not public. That is a local site.
