00:00 In the previous lesson, I talked about data analysis and visualization. In this lesson, I’ll talk about crawling and scraping the web.
00:08 In the last lesson, I talked about doing analysis on data, but just where do you get that data from? Well, there are sites where you can download common formats like CSV.
00:18 There are sites that have APIs, and then there’s the third option—grabbing content from web pages with no formal interface. Crawling is the act of programmatically traversing multiple pages on a site, while scraping is the act of getting content from those pages.
00:35 To do this, you need to be able to parse the contents of the page and separate the HTML from the data that you actually want. To scrape a web page, you need to visit it.
00:46 The standard library provides the urllib module to do URL manipulation and site visiting.
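To make that concrete, here is a minimal sketch of fetching a page with urllib; the URL is just a placeholder for whatever page you want to visit, not a site from the lesson.

```python
from urllib.request import urlopen

# Fetch a page with the standard library (no third-party installs needed).
# The URL is only an example placeholder.
with urlopen("https://example.com") as response:
    html = response.read().decode("utf-8")

print(html[:200])  # first 200 characters of the page's HTML
```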
00:53 Although the standard library does have features for visiting web pages, its interface could be a little simpler. The de facto library to do this kind of work is called requests.
01:02 This library is so popular that there was talk for a while about pulling it into the standard library. They decided against this so that it could be updated at a different frequency than the yearly cadence used for the CPython interpreter.
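Here is the same fetch sketched with requests, assuming you have installed it with pip; again, the URL is just a placeholder.

```python
import requests

# Fetch a page with requests (pip install requests).
# The URL is only a placeholder for the page you actually want to scrape.
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()  # raise an error for 4xx/5xx responses

html = response.text  # decoded HTML body
print(html[:200])
```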
01:17 Once you’ve visited the page and grabbed its HTML, you’ve got to process it to pull your data out of it. Beautiful Soup is an excellent HTML and XML parsing library, and one of its specialties is dealing with broken HTML.
01:30 You’d be surprised just how much of the web isn’t actually compliant with the standards. Your browser does a lot of heavy lifting for people who have written badly-formed pages.
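As a sketch of what parsing looks like with Beautiful Soup (pip install beautifulsoup4), here is a small example; the URL is a placeholder, and the selectors are just generic tags you would swap for whatever the real page uses.

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page and parse it; the URL is only a placeholder.
html = requests.get("https://example.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)          # text inside the page's <title> tag
for link in soup.find_all("a"):   # every link on the page
    print(link.get("href"))
```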
01:40 If you’re going to be crawling and scraping multiple sites, the Scrapy library is relatively simple to use and quite powerful. You can have it visit all the pages in a particular site with just a few lines of code.
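To give a feel for how little code a crawl takes, here is a sketch of a Scrapy spider; the site and the CSS selectors are placeholders, not anything from the lesson.

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    """A tiny spider sketch; the site and selectors are placeholders."""
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Yield a dict for each heading found on the current page.
        for heading in response.css("h1::text").getall():
            yield {"heading": heading}

        # Follow every link on the page and parse those pages too.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

You would typically save a spider like this in its own file and run it with the scrapy runspider command rather than calling it directly from Python.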
01:54 These next two aren’t technically Python. They’re both browser automation tools, meaning you can use a script to make a web browser act like someone is using it.
02:03 Both Selenium and Playwright have Python bindings, so the script you write can be in your favorite language. Both tools can be used to do testing when you’re writing web pages or just to scrape content.
02:15 Since these both actually use a browser, they can interact with pages that are heavy on JavaScript, like single-page applications.
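For a sense of what browser automation looks like from Python, here is a minimal Playwright sketch, assuming you have run pip install playwright and then playwright install to download a browser; the URL is a placeholder.

```python
from playwright.sync_api import sync_playwright

# Drive a real (headless) browser; JavaScript on the page runs as usual.
# The URL is only a placeholder.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())       # title after any JavaScript has run
    html = page.content()     # fully rendered HTML
    browser.close()
```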
02:25 For a list of web scraping tutorials, see this topic, or for a step-by-step journey, follow this learning path. This tutorial shows you how to use Beautiful Soup.
02:36 I just love the name of that library, and this project shows you how to scrape a site to get pricing information on Bitcoin. If you become obscenely rich, don’t forget to tip your video host.
02:48 That’s it for this second section. Next up, embedded systems and robotics.