Introduction to Web Scraping With Python (Overview)
Web scraping is the process of collecting and parsing raw data from the Web, and the Python community has come up with some pretty powerful web scraping tools.
The Internet hosts perhaps the greatest source of information on the planet. Many disciplines, such as data science, business intelligence, and investigative reporting, can benefit enormously from collecting and analyzing data from websites.
In this video course, you’ll learn how to:
- Parse website data using string methods and regular expressions
- Parse website data using an HTML parser
- Interact with forms and other website components
00:00 Welcome to this practical introduction to web scraping using Beautiful Soup. You may have heard the name before: it’s a commonly used library for web scraping tasks in Python, especially if you’re getting started and working on smaller-scale projects.
00:16 Perfect for a practical introduction. In this course, you’ll start by understanding some relevant terms about web scraping and the library you’ll be working with, and you’ll get a quick brush-up on basic HTML structure. Then you’ll set up your environment by installing Beautiful Soup, create a Beautiful Soup object, and start working with that.
00:38 You’ll learn how you can navigate the parse tree using this library, how you can search the parse tree to identify specific information, and even how you can modify it so that you can write back out a different HTML document than the one you initially received.
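As a rough preview of those three operations, here is a minimal sketch. It assumes Beautiful Soup is installed as the `bs4` package and uses Python's built-in `html.parser`. The HTML snippet and all variable names are made up for illustration and do not come from the course.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML document, invented for this example.
html = """
<html>
  <head><title>Soup of the Day</title></head>
  <body>
    <h1 id="menu">Menu</h1>
    <ul>
      <li class="soup">Tomato</li>
      <li class="soup">Minestrone</li>
    </ul>
  </body>
</html>
"""

# Create a Beautiful Soup object using the parser from the standard library.
soup = BeautifulSoup(html, "html.parser")

# Navigate: step through the parse tree using tag names as attributes.
title_text = soup.title.string
heading = soup.body.h1

# Search: collect the text of every <li> element with the class "soup".
soups = [li.get_text() for li in soup.find_all("li", class_="soup")]

# Modify: replace the heading text, then serialize the tree back to HTML.
heading.string = "Updated Menu"
modified_html = str(soup)

print(title_text)
print(soups)
print(modified_html)
```

The serialized `modified_html` differs from the input only in the heading text, which is the "write back out a different HTML document" idea in miniature.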
00:54 You’ll also get an overview of how you can handle common web scraping challenges, and finally you’ll learn what you should keep in mind to scrape ethically and which best practices you can follow.
01:05 I will also have some additional resources for you at the end. Those will be resources on the Real Python site that you can use to deepen your knowledge about web scraping if this topic interests you and walking through this course gave you a nice idea of what it’s all about.
01:22 This course is based on a tutorial on the site that also covers a couple of additional topics that I won’t be talking about in this course. Specifically, these are using string methods for text extraction, which is really tiring, and using regular expressions for text extraction, which is powerful but can be kind of confusing.
01:42 I’m cutting out these two points, which are interesting for context but not very useful in practice, because high-level libraries such as Beautiful Soup exist to make these tasks easier for you.
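For context on the two approaches being skipped, here is a minimal sketch that uses only the standard library. The HTML strings and variable names are invented for illustration, and the regular expression is one possible pattern, not necessarily the one used in the tutorial.

```python
import re

# Hypothetical, well-behaved HTML string.
html = "<html><head><title>Soup of the Day</title></head></html>"

# String methods: find the tag positions by index, then slice between them.
start = html.find("<title>") + len("<title>")
end = html.find("</title>")
title_from_find = html[start:end]

# The same page with slightly messy tags (uppercase, stray whitespace).
messy_html = "<html><head><TITLE >Soup of the Day</title  ></head></html>"

# The exact-match search no longer finds the opening tag: find() returns -1.
find_result_on_messy = messy_html.find("<title>")

# Regular expression: tolerate extra characters inside the tags and mixed case.
pattern = r"<title.*?>(.*?)</title.*?>"
match = re.search(pattern, messy_html, re.IGNORECASE | re.DOTALL)
title_from_regex = match.group(1) if match else None

print(title_from_find)
print(find_result_on_messy)
print(title_from_regex)
```

The index arithmetic breaks as soon as the markup varies, and the pattern that copes with the variation is already harder to read, which is why an HTML parser tends to be the more comfortable tool for this job.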
01:53 So if you’re curious, go ahead and read about those in the tutorial. The tutorial also covers how you can interact with HTML forms, such as filling in and submitting a form, using another library called MechanicalSoup that builds on top of Beautiful Soup.
02:08 This is actually pretty cool, so I’d suggest you check it out after you’re finished with this course. You’ll also learn how you can interact with websites in real time using that same library, MechanicalSoup, which is also quite cool. So take a look at that when you’re done with the course.
02:24 The tutorial is called “A Practical Introduction to Web Scraping with Python”, and if you’ve downloaded the slides, you can also just click on the title here to navigate there.
02:32 You’ll see it again in the additional resources at the end.
02:36 And with that out of the way, are you hungry? Let’s have some beautiful soup together.