Web Scraping With Beautiful Soup and Python (Summary)
In this course, you’ve learned how to scrape data from the Web using Python, requests, and Beautiful Soup. You built a script that fetches job postings from the Internet and went through the full web scraping process from start to finish.
You learned how to:
- Inspect the HTML structure of your target site with your browser’s developer tools
- Gain insight into how to decipher the data encoded in URLs
- Download the page’s HTML content using Python’s requests library
- Parse the downloaded HTML with Beautiful Soup to extract relevant information
Beautiful Soup is packed with useful functionality to parse HTML data. It’s a trusted and helpful companion for your web scraping adventures. Its documentation is comprehensive and relatively user-friendly to get started with. You’ll find that Beautiful Soup will cater to most of your parsing needs, from navigating to advanced searching through the results.
Congratulations, you made it to the end of the course! What’s your #1 takeaway or favorite thing you learned? How are you going to put your newfound skills to use? Leave a comment in the discussion section and let us know.
00:15 And also, you’ve been heavily making use of your browser and specific tools inside of your browser. Now, let’s start off by talking about the first section of this course, about the intro to web scraping, where you learned what is web scraping in the first place, what are some common problems with it, and why would you want to use it in the first place. First of all, you learned that web scraping means gathering information from the web, and that it generally means doing this in an automated manner.
00:42 Like, usually, it’s not about clicking a lot and copy and pasting, but instead writing some code that does the scraping for you. Next, you learned about what are some common challenges of web scraping, and there’s two of them that you learned about in this course—the first one being variety, meaning that every website is specific, it’s got its own structure, and if you want to scrape it, you’re going to have to know what the structure is like so that you’re able to address the specific pieces of information that you want.
01:12 That just means every web scraping code you write is going to be specific to one website, and maybe you can reuse some parts of it, but generally every website is different, so also all of your code is going to have to be different.
01:25 The second problem that you learned about is the durability of web scrapers. They are usually not very durable because everything on the web changes constantly, and once the structure of a website that you’re working with changes, you will also have to change the code of your web scraper.
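To make the durability problem concrete, here is a minimal sketch of a parsing helper that fails softly when a site changes its markup. The function name `extract_title`, the tag names, and the class names are all hypothetical and only serve as an illustration; they are not taken from the course.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the job title, or None if the expected element is missing."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns None when nothing matches, so check before using it.
    element = soup.find("h2", class_="title")
    if element is None:
        return None
    return element.text.strip()


# Hypothetical markup before and after a site redesign.
old_layout = '<h2 class="title">Python Developer</h2>'
new_layout = '<h3 class="job-title">Python Developer</h3>'

print(extract_title(old_layout))
print(extract_title(new_layout))
```

Returning `None` instead of raising is one possible design choice. It lets the calling code notice that the page structure changed and decide whether to skip the item or stop.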
01:55 An API generally gives you back data in a different format (for example, JSON or XML), and these formats are designed to be easy to work with in a program. They’re not designed to be looked at, but to be consumed by a program. By contrast, when you’re accessing a page with your browser, what you usually get back is HTML, which is designed to be rendered for a human to view.
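As a small illustration of why a format like JSON is easy for a program to consume, the sketch below parses a made-up response body with Python's standard `json` module. The payload and its field names are assumptions for illustration only and do not come from any real API.

```python
import json

# Hypothetical API response body; the field names are invented.
api_response = '{"jobs": [{"title": "Python Developer", "location": "Remote"}]}'

# json.loads() turns the text into plain Python dicts and lists.
data = json.loads(api_response)
first_job = data["jobs"][0]
title = first_job["title"]

print(title)
```

No searching through markup is needed here: the structure of the data maps directly onto Python objects.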
02:29 You also learned that APIs are not what we’re dealing with in this course, but instead—because many websites do not provide APIs, but just have this HTML that they’re sending back—the web scraping that you’re working with in this course handles HTML and how to get this soup of strings and information and pick out the pieces that you’re interested in. And with that, you already moved on to Part 1, which was about inspecting your data source. The tools that you used for this were your eyes, your browser, and your developer tools.
03:01 We started off by exploring the website, so that’s really just going to the website like any normal user would and clicking around to figure out what happens, what is possible to do, which information is on the website.
03:13 So, that gives you a first look into what you might be interested in scraping from that website. Then, you went ahead and learned about URLs and what information you can gather from looking at them. While you’re clicking through the website, you can see the URLs change, and there’s a bunch of information that can be stored in query parameters at the end. You learned how to decipher this information. Next, you dove even deeper by using your browser’s developer tools to inspect specific elements of the page and see how they are represented in the Document Object Model, which underlies the page, represents its structure, and is more or less equivalent to the HTML that you’re getting back.
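Deciphering the query parameters of a URL can also be done in code. The following is a minimal sketch using only the standard library; the search URL and its `q` and `l` parameters are an illustrative assumption of what a job search URL might look like.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed example URL; the parameter names are for illustration.
url = "https://www.indeed.com/jobs?q=software+developer&l=Australia"

parts = urlsplit(url)

# Everything before the "?" identifies the page itself.
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# parse_qs() decodes the query string into a dict of lists.
query = parse_qs(parts.query)

print(base_url)
print(query)
```

Each value is a list because a query string may repeat the same parameter more than once.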
After getting a good understanding of what your website looks like and what data you’re working with, you moved on to Part 2, which is about scraping the HTML content from a page, so getting the information into your code so that you can work with it programmatically. Now for this, in this course, we’re using Python and the requests library, which is a very convenient way to do that.
04:19 It provides an easy interface for getting HTML from the internet. And the website that you’re working with here, the indeed.com website, is a static website, which means it gives you HTML back from the server directly, which makes scraping it much easier than some other examples.
We also looked at some other examples later, but let’s start off with the static website. Here, you can see some code and you can see it’s very concise because requests has such a nice interface that doesn’t need a lot of lines of code to get the HTML back into your code. This is all you need, three lines of code, and this gives you the HTML to work with. Now, there’s other situations that you might run into. For example, you might want to scrape information that is hidden behind a login, and you would need to authenticate first.
You learned that this is possible with the requests library. You also learned about dynamic websites, which build their content with JavaScript in the browser. requests is not the right tool if you are running into a situation like that, and you also learned that there are tools out there for this, one example being Selenium. So if you encounter a website like this, you know which direction to look in, but we didn’t go into this in this course.
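The fetch step for a static page can be sketched roughly as follows. This is not the exact code from the course: the helper names `build_search_url` and `fetch_html`, the `example.com` address, and the `q` and `l` parameters are assumptions made for illustration.

```python
import requests


def build_search_url(base_url, query, location):
    """Show the URL requests would send, without touching the network."""
    request = requests.Request(
        "GET", base_url, params={"q": query, "l": location}
    )
    return request.prepare().url


def fetch_html(url, params=None, timeout=10):
    """Download a page and return its HTML as text."""
    response = requests.get(url, params=params, timeout=timeout)
    # Raise an exception for 4xx and 5xx responses instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.text


print(build_search_url("https://example.com/jobs", "python", "remote"))

# Example call (needs network access, so it is left commented out):
# html = fetch_html("https://example.com/jobs", params={"q": "python"})
```

Passing a `timeout` and calling `raise_for_status()` are optional additions beyond the bare minimum, but they tend to make failures easier to spot.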
05:52 After you have the information in your script, you’ve scraped the page. Now, it’s time to parse the HTML in order to pick out the pieces of information that you’re interested in. And in this course, we did that using Python and the Beautiful Soup library.
06:09 Now, I’m not going to go over the code snippets again here, but you learned how to find elements by a specific ID, how to find them by an HTML class name, how to extract the text from an HTML element, and how to extract its attributes. In our example, the attribute was the URL from a link element.
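The four operations just listed (find by ID, find by class name, extract text, extract an attribute) can be sketched with Beautiful Soup on a small inline document. The HTML, IDs, and class names below are invented for illustration and do not reflect the real structure of any job site.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded results page.
HTML = """
<div id="results">
  <div class="job-card">
    <h2 class="title">Python Developer</h2>
    <a class="apply" href="https://example.com/jobs/1">Apply</a>
  </div>
  <div class="job-card">
    <h2 class="title">Data Engineer</h2>
    <a class="apply" href="https://example.com/jobs/2">Apply</a>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Find a single element by its ID.
results = soup.find(id="results")

# Find all elements with a given class name.
# Note the trailing underscore: "class" is a reserved word in Python.
cards = results.find_all("div", class_="job-card")

jobs = []
for card in cards:
    # Extract the text content of an element.
    title = card.find("h2", class_="title").text.strip()
    # Extract an attribute value with dictionary-style access.
    link = card.find("a")["href"]
    jobs.append((title, link))

print(jobs)
```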
06:32 What I want to stress in this part is that it really is a very iterative process and that you should always just keep in mind that you can switch between your browser and your developer tools to inspect a bit more what is the exact item that you need—where you want to get the text from, for example—and then switch back to your code and write the necessary pieces, run it, see what comes out there, and keep refining it by switching between these two different tools that you have. And in this final section, I introduced you to a Jupyter Notebook with a bunch of tasks for you that lead you forward in working more with the indeed.com website in order to gather more practice doing web scraping—both the scraping aspect, as well as the parsing aspect, and the inspecting—all of the parts that are important.
07:23 I hope that this is going to be interesting for you to work on personally, and that you just pick out the pieces that you’re interested in and train your web scraping muscles more so that you get familiar with working with these tools, and then move on to scrape whatever you’re interested in scraping from the web. Now, as a quick recap of the recap, I just want to say, keep in mind the iterative web scraping process. It consists of three pieces: inspecting, scraping, and parsing.
07:53 The inspect part is an important one. You need to understand your website in order to be able to work with it, in order to be able to scrape and parse it. And for this, I want to stress that every page is special, so you really just need to get in there and understand first what is the structure, what is the information you want, and how are you going to be able to get it.
08:14 And with this, I want to say congratulations again for making it all the way through the course. I hope you enjoyed it and that you learned something about web scraping and that you know how to work forward and get some more practice. Keep in mind that it’s all about training your skills, and if you keep training that, you’re going to get good at web scraping.