
Introduction to Web Scraping

00:00 You’ll start this part off by learning what web scraping is. Then we’re going to look at why you would even want to scrape the web in the first place, talk about a couple of challenges of web scraping (specifically, the variety of websites and how long your code can last), and finally talk about APIs, which are an often useful alternative to web scraping.

00:20 Let’s get started by talking about what web scraping is in the first place.

00:24 So, you’ve probably heard this term before but maybe you’re not entirely sure what it means. Generally, it could mean any kind of information gathering from the internet.

00:34 So, just pulling information from the web, even if you do it manually by going to your favorite song lyrics page and copy-pasting things from there into a local TXT file, would count as web scraping. But generally, when someone talks about web scraping, they mean the automated gathering of information from the web.

00:54 So that’s what web scraping is about: writing some code that fetches information from the internet. Now, why would you want to scrape the web? You can think about maybe the song lyrics example that I mentioned before.

01:07 Maybe you want all of the song lyrics from a specific album but you don’t want to keep clicking around and copy-pasting, so a way to do that would be to automate it and just pull all the information with a script. In this course, we’re going to talk about your job search. There are a bunch of job search aggregator tools out there that help you look for a position, but again, that involves a lot of looking at a little card with the job information, clicking it, reading over it, figuring out whether it even interests you, and so on. And there are some ways that you can automate this job search.

01:42 Now, as a disclaimer, you’re not going to be able to totally automate this job search with the information that you’re going to learn in this course, even though I’m sure there are some people who have already done things like that. But we’re going to use this as an example to learn about web scraping on a more general basis, so that you can apply it to whatever task you’re interested in that involves gathering information from the internet. Now, specifically in this course project, we’re going to talk about the web scraping process and the tools that you can use for web scraping, and this is the main focus.

02:14 I want to introduce you to the different processes and the tools that are important for this, and we’re going to do it by automating the process of gathering some information from a job board. Specifically, it’s going to be indeed.com.

02:27 And then in the end, I’m going to give you some pointers on how you can customize this code for your personal job search and maybe build it out so that it’s actually a useful tool that makes your job search a little bit easier. Okay!

02:40 Now that you have an idea of what web scraping is and why you would want to scrape the web, in the next lesson, we’re going to talk about some of the challenges of web scraping.

02:50 See you there!


Doug Ouverson on Oct. 19, 2020

Hello Martin,

Thank you for a quality course on web scraping.

I have a question.

I noticed that the following code

r = get_job_listings('python', 'new york', 100)

produced this type of output:

    {'title': 'RPA Developer Virtual Hiring Event', 'link': 'https://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0Btxs39KmTzjw_u_hUXcyTcLpNeUj18C2Nw5A7DCW0FWLuE0AJWyt3e1YBIDpJ_q7BwERnbmrP4c6IMGQV9-vTVQAA7CWl_XQz7FjQtAs2sOchSezu5rbcXPCvXqSDHC2mML7imJGSjXxJ1zh9NzmRuSNWs7jhGcYOOeDUYK3N7d6-AyR0976jT9ClmjGUUJcl1iOgXZpxxVwlFQfPQX4AD4sIKMYjYwB1Sqb0FkKBh4uIbPMZ5gWaasURmvt-fx0-p4VG0cZV8vBJOA6yYESXkpkVRJZPXjfXj47cikA1GnIRGir9LiLg3ZRsxH9q7oQJkFHlwsvE4wv3zW1t0Wmf5555h7c7zGodzwvfeyCDOW4IBrKqVtJeNgiIANmpqbAwOSouwfI9ae1MCi8ivF-oftrN_NkUyOLeoU54mczwoxow7k6DP3OibneYew1sxqiFxhkQGMUf6sr4IQaRpwLrk&p=0&fvj=0&vjs=3', 'location': 'Syracuse, NY 13201 (Southwest area)'},

    {'title': 'Software Engineer, Recent Graduate', 'link': 'https://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0A9TGvvMbs_9tzUmxh-6hF76myvpV3Plf7hsqRV14VZ2cfT-rsmf5GVDlSzkmHYntdFJ9AKPL4HCqWJeMQ9CW0zlqqJceB4K4dKwlT9bu-lCJlEx20Me_hwS3sfBstsn3v3O5spw72DLro9f3QncO0VvQR2tI7207UvXDkVqmvlRh1Gi32s4SI0ZY_EHqe3kJDB2ssC5AeltUNHv-kz6PfG2zOS8wFYnBMLhL1MOxdE-bTE6AQYjG-wSBPvr6dzo6jSY8OfVBkJLRnb98PWyUWYk5BaTQsv6E0jBA6Igwx5zgPhCCUe12e-x1nVSoJ4t1TFfSTU0Go590I9RRH1uicai6Xn0D4XGbYg2YGwWF-GKFcGP9dGdZIsYjJ9cZ8vo9U=&p=1&fvj=0&vjs=3', 'location': 'New York, NY 10012 (Little Italy area)'}

Is this JSON? And if so, where in the program did you specify this format?

Kind regards,


Bartosz Zaczyński RP Team on Oct. 20, 2020

@Doug Ouverson Technically, it’s a Python list comprised of dict objects, but you can quickly serialize it to a JSON string using the built-in json module in the following way:

>>> import json
>>> json.dumps(r)
'[{"title": "RPA Developer (...)]'
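To make the distinction concrete, here is a minimal sketch using a shortened stand-in for r. The titles, links, and locations below are placeholders rather than real output. It shows that r is an ordinary list of dict objects, that json.dumps() turns it into a string, and that json.loads() gives back equal Python objects:

```python
import json

# Stand-in for the list returned by get_job_listings(); all values are placeholders.
r = [
    {"title": "RPA Developer", "link": "https://example.com/job/1", "location": "Syracuse, NY"},
    {"title": "Software Engineer", "link": "https://example.com/job/2", "location": "New York, NY"},
]

# The container is a list and each element is a dict.
print(type(r).__name__, type(r[0]).__name__)  # list dict

# json.dumps() serializes the list to a JSON string with double-quoted keys and values.
as_json = json.dumps(r)
print(type(as_json).__name__)  # str

# json.loads() parses the string back into equal Python objects.
restored = json.loads(as_json)
print(restored == r)  # True
```

The single quotes in the printed output are the giveaway: that is Python's own representation of the dictionaries, while JSON strings always use double quotes.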

The reason it’s a list is that you defined one in the get_job_listings() function and then returned it:

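def get_job_listings(title, location, amount=100):
    results = list()
    # Loop over amount // 10 page indices (100 -> pages 0 through 9)
    for page in range(amount // 10):
        site = get_jobs(title, location, page=page)
        # Name the parser explicitly so the result doesn't depend on what is installed
        soup = BeautifulSoup(site.content, "html.parser")
        page_results = parse_info(soup)
        # Extend the flat list with this page's dictionaries
        results += page_results
    return results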
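
A minimal sketch of what the loop does with amount, assuming (as the amount//10 suggests) roughly ten results per page. pages_for() and the fake page data are hypothetical helpers for illustration, not part of the course code:

```python
def pages_for(amount, per_page=10):
    """Hypothetical helper: page indices visited by range(amount // per_page)."""
    return list(range(amount // per_page))

print(pages_for(100))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(pages_for(95))   # floor division drops the partial page: pages 0 through 8
print(pages_for(5))    # [] -> no pages are requested at all

# Accumulating with += keeps one flat list of dicts instead of a list of lists.
results = []
for page in pages_for(30):
    # Fake stand-in for parse_info(soup): two placeholder listings per page.
    page_results = [{"title": f"job {page}-{i}"} for i in range(2)]
    results += page_results

print(len(results))  # 6
```

One consequence worth noting is that an amount below 10 requests no pages, so the function would return an empty list.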

Martin Breuss RP Team on Oct. 20, 2020

Exactly what @Bartosz wrote :)

And if you look a bit more you’ll also see why there are dictionaries inside of that list. That’s defined in the parse_info() function.

It might be a little trickier to see, since the dictionary is created right inside the .append() method call, but you can see the same structure there that you got in your results:

{'title': title, 'link': link, 'location': location}
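
As a rough illustration of that pattern, here is a sketch with a hypothetical parse_info_sketch() function. It takes plain tuples as stand-ins for the parsed job cards instead of real HTML, so only the shape of the .append() call mirrors the course code:

```python
def parse_info_sketch(cards):
    """Hypothetical stand-in for parse_info(): build one dict per job card."""
    results = []
    for title, link, location in cards:
        # The dict literal is created inline, as the argument to .append().
        results.append({'title': title, 'link': link, 'location': location})
    return results

# Placeholder data; a real scraper would extract these values from the page.
cards = [
    ("RPA Developer", "https://example.com/job/1", "Syracuse, NY"),
    ("Software Engineer", "https://example.com/job/2", "New York, NY"),
]

listings = parse_info_sketch(cards)
print(listings[0]["title"])  # RPA Developer
print(len(listings))         # 2
```

Each pass through the loop adds one new dictionary to the list, which is why the final result is a list with one dict per job listing.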

Hope that makes sense and glad you enjoyed the course @Doug Ouverson! :)
