Your Turn: Build a Pipeline
00:00 Congratulations on making it this far in the course. You’ve now gone over the basics of the web scraping process: you start by inspecting the page, then you scrape the content, and then you parse that content, picking out the information that you want.
00:17 This is a process that you will repeat for every website that you’re working with and every website that you want to scrape. So, keep in mind that these are the high-level steps that you need to take, and that the specific code that you need to write, and obviously the inspecting and all of this, is going to be individual for each of the websites that you’re working with. Now, in order to give you some practice, you’re going to look some more at the indeed.com website, and I’ve assembled a couple of tasks for you which will help you to practice your web scraping skills some more. Now, in this part, I want you to combine your knowledge about the site, the requests library, and Beautiful Soup that you’ve gone over in the past parts of this video course, and then tackle a couple of tasks.
01:00 So, the idea is that I want you to automate the scraping process across multiple results pages. We’re going to look at this in a second, but so far you’ve only looked at the 10 or 15 results on the first page of the search, and there are more than that.
01:15 You’re going to have to figure out how you can access multiple pages and get the results from all of those pages. Then, I want you to generalize your code for varying searches.
01:24 Like, you want your code to be able to search not only for Python in New York, but maybe you want to also search for Go in Berlin, or whatever—just different locations and different search terms.
01:37 So, you want to write some functions to generalize the code and allow for different inputs. And then finally, in the parsing part, I want you to be able to target specific pieces of information—and I have some suggestions for which ones that could be—and then save that specific information out to a file so that you can also use it and maybe work forward with it.
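As a rough sketch of what "functions that allow for different inputs" could look like, the snippet below builds a search URL from a search term and a location. The base URL, the parameter names q, l, and start, and the function name build_search_url are all assumptions made for illustration; check what actually appears in your browser's address bar when you run a search before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL and query parameter names. Confirm them by inspecting
# the address bar after running a search on the site.
BASE_URL = "https://www.indeed.com/jobs"


def build_search_url(query, location, start=0):
    """Return a search URL for a search term, a location, and a result offset."""
    params = {"q": query, "l": location}
    if start:
        # Only add the offset when asking for a later results page.
        params["start"] = start
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("python", "new york"))
print(build_search_url("go", "berlin", start=10))
```

Keeping the URL construction in one small function means the rest of the script never has to know how the search term and location end up in the address.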
01:56 Now, let’s look at this in the example start Notebook that I have for you.
02:03 And here we are in the Notebook. So again, you have the high-level tasks written out here, and then more specifically, the tasks that I’m asking you to do. Now, I want you to scrape the first 100 available search results. Let’s look at this some more. Here are the search results. I think we counted them at some point before, and there are 15 of those on the page, but that’s not all. There’s another page, so you can click forward and get more search results, because there weren’t just 15 results, but 3,414. Right?
02:34 So, you want to be able to not only collect the information from the first page, but also from the second page, et cetera.
02:41 You’re going to have to write some code to be able to do that, and I have a couple of hints for you in the Inspect part of what you can look at in order to figure out how to do that, and I have some questions that can help you get on the right track down here in the Inspect part.
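One common way sites handle paging is an offset in the URL that grows by a fixed step per page. Whether this site works that way, and what the step is, is exactly what the Inspect questions ask you to find out, so treat the page size of 10 and the names result_offsets, collect_pages, and fake_fetch below as hypothetical. The sketch takes the fetching step as a function argument so that it runs without any network access.

```python
def result_offsets(max_results=100, page_size=10):
    """Return the offset of the first result on each page to request."""
    return list(range(0, max_results, page_size))


def collect_pages(fetch_page, max_results=100, page_size=10):
    """Call fetch_page(offset) per page, stopping at the first empty page."""
    pages = []
    for offset in result_offsets(max_results, page_size):
        html = fetch_page(offset)
        if not html:
            break
        pages.append(html)
    return pages


# Stand-in fetcher so the sketch runs offline. In the exercise, this would
# wrap a requests.get() call with the offset placed in the URL.
def fake_fetch(offset):
    return f"<html>results starting at {offset}</html>"


pages = collect_pages(fake_fetch, max_results=100, page_size=10)
print(len(pages))
```

Passing the fetcher in as an argument also makes it easy to swap in a version that pauses between requests or reads saved pages from disk.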
02:56 Then, once you know how to get the 100 available search results, I want you to also generalize the code so that you’re able to search for different locations and for different jobs.
03:07 Then, I want you to pick out specific information, which is the URL for applying to the job, the job title, and the job location. Finally, save the results of your search to a file. Now, this Notebook gives you a start for that.
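For the saving step, one option among several is a CSV file written with the standard library's csv module. The records, the column names, and the function name write_jobs below are made up for illustration; the sketch writes to an in-memory buffer so that it runs anywhere.

```python
import csv
import io

# Hypothetical records shaped like the three fields the task asks for.
jobs = [
    {"title": "Python Developer", "location": "New York, NY", "url": "https://example.com/job/1"},
    {"title": "Go Engineer", "location": "Berlin", "url": "https://example.com/job/2"},
]


def write_jobs(jobs, file_obj):
    """Write job records as CSV rows with a header line."""
    writer = csv.DictWriter(file_obj, fieldnames=["title", "location", "url"])
    writer.writeheader()
    writer.writerows(jobs)


# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_jobs(jobs, buffer)
print(buffer.getvalue())
```

To write an actual file, you could open one with open("jobs.csv", "w", newline="", encoding="utf-8") and pass that file object to write_jobs() in place of the buffer.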
03:21 It mainly just contains the questions that can get you on the right track, and then you can just go in here and start typing your code.
03:32 Make sure that you keep in mind the process that we talked about in this course. You want to inspect the page. Go over here, see what happens when you move to a different search result page.
03:45 So, what changes in the URL when you click that? Inspect. Use your developer tools to figure out what are the specific elements that you want. What is the id, for example, or maybe a class name that defines where the location is noted on the page?
04:02 Et cetera. So, just keep in mind that you have these tools in your tool belt now. You know how to inspect the page, then you know how to use requests to scrape it, and then you know some of the methods that Beautiful Soup provides to pick out specific pieces of information. Go through this process, keeping in mind these specific tasks, and try to tackle them to get a lot of additional training for doing web scraping and to create a script that actually does something that collects information that might be relevant for you, or maybe one of your friends who might be looking for a job. Okay.
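To make the parsing step concrete, here is a minimal sketch that pulls a title, a location, and a link out of a small piece of HTML with Beautiful Soup. The markup, the class names job-card, job-title, and job-location, and the function name parse_jobs are invented for this example; the real page uses its own tags, ids, and class names, which you find with your developer tools.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only. The real page is structured
# differently, so the selectors below will need to change.
SAMPLE_HTML = """
<div id="results">
  <div class="job-card">
    <a class="job-title" href="/apply/123">Python Developer</a>
    <span class="job-location">New York, NY</span>
  </div>
  <div class="job-card">
    <a class="job-title" href="/apply/456">Go Engineer</a>
    <span class="job-location">Berlin</span>
  </div>
</div>
"""


def parse_jobs(html):
    """Return a list of dicts with the title, location, and URL of each job."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.find_all("div", class_="job-card"):
        link = card.find("a", class_="job-title")
        location = card.find("span", class_="job-location")
        if link is None or location is None:
            # Skip cards that are missing one of the expected elements.
            continue
        jobs.append(
            {
                "title": link.get_text(strip=True),
                "location": location.get_text(strip=True),
                "url": link["href"],
            }
        )
    return jobs


for job in parse_jobs(SAMPLE_HTML):
    print(job)
```

Checking for None before reading an element is worth keeping in a real script, since result cards on a live page do not always contain every field.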
04:35 So, there is a solution document for these tasks, but I suggest you go for it by yourself, try it out, and see what you can do. You can always compare with the solution document, but you know, the process of learning really is about trying to figure stuff out by yourself.
04:51 Make sure that you check out the Beautiful Soup documentation and the requests documentation. Get familiar with reading the docs, which will also help you down the line, and make this project your own! Look at what would be interesting for you to scrape from this job page and really target the pipeline that you build out to that specific wish of yours. Now, before I let you go, let’s go to the final video in this course, where we will do a full course recap and summary because, you know, it’s all about the iterative nature of working on things. So let’s do a quick brush over everything that you went over in this course and everything that you learned. See you over there, in the final video.
Bartosz Zaczyński RP Team on May 13, 2021
@KatMac For some reason, Visual Studio Code doesn’t recognize your project root folder as the source folder and incorrectly marks all references to the names imported from imain_config.py as problems to fix. However, the suggested change is doing more harm than good.
It’s hard to tell precisely why without seeing the full code. For what it’s worth, I’d probably ignore the false positive warnings.
KatMac on May 13, 2021
I found out what the problem was: VSC updated and Pylance was installed. Once I disabled that, everything was good again.
Linds787 on Sept. 19, 2024
Hi there! Sorry if it is obvious, but where is the solution document for this exercise? Thanks!
KatMac on May 12, 2021
I am in the process of adding additional features to the Indeed scrape project. I decided to split the original file into 2. My two files are:
imain.py
imain_config.py
I have saved these 2 files to the following folder on my computer:
c://python-train/indeed
In the imain.py file, I added in the following line at the top:
I am using VSC. When I run the imain.py file, everything works as expected, but in the imain.py tab at the top far right I can see the number 9 displaying (which means I have 9 issues to resolve).
In the VSC editor, it indicated that I needed to add in the following line instead of the one I used above:
When I made this change to the imain.py file, all my 9 issues were resolved, but I received this error when I ran imain.py:
My code no longer runs.
Have no idea how to fix this one!