Find Elements by ID

Web Scraping With Beautiful Soup and Python Martin Breuss 02:53

00:00 Now to the actual step of finding an element by ID. Let’s head back over to the Notebook.

00:07 So, we’ve seen before that the response is essentially all of this content, and we haven’t sifted through anything yet. The first step now is to find an element by ID. Now, what was the ID? Remember that this is an iterative process: you explore what information you want to get from the site’s content, and you figure out where it’s located.

00:32 And this is where your knowledge of the page comes in very handy.

00:36 This is why you explored it before. You already know that there’s some sort of column here that contains all that information, and this has an ID, and that’s exactly what we’re looking for now.

00:48 Keep in mind that this is always available to you. You can keep coming back to your browser and just inspect using the developer tools to figure out what is the exact thing that you’re looking for. Right?

00:58 So, we looked for it before. Let me see if I can find it again like this. Yeah. So, remember? We found this <td> with the id="resultsCol" (results column).

01:11 So, this is going to be the column that contains all the information that you’re currently looking for, and this is the ID that this <td> HTML element has.

01:20 So, I can go ahead and just say, okay, "resultsCol" is what I want to pick out. On the soup object, you have a method called .find() that accepts an id keyword argument set to the value of the ID. So, in this case, this is id='resultsCol', and you can just run this code.
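A minimal sketch of that call, assuming a small hand-written HTML string in place of the real response. The markup, the `auxCol` ID, and the variable name `html` are made up for illustration; only the `resultsCol` ID and the name `results` come from the lesson.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page content fetched earlier in the course
html = """
<html>
  <head><title>Job Search</title></head>
  <body>
    <table><tr>
      <td id="resultsCol"><div class="job">Python Developer</div></td>
      <td id="auxCol">Sidebar</td>
    </tr></table>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Pass the ID as a keyword argument; .find() returns the first matching element
results = soup.find(id="resultsCol")

print(results.name)                  # tag name of the matched element
print(results.get_text(strip=True))  # text inside the matched element
```

Because an ID is meant to be unique on a page, `.find()` is usually enough here, and there is no need to collect several matches.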

01:39 And then the result is no longer the full HTML from before. If you look at that earlier output, it has everything: the <head>, and a bunch of JavaScript that’s included here as well.

01:51 So, this has all of the content of the page,

01:54 but here we sifted through it and picked out only the <td> element with the id "resultsCol" and everything that it contains.

02:03 So this is similar to here, when I expand this—everything that’s inside of this HTML element is what we now have access to under the name results.
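To make the difference between the full page and the selected element concrete, here is a rough sketch. It again assumes invented markup (the script, the job titles, and the `job` class are hypothetical) and only checks what is reachable from each object.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a <head> with a script, plus the results column
html = """<html>
<head><title>Job Search</title><script>var tracking = 1;</script></head>
<body><table><tr>
<td id="resultsCol"><div class="job">Python Developer</div><div class="job">Data Engineer</div></td>
</tr></table></body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
results = soup.find(id="resultsCol")

# The full document still contains the <head> and its JavaScript
print(soup.find("script") is not None)

# The element picked out by ID only gives access to its own descendants
print(results.find("script") is None)
print(len(results.find_all("div")))
```

Searching on `results` instead of `soup` is what makes it possible to drill down further inside the container later on.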

02:16 So, go ahead and try this out some more. Look for another element in here—here’s one, for example, with an id—and just locate the element by ID. Practice this a bit.

02:30 You can make additional cells over in the Jupyter Notebook here and just run some code so that you understand what the syntax is to find an element by ID.
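For practice, a cell along these lines could be a starting point. It is only a sketch: the `searchCount` and `doesNotExist` IDs and the markup are assumptions, not taken from the real page. It also shows two alternative spellings of the same lookup and what comes back when no element has the requested ID.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a second element that carries an ID
html = """
<body>
  <div id="searchCount">Page 1 of 20 jobs</div>
  <div id="footer">Footer</div>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# Three ways to locate the same element by its ID
count = soup.find(id="searchCount")
count_by_attrs = soup.find(attrs={"id": "searchCount"})
count_by_css = soup.select_one("#searchCount")

# .find() returns None when nothing matches, so check before using the result
missing = soup.find(id="doesNotExist")
if missing is None:
    print("No element with that ID")

print(count.get_text())
```

Checking for `None` early avoids an `AttributeError` further down when the page does not contain the element you expected.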

02:41 And since this is still pretty big, we want to drill down a bit further inside of this results container. We’re going to do that by class name in the next lesson.

iqbalamo93 on March 27, 2021

Hi Martin, I am trying your code. It’s not working. Is it that this URL/site is now dynamic?

Bartosz Zaczyński RP Team on March 29, 2021

@iqbalamo93 Can you define “not working” more specifically or reveal your error message if there is one, please? Martin’s sample code, which is attached to this video course, is working flawlessly for me. It fetches data from the indeed.com website:

{'title': 'Strategic Initiatives Analyst',
 'link': 'https://www.indeed.com/rc/clk?jk=2ab0ea1ac8e42854&fccid=1a6403f5a8617d71&vjs=3',
 'location': 'New York State'}

However, the written version of this tutorial relies on a different website, which apparently changed its layout since the tutorial was written. Therefore, it might not work anymore, and you may need to tweak your web scraping code.

iqbalamo93 on March 29, 2021

Hi Bartosz, thanks for your reply. Apologies for the question being vague; I tried to edit it but missed the chance. Yes, it was indeed the written version that I was referring to. I thought they had moved to a dynamic structure, but as you pointed out, it’s the layout that has been modified and the code needs tweaking. I’ll work on that.

Thanks Again,

RedRegal on Jan. 22, 2023

Hi, I tried running this code with a different URL (https://uk.indeed.com/jobs?q=python&l=London). The HTML that’s returned doesn’t represent the layout of the site.

<html lang="en-US">
<head>
<title>Access denied</title>

I tried again with the URL in the example, same site, same issue. Going down the rabbit hole of solutions and everything points to Selenium. Is this a case of the site being updated with new restrictions?
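A rough sketch of how a script could notice this situation before trying to parse results. It assumes a hand-completed version of the truncated response quoted above (the closing tags and empty body are added for illustration) and reuses the `resultsCol` ID from the lesson.

```python
from bs4 import BeautifulSoup

# The fragment quoted above, closed off by hand so it forms a complete page
html = """<html lang="en-US">
<head>
<title>Access denied</title>
</head>
<body></body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Read the page title if there is one, and look for the expected container
title = soup.title.get_text(strip=True) if soup.title else ""
results = soup.find(id="resultsCol")

if results is None:
    print(f"No results container found; page title is {title!r}")
```

A missing container together with an unexpected title suggests that the server returned a block page rather than the job listings.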

Bartosz Zaczyński RP Team on Jan. 23, 2023

@RedRegal Unfortunately, indeed.com implemented new measures to prevent scripts like that from scraping its valuable content. Thanks for flagging it! We’ll look into possible workarounds. In the meantime, you can try a different website.

Martin Breuss RP Team on Jan. 24, 2023

@RedRegal Yes, unfortunately, as Bartosz mentioned, scraping the content as shown in the course doesn’t work anymore on indeed.com as of 2023.

I’ve recorded an update that explains the situation, but we haven’t yet published it on the course.

However, the take-away of that lesson is that you can still apply all the conceptual learning from this course on a different website (as long as it’s static).

My suggestion is to practice with the Fake Python Job Board that I built as a safety for the written tutorial and that we’re self-hosting, so that changing site structures won’t end up presenting a problem again.

Read more about it in the written tutorial. Hope that makes sense and is helpful, happy to discuss it further :)

RedRegal on Jan. 24, 2023

Thanks Bartosz Zaczyński, I got it working with Selenium WebDriver:

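from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object
service = Service('/Applications/chromedriver_mac64/chromedriver')
browser = webdriver.Chrome(service=service)
browser.delete_all_cookies()
browser.get("https://uk.indeed.com/jobs?q=python&l=London")
# Note: this dict is never passed to Selenium; the browser sends its own User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}

# Parse the rendered page source, then look up the results container by ID
soup = BeautifulSoup(browser.page_source, "html.parser")
result = soup.find(id="mosaic-jobResults")

print(result)
browser.quit()  # close the browser window when done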
