Decipher the Information in URLs
In this video, we’re going to find out what information is hidden in URLs and how you can extract the pieces that are relevant to you and use them for your web scraping. So, if you think back to the search results that we had before, when you look up here in the browser bar, you can see that there’s this part (the highlighted part here) that seems understandable, and then it turns into some somewhat garbled-looking text, even though you can see python in here, and there’s also something that relates to New York, which seems to relate to the search that we put in there.
00:37 But there’s a lot of somewhat cryptic information, right? So I want you to understand these URLs a bit more because it’s going to be a big asset for your web scraping skills.
00:49 So here, I have this URL that looks pretty similar to what we just saw over there in the browser bar. I’m going to pick it apart because it’s a bit difficult to understand if you look at it just like this.
01:01 We have two main parts that we’re going to look at. On the left side, I grouped all of these parts together as the base URL. You can dive deeper, and this consists of a couple of pieces again—like, you have a protocol here, a subdomain, then a domain, and then a part of a path as well.
01:18 Usually, what you’re concerned with as a normal user is the domain. That’s what you’re going to type into the browser bar to get to the page, and then usually, you navigate using the visual interface of the page.
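The pieces of the base URL can also be picked apart in code. Here’s a minimal sketch using Python’s standard library, assuming a made-up URL on example.com (the host and path are placeholders, not the site from the video). Splitting the host at the first dot is a naive shortcut that only works for simple hostnames like this one.

```python
from urllib.parse import urlsplit

# Hypothetical URL for illustration; not the actual site from the video.
url = "https://www.example.com/jobs?q=python&l=new+york"

parts = urlsplit(url)

protocol = parts.scheme   # the part before "://"
host = parts.hostname     # subdomain and domain together
path = parts.path         # the part after the host, before the "?"

# Naive split: treat everything before the first dot as the subdomain.
subdomain, _, domain = host.partition(".")

# The base URL is everything in front of the question mark.
base_url = f"{protocol}://{host}{path}"

print(protocol, subdomain, domain, path)
print(base_url)
```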
But for everything that concerns us here, this is the base URL, and then we have something interesting happening here on the right side that is different. And it all starts off with this question mark (?).
01:42 You can think of this as a question, essentially, that you’re asking to the web app that sits over there on the server. It’s a query that you’re sending over there to get some response.
01:53 And that’s why this part is called the query parameters. This is still a little confusing, so I want to pick it apart for you and talk about each of the different pieces in here one by one. At first, we have the question mark, which starts the whole query. It consists of two query parameters, this specific query.
We have the first parameter that says q=python, and then we have the second one that says l=new+york. These two parameters are separated from each other with an ampersand symbol (&).
So again, we start the query with a question mark, then we have parameters that are in the format of a key-value pair, essentially, when you think of it in Python terms. Like, you have here the key, then you have an equal sign (=), and then you have the value to it.
02:41 These belong together as a parameter, then you have a separator, and then another parameter. And this can keep going, so you can have another separator, another query parameter, et cetera, et cetera.
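To make that structure concrete, here’s a small sketch that takes a URL apart by hand, using exactly the three symbols just described. The example.com URL is a made-up placeholder; only the q and l parameters come from the lesson.

```python
# Hypothetical URL for illustration; not the actual site from the video.
url = "https://www.example.com/jobs?q=python&l=new+york"

# The question mark separates the base URL from the query.
base_url, _, query = url.partition("?")

params = {}
# Ampersands separate the individual parameters.
for parameter in query.split("&"):
    # An equal sign separates each key from its value.
    key, _, value = parameter.partition("=")
    params[key] = value

print(base_url)  # https://www.example.com/jobs
print(params)    # {'q': 'python', 'l': 'new+york'}
```

In practice, you probably wouldn’t split the string by hand. The parse_qs() function in Python’s urllib.parse module does the same job and also decodes the + into a space.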
02:52 That’s how the information for the question that you’re sending to the server is encoded in the URL. Now, let’s head back to the browser and look at this again.
So up here, again, we have this base URL and then here, the query parameters. You can see we’re asking for q=python, so it seems like q is the key for the question that you’re asking (and that’s a pretty common one, this q), and the value is python, which relates to this “What are you looking for” field. Then we have the ampersand (&), the divider between the different query parameters, and then an l here (it’s a bit difficult to see that it’s an l, but this is L for location), and then l=new+york, which is the location.
03:32 And then here’s another query parameter, and I don’t actually know what this is about, but it’s probably related to which of the boxes are clicked. Let’s see—if I click this, you can see that this hash here changes while the rest stays the same.
03:47 So, this third part here is another query parameter and that just relates to which of the boxes are currently clicked. I can get rid of this and reload it, and we’re going to be at the top of the page and the first one automatically selects. Okay. So, you can play around with this somewhat.
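Getting rid of a parameter like that can also be done in code. This is only a sketch under assumptions: the parameter name selected and its value are invented stand-ins for the hash-like third parameter, and the example.com URL is a placeholder.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical URL: "selected" and its value are made-up stand-ins for the
# third, hash-like parameter seen in the video.
url = "https://www.example.com/jobs?q=python&l=new+york&selected=a1b2c3d4"

parts = urlsplit(url)

# parse_qsl() returns (key, value) pairs and decodes "+" to a space.
pairs = parse_qsl(parts.query)

# Keep every parameter except the one that tracks the selected box.
kept = [(key, value) for key, value in pairs if key != "selected"]

# urlencode() turns the pairs back into a query string.
cleaned_url = urlunsplit(parts._replace(query=urlencode(kept)))

print(cleaned_url)  # https://www.example.com/jobs?q=python&l=new+york
```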
04:05 You can go ahead and say, “What if I changed this?” So, I’m going to look not for Python jobs, but let’s say I’m looking for “software developer”…
and you can see that up here, the query parameter changes. The q does not point to python anymore, but it points to the new search term, and l=new+york is still the same. If I change this, it’s going to be something else as well. Now, what happens if you just change it up here? I could go in here and say, “I am adventurous, and I’m looking for a…”
04:41 And you can see that the website changes as well! So, whether you type your query in here or up there in the query parameters, it’s the same for the page, because all you’re doing is sending this question to the server, and then the server returns these job postings as a response.
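Since the search box and the URL are just two ways of sending the same question, you can also build the URL yourself. Here’s a rough sketch, assuming a placeholder base URL on example.com and a hypothetical helper named build_search_url(); only the q and l keys come from the lesson.

```python
from urllib.parse import urlencode

# Hypothetical base URL for illustration; not the actual site from the video.
BASE_URL = "https://www.example.com/jobs"


def build_search_url(what, where):
    """Return a search URL with q and l query parameters."""
    # urlencode() joins key=value pairs with "&" and encodes spaces as "+".
    query = urlencode({"q": what, "l": where})
    return f"{BASE_URL}?{query}"


print(build_search_url("python", "new york"))
print(build_search_url("software developer", "new york"))
```

Changing the arguments here plays the same role as typing a different search into the page or editing the browser bar by hand.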
04:58 And that is how the query parameters often relate to information that you can enter manually on the page, and not only by typing something specific, as happened over here, because it can also be a click. You see, like, the third query parameter that we have up here: it’s not something that I type anywhere, but the user input that I’m giving is just clicking on one of the boxes.
05:20 This is how this changes. And you can also see that since this is a hash, it’s not really something that you as a user should be concerned about. And as a matter of fact, usually, when you just use a website normally you don’t deal with these query parameters directly, but you deal with the user interface. But being aware of this connection—that when I change something in here…
05:42 and submit these changes, then the query parameters up here change—is a really important connection to make in your mind because this is going to help you a lot when you’re scraping the information from the page, because the way that you’re going to be able to interact with the server is through queries that you’re sending in the URL. Okay.
So, keep this connection in mind and understand that you have this question mark (?) that starts off the query parameters, then key-value type pairs of information separated by ampersands (&), and that this is a way that you can send information to the server and receive responses back from there. And that’s all for this lesson.
06:22 See you in the next one, when we dive even deeper and look at this page with our developer tools.