Resource mentioned in this lesson: List of countries and dependencies by population
Grabbing Your Data
00:00 Now that you have everything installed, it’s time to get cracking. So the first thing you need is some data. For that, you’re going to open the Wikipedia page that has a list of countries and dependencies by population.
00:12 And the link for that page can be found under this video, so you don’t have to type this out by hand. Once you open the link, you should see a page that looks like this.
00:23 Now, if you scroll down, you’ll find a table listing countries and dependencies along with their populations. And what you’re going to do in this project is figure out the average size of a country.
00:38 So you can see there are countries here with more than 1 billion inhabitants. And if you keep scrolling down, you’ll see that number drop lower and lower until you get to countries that have fewer than 1 million inhabitants.
00:50 So what’s the average size of a country? That’s what you’re going to figure out. And for that, you’re going to do two steps: you’re going to download that page, and you’re going to turn it into a format that pandas can use, so that pandas can extract the table with the data from that Wikipedia page.
01:09 So go ahead and open your terminal,
01:12 and once you’re in your terminal, run the command jupyter notebook, that is, jupyter, space, notebook, so you can start the notebook interface.
01:23 Once you run it, you’ll see some output, and it should open a tab in your default browser.
01:37 Once you are in the Jupyter Notebook interface, go ahead and click File, New Notebook, so that you create a new notebook in which you can work.
01:48 And now that you’re in your notebook, you’re going to grab the data from the internet. And for that, you’re going to use the urllib module from the Python standard library.
01:58 In order to be able to fetch the data, you’re going to grab the URL of the Wikipedia page and save it in a url variable. And then you’re going to create a request object.
02:09 And this request object represents the page that you want to grab, plus some information you’re going to pass to the Wikipedia server. In this case, you just need to set the user agent to something, and an empty string will do.
02:25 And once you have your request object, you’re going to get your response by calling urllib.request.urlopen() and passing it the request object.
02:34 So this represents the step of getting data. Every data project starts with getting some data. It might come from an API, from a database, or, as in this case, from the internet.
02:45 So you’re going to run this, and then you’re going to print the response code, which you’re hoping is 200, the HTTP status code for a successful request. Once you have the response, you’re going to use pandas to turn it into something pandas can work with.
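The download steps described above can be sketched like this. The URL and the empty User-Agent header follow the lesson; the sketch assumes you have a working network connection when you run the last two lines:

```python
from urllib.request import Request, urlopen

# The Wikipedia page linked below the video.
url = (
    "https://en.wikipedia.org/wiki/"
    "List_of_countries_and_dependencies_by_population"
)

# Wikipedia rejects requests without a User-Agent header,
# so set it explicitly; an empty string is enough here.
request = Request(url, headers={"User-Agent": ""})

# Fetch the page; 200 means the request succeeded.
response = urlopen(request)
print(response.status)
```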
02:58 So we’re going to go ahead and import pandas. And the first rule of pandas is that you abbreviate it as pd. Now, when I say rule, I really just mean convention, since you’re going to type pandas so much that it ends up being worth abbreviating it as pd.
03:16 But it’s just a convention that you will find when you read other people’s code. And once you have it imported, you’re going to create a variable called tables, which comes from reading the HTML.
03:28 So pd.read_html(), and then you pass it response.read().
03:34 And now you run this. And you can’t see it, but in this step, you used lxml, the extra dependency you installed previously. Without it, read_html() doesn’t work. Now if you print tables, you’ll see it’s a Python list of pandas DataFrames. pandas went through the Wikipedia page and grabbed all of the tables it could find.
03:56 And in your project, you just want the very first one, so you can say data = tables[0].
04:04 And now you can take a look at data.
04:07 And as you can see, the very first rows coincide with what you saw on the Wikipedia page.
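To preview where the project is heading, here is a sketch of inspecting the data and computing the average. The DataFrame below is a tiny made-up sample; on the real page, check data.head() first, because the actual column names may differ from "Population":

```python
import pandas as pd

# A tiny stand-in for the table pandas extracted from the page.
data = pd.DataFrame(
    {
        "Location": ["India", "China", "Tuvalu"],
        "Population": [1_417_173_173, 1_412_175_000, 11_312],
    }
)

print(data.head())                # the first rows of the table
print(data["Population"].mean())  # the project's goal: the average
```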
04:13 So this was the step of grabbing your data. In case you’re having any issues requesting this page from the internet or extracting the data from it, in the next lesson, you’re going to learn how to work with the data locally.
04:28 So you will also have access to this data as a file you can download from the Real Python interface in case you’re having any issues with the Wikipedia page.
