Resource mentioned in this lesson: List of countries and dependencies by population
Grabbing Your Data
00:00 Now that you have everything installed, it’s time to get cracking. So the first thing you need is some data. For that, you’re going to open the Wikipedia page that has a list of countries and dependencies by population.
00:12 And the link for that page can be found under this video, so you don’t have to type this out by hand. Once you open the link, you should see a page that looks like this.
00:23 Now, if you scroll down, you’ll find a table listing countries and dependencies along with their populations. And what you’re going to do in this project is figure out the average size of a country.
00:38 So you can see there are countries here with more than 1 billion inhabitants. And if you keep scrolling down, you’ll see that number drop lower and lower until you get to countries that have fewer than 1 million inhabitants.
00:50 So what’s the average size of a country? That’s what you’re going to figure out. And for that, you’re going to do two steps: you’re going to download that page, and you’re going to turn it into a format that pandas can use, so that pandas can extract the table with the data from that Wikipedia page.
01:09 So go ahead and open your terminal,
01:12 and once you’re in your terminal, run the command jupyter notebook, that is, jupyter, space, notebook, so you can start the notebook interface.
01:23 Once you run it, you’ll see some output, and it should open a tab in your default browser.
01:37 Once you are in the Jupyter Notebook interface, go ahead and click File, New Notebook, so that you create a new notebook in which you can work.
01:48 And now that you’re in your notebook, you’re going to grab the data from the internet. And for that, you’re going to use the urllib module from the Python standard library.
01:58 In order to be able to fetch the data, you’re going to grab the URL of the Wikipedia page and save it in a url variable. And then you’re going to create a request object.
02:09 And this request object represents the page that you want to grab, plus some information you’re going to pass to the Wikipedia server. In this case, you just need to set the user agent to something, and an empty string will do.
02:25 And once you have your request object, you’re going to get your response by calling urllib.request.urlopen() and passing it the request object.
02:34 So this represents the step of getting data. Every data project starts with getting some data. It might come from an API, from a database, or, as in this case, from the internet.
02:45 So you’re going to run this, and then you’re going to print the response code, which you’re hoping is 200, the HTTP status code for a successful request. Once you have the response, you’re going to use pandas to turn it into something pandas can work with.
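The download steps described above can be sketched like this. The URL and the empty User-Agent header follow the lesson; the sketch assumes you have a working network connection when you run the last two lines:

```python
from urllib.request import Request, urlopen

# The Wikipedia page linked below the video.
url = (
    "https://en.wikipedia.org/wiki/"
    "List_of_countries_and_dependencies_by_population"
)

# Wikipedia rejects requests without a User-Agent header,
# so set it explicitly; an empty string is enough here.
request = Request(url, headers={"User-Agent": ""})

# Fetch the page; 200 means the request succeeded.
response = urlopen(request)
print(response.status)
```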
02:58 So we’re going to go ahead and import pandas. And the first rule of pandas is that you abbreviate it as pd. Now, when I say rule, I really just mean convention, since you’re going to type pandas so much that it ends up being worth abbreviating it as pd.
03:16 But it’s just a convention that you will find when you read other people’s code. And once you have it imported, you’re going to create a variable called tables, which comes from reading the HTML.
03:28 So pd.read_html(), and then you pass it response.read().
03:34 And now you run this. And you can’t see it, but in this step, you used lxml, the extra dependency you installed previously. Without it, read_html() doesn’t work. Now if you print tables, you’ll see it’s a Python list of pandas DataFrames. pandas went through the Wikipedia page and grabbed all of the tables it could find.
03:56 And in your project, you just want the very first one, so you can say data = tables[0].
04:04 And now you can take a look at data.
04:07 And as you can see, the very first rows coincide with what you saw on the Wikipedia page.
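To preview where the project is heading, here is a sketch of inspecting the data and computing the average. The DataFrame below is a tiny made-up sample; on the real page, check data.head() first, because the actual column names may differ from "Population":

```python
import pandas as pd

# A tiny stand-in for the table pandas extracted from the page.
data = pd.DataFrame(
    {
        "Location": ["India", "China", "Tuvalu"],
        "Population": [1_417_173_173, 1_412_175_000, 11_312],
    }
)

print(data.head())                # the first rows of the table
print(data["Population"].mean())  # the project's goal: the average
```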
04:13 So this was the step of grabbing your data. In case you’re having any issues requesting this page from the internet or extracting the data from it, in the next lesson, you’re going to learn how to work with the data locally.
04:28 So you will also have access to this data as a file you can download from the Real Python interface in case you’re having any issues with the Wikipedia page.
