Calculating the Average Population by Country
00:00 It is now time to calculate the average population. Let’s find the size in number of inhabitants of an average country. And in order to do that, you need to be able to manipulate your data.
00:13 You have your data in a DataFrame, but how do you perform computations on it? And that’s what you’ll learn in this lesson.
00:19 Go ahead and open your notebook
00:23
and make sure you read your data from the CSV file and you have it in your variable data, which you now know is the DataFrame.
00:31 In pandas, the way you usually manipulate your data is by using the different methods that DataFrames offer. Now, there are hundreds of these methods, so it’s not your job to memorize all of them.
00:44 Through practice and through experience, you will start to understand what’s available to you. You’ll start to remember methods you already used and you build up your vocabulary, your pandas vocabulary, through practice and through experimentation and through playing around in different projects.
01:01 For this lesson, we’re just going to use a couple of simple methods that pandas offers. The first thing you can maybe learn is that if you have a DataFrame, you can use square brackets to access a specific column.
01:15
So by typing data["Population"] you have access to the column with just the population numbers. And once you have that, there’s a method that’s very appropriate for what you want to compute, which is called .mean().
01:33
So you can type .mean(), you can open and close the parentheses. Actually, let me go ahead and copy this so we have both. And you go with .mean().
01:44 So you can see the progression of what you’re doing.
01:48 Once you compute the mean, then this is going to go through that column and compute the average of all of those numbers. And your result is roughly 67 million.
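As a minimal sketch of this step, the snippet below builds a tiny stand-in DataFrame instead of reading the lesson's CSV file. The "Country" column and every number in it are made-up assumptions for illustration; only the variable data, the "Population" column, and the .mean() method come from the lesson.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's data; the values are invented,
# not real population figures. Row 0 plays the role of the "World" row.
data = pd.DataFrame(
    {
        "Country": ["World", "A", "B", "C"],
        "Population": [600, 300, 200, 100],
    }
)

# Square brackets select a single column as a pandas Series.
population = data["Population"]

# .mean() averages every value in that column, including the "World" row.
average = population.mean()
print(average)  # 300.0
```

Note that the "World" row is still part of this average, just like in the first calculation in the lesson.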
01:59 So an average country or the average size of a country in terms of inhabitants is 67 million, which I find astounding. I’m Portuguese. My country doesn’t even have 11 million.
02:10 So knowing that on average a country has 67 million, that’s pretty impressive. But hey, if you’re paying attention, you might notice something, there’s something fishy going on.
02:19 And this links back to the lesson about understanding your data. This is a very important thing when you’re working with data. You need to make sure you’re computing the right things with the right data.
02:31
If you look at the column Population, the very first value is over 8 billion. But there’s no country with 8 billion inhabitants. This is the first row that represents the whole world.
02:43 If you scroll up, you can see that the very first row has information about the whole world, which is not a country, it’s all of the countries combined. And this is affecting your calculation.
02:54 So you have to get rid of this value. So how do you get rid of this value? With another method that pandas gives you access to. The first thing you need to do is you need to look at this row and look at its index, which is on the left.
03:06
It’s the 0. So this index is very important because now what you’ll do is you’ll access the column again with ["Population"],
03:18
but now you’re going to use the method .drop() and you’re passing the index of the row you want to drop. And if you check this column, you can see that it now starts with the first country.
03:29
It now starts with India, and it no longer contains the value for the whole world. So now you can compute the mean again. You’ve got .mean() after dropping that row and now you see that the average country turns out to be 33 million strong.
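Here is a small sketch of the same idea, again assuming a made-up stand-in DataFrame rather than the lesson's CSV data. The "Country" column and the numbers are hypothetical; the point is only to show .drop() removing the row with index 0 before .mean() runs.

```python
import pandas as pd

# Hypothetical toy data; row 0 is the aggregate "World" row.
data = pd.DataFrame(
    {
        "Country": ["World", "A", "B", "C"],
        "Population": [600, 300, 200, 100],
    }
)

# Mean over every row, including the aggregate row.
mean_with_world = data["Population"].mean()

# .drop(0) returns a new Series without the row labeled 0.
countries_only = data["Population"].drop(0)
mean_without_world = countries_only.mean()

print(mean_with_world)     # 300.0
print(mean_without_world)  # 200.0
```

.drop() does not change data itself. It returns a new Series, so the original DataFrame still contains the "World" row afterwards.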
03:46 So the average size of a country is not 67 million, but 33 million, which is less than half of what you had before. So now this might make you think, is the mean really a reasonable thing to compute here?
04:00 It was so sensitive to a huge country, which turned out not to be a country. Maybe it’s best to compute the median because the median will tell you that half of the countries in the world have more than that number of inhabitants, and the other half has less than that number of inhabitants.
04:17
And how do you compute the median? Well, it’s the same thing. You can even copy and paste because it really is the same thing. But instead of using the method .mean(), you use the method .median(), and now you get a very different number.
04:30 Now you get five and a half million, which is a much smaller number. So you know that half of the countries in the world have more than five and a half million inhabitants.
04:39 And in hindsight, you can even check that the median was a much more appropriate thing to compute, because if you compute the median after forgetting to get rid of the row that contains the whole world, you’ll see that the result is still very, very similar.
04:55
Still roughly 5.5 million. So this median is much more robust to outliers. And this isn’t really pandas knowledge per se. I mean, it’s useful to know that there are two different methods, one called .median() and another one called .mean().
05:09 But this, again, links back to the lesson about knowing your data. You need to check that you’re computing the right things with the right data.
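To see this robustness on a small scale, the sketch below compares the mean and the median with and without an aggregate row. The data is invented for illustration (one very large value, one large value, and several small ones), and the "Country" column is an assumed name, not something taken from the lesson's CSV file.

```python
import pandas as pd

# Hypothetical toy data: row 0 is the sum of all the other rows,
# playing the role of the "World" row.
data = pd.DataFrame(
    {
        "Country": ["World", "A", "B", "C", "D", "E", "F", "G"],
        "Population": [1070, 1000, 50, 6, 5, 4, 3, 2],
    }
)

with_world = data["Population"]
without_world = data["Population"].drop(0)

# The mean shifts a lot when the aggregate row is removed...
print(with_world.mean())     # 267.5
print(without_world.mean())

# ...while the median barely moves.
print(with_world.median())     # 5.5
print(without_world.median())  # 5.0
```

In this toy example, dropping the aggregate row moves the mean by more than a hundred, while the median only changes from 5.5 to 5.0.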
05:18
Now in front of you, you have three small expressions. They’re all very similar, and they all use square brackets to access columns. Some of them use the method .drop() to drop some rows.
05:29
And then you’re using either the method .mean() to compute an average or the method .median(). And the important thing to realize here is that a very common way to work with DataFrames in pandas is by doing exactly what you’ve been doing so far, this chaining of different methods.
05:45 So you want to chain different methods so that gradually you build the expression that computes the result you care about. You might need to shuffle some things around, reorder some rows or maybe group some rows, do some conversions, compute some auxiliary values.
06:03 And finally, in the end, you get your results. So this is the standard way to work in pandas, by chaining these methods. Good job computing the average and median size of a country by population.
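One way to write such a chain is to wrap it in parentheses and put each method on its own line. This is only a sketch with made-up data: the "Country" column, the numbers, and the extra .astype() conversion step are assumptions added for illustration, not steps from the lesson.

```python
import pandas as pd

# Hypothetical toy data; row 0 is the aggregate "World" row.
data = pd.DataFrame(
    {
        "Country": ["World", "A", "B", "C"],
        "Population": [600, 300, 200, 100],
    }
)

# Each line applies one step to the result of the previous one.
median_population = (
    data["Population"]    # select the column
    .drop(0)              # remove the row with index label 0
    .astype("float64")    # illustrative conversion step
    .median()             # compute the final result
)

print(median_population)  # 200.0
```

Writing the chain this way makes it easy to comment out or reorder a single step while you experiment.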
06:17 And congratulations on making it this far. Now, if this felt like a very short video course, remember that the point of this course was just to whet your appetite, to give you a sense of what it feels like to work in the world of data with pandas.
06:32 In the next lesson, you’re going to review everything you learned to do, and you will also be given more resources to keep exploring, to keep expanding your knowledge of the world of data and pandas itself.
