Understanding Your Data
00:00 Now that you have your data locally in an easily accessible file format, you want to understand your data. And this is meant in two senses. On the one hand, you have to understand the data itself.
00:13 What is it that you’re looking at? How will you answer your question with the data you have access to? And also in the world of Python and pandas, what is this object?
00:23 You have the variable data. What does it hold? What’s this thing that looks like a table?
00:29 So go ahead and open your notebook.
00:33 Once you open your notebook, you want to make sure to define the variable data, holding the data that’s in your CSV file. And in case you were not able to produce the CSV file in the previous lesson, you should be able to download it through the video course interface.
00:48 So you should find a download button that gives you access to the CSV file and to these notebooks we are writing together.
00:56 Now that your variable data is defined, you can go ahead and type type(data) so that Python tells you the type of this object.
01:06 And you will see that Python tells you this is a pandas.core.frame.DataFrame. And DataFrame is the key here because DataFrames are the bread and butter of pandas.
01:17 And whenever you are doing data science projects with pandas, you will be using DataFrames a lot.
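As a minimal sketch of that check, the snippet below builds a small stand-in for the course CSV in memory and asks Python for the type of the result. The column names come from this lesson, but the rows are made-up placeholders, and in your notebook you would pass the path of your own CSV file to pd.read_csv instead of the in-memory text.

```python
import io

import pandas as pd

# Placeholder CSV text standing in for the course file. The column names
# match the lesson; the rows are invented for illustration only.
csv_text = """Location,Population,Date
Country A,1000,2024-01-01
Country B,2500,2024-02-01
"""

# In the notebook, this would be pd.read_csv("<path to your CSV file>").
data = pd.read_csv(io.StringIO(csv_text))

# Ask Python what kind of object the variable holds.
print(type(data))  # <class 'pandas.core.frame.DataFrame'>
```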
01:23 So in short, a DataFrame is this thing you are looking at. It looks like a table. It’s a two-dimensional structure that contains columns, and along each column, you find a different variable.
01:35 For example, the column Location has strings with the names of these countries and of these dependencies. The column Population has numbers with the number of inhabitants per country.
01:47 The column Date has dates showing you the date when these figures were last updated. So these are the columns. And one important feature is that in a column, the values all have the same type.
02:00 So the column Population only has integers, the column Location only has strings, et cetera. And along the rows you have observations.
02:10 You have data points. For example, the third row has information about China and its population,
02:18 the fourth row has information about the United States and its population, the fifth row about Indonesia, et cetera. In short, this is what a DataFrame is.
02:28 It’s a way of organizing your data with different variables along the columns and data points, observations, along the rows.
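To make the columns-versus-rows picture concrete, here is a small sketch with a hand-built DataFrame. The values are placeholders rather than the real figures, and the Date column is converted to dates explicitly, which is an assumption: a column read straight from a CSV file may come back as plain strings unless you ask pandas to parse it.

```python
import pandas as pd

# Placeholder values for illustration only, not the real course data.
data = pd.DataFrame(
    {
        "Location": ["Country A", "Country B", "Country C"],
        "Population": [1000, 2500, 400],
        "Date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    }
)

# Each column is one variable and carries a single dtype for all its values.
print(data.dtypes)

# Each row is one observation. iloc selects a row by its position,
# so position 0 is the first row.
first_row = data.iloc[0]
print(first_row["Location"], first_row["Population"])
```

Looking at data.dtypes in your own notebook is a quick way to confirm that each column holds the kind of values you expect.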
02:37 Now, there’s one thing here that would warrant an entire video course, but I just don’t want you to feel like this was not acknowledged. So, why does it look like there are two columns that are duplicated?
02:51 Because in reality, this Unnamed column is a column with integers. This thing on the left isn’t really a column. These are indices. pandas has a very flexible and very powerful idea called an index, and knowing how to work well with indices is very important in pandas.
03:09 The index is not really a regular column. It’s just information about each row. And when you first saved your data to CSV, the index also got saved. But when it got read back, pandas thought it was a regular column and created a new index.
03:25 That’s why this looks like it’s duplicated. In order to remove the duplication, you would’ve had to tweak either the writing of the CSV or the reading of the CSV, but it’s not super important.
03:37 Now, I just wanted to acknowledge it so that you don’t feel like there’s something weird going on.
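If you are curious what that tweak could look like, the sketch below reproduces the duplication on a tiny placeholder DataFrame and shows two possible ways around it: not writing the index at all, or telling pandas on read that the first column is the index. It writes to in-memory text rather than a file so that it runs on its own; the data values are invented.

```python
import io

import pandas as pd

# Placeholder data for illustration only.
data = pd.DataFrame(
    {"Location": ["Country A", "Country B"], "Population": [1000, 2500]}
)

# By default, to_csv also writes the index as a first column with no name.
csv_with_index = data.to_csv()

# Read back naively, that column shows up as a regular "Unnamed: 0" column
# and pandas creates a fresh index next to it.
reread = pd.read_csv(io.StringIO(csv_with_index))
print(list(reread.columns))

# Option 1: tweak the writing so the index is never saved.
without_index = pd.read_csv(io.StringIO(data.to_csv(index=False)))

# Option 2: tweak the reading so the first column becomes the index again.
restored = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

print(list(without_index.columns))
print(list(restored.columns))
```

Either option leaves you with just the Location and Population columns; which one fits best depends on whether the index carries information you want to keep.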
03:43 So that’s what the data is from the point of view of Python. Now, from the point of view of the data itself, you would have to read the Wikipedia page to figure out what they mean by country or by dependency. You can see here you have, for example, India and China and the United States, which you would naturally refer to as countries.
04:01 But then you have things like the Vatican City or the Pitcairn Islands, which are not necessarily countries in the sense of independent states recognized by the whole world.
04:12 So Wikipedia has a specific definition of what’s included in this table, and you would have to make sure that that aligns with what you need. But for the sake of simplicity, we’re just going to assume that everything that’s listed in this table is going to be relevant for the computation that you’ll perform in the next lesson.
04:33 So now that you have a basic idea of what the DataFrame is and what’s contained in your DataFrame, it’s time to grab your data and finally analyze it.
