Getting to Know pandas DataFrames
If you’re following along with this lesson and not using the provided Jupyter Notebook from this course’s supporting materials, you can copy-paste the following data dictionary:
data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}
For more information about pandas DataFrames, take a look at Intro to data structures: DataFrame in the pandas documentation.
00:00 In this lesson, we’ll do a quick overview of creating a pandas DataFrame and how to access rows and columns in the DataFrame.
00:09 A DataFrame is a data structure used to represent tabular data. A DataFrame is going to be one of the main data structures that you’re going to be working with in pandas. As a running example that we’ll use in this video course, we’re going to use this table that contains information about job candidates for a position that we want to fill for a Python developer.
00:33 This is the same tabular data that we used in the overview lesson. We’ve got the name of the candidate, the city that they’re from, their age, and then the score in their Python test. At the very top, we’ve got what are called column labels, or field labels, and then at the very left, we’ve got what are called the row labels, or the index, of the table.
00:58 Then, at the intersection of a row and a column, we’ve got a cell, or the data.
01:06 Let’s head over to Jupyter and create our first DataFrame that will contain these three pieces of information: the row labels, the column labels, and the actual data.
01:18 Back here in Jupyter, what we’re going to do is first create a little bit of Markdown just as a way for us to create a Notebook that will contain both titles and headings and information about what we’re doing in the code—sort of as a way to provide comments, as well.
01:36 When you create a new Jupyter Notebook, the type of cell that you’re going to be presented with is a code cell to write Python code. I’m going to convert to a Markdown cell.
01:48 I’m going to hit Escape, and that takes me out from edit mode to command mode, and you notice that the cell became blue. And then from here, I’m going to hit M, and that converts into a Markdown cell.
02:03 And either by pressing Enter or moving the cursor to the cell, I can now write in my Markdown. If you’re not familiar with Markdown, it’s pretty straightforward. Let’s suppose we wanted to make a heading.
02:15 I’m going to make a hash symbol (#), and then maybe we’ll call this, say, # Introducing the pandas DataFrame.
02:25 Shift + Enter executes that cell and then gives you a new cell, and the default new cell is going to be a code cell. Let’s go ahead and import pandas.
02:36 The widely used alias for pandas is pd. Let’s go ahead and Shift + Enter. Now, there are many ways to create a DataFrame in pandas.
02:46 We’re going to use a dictionary, and the keys are going to serve as the column labels and the values are going to serve as the data in each of the columns.
02:57 I’m going to copy-paste the dictionary that I’m going to call data, which contains, as the keys, the names of the columns that I want to use, and then the values are going to be the data for that column.
03:11 You can go ahead and type this dictionary out, or you can just copy-paste it from the data that accompanies the video course. I’m now going to create a list that will contain the index labels, or the row labels, for our DataFrame. I’m going to call this, say, index.
03:29 I’m going to use the numbers 101 to 107, and so maybe what I’ll use instead is Python’s range() function. Start at 101 and go to 107, so I’m going to go 108, because range() leaves out its stop value. Run those cells with Shift + Enter, and then let’s use the constructor for the DataFrame object.
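As a quick check of that range() call, here is a small sketch in plain Python. The name labels is only there for illustration and isn’t part of the lesson.

```python
# Row labels for the seven candidates: 101 through 107.
# range() leaves out its stop value, so the stop has to be 108.
index = range(101, 108)

# Materialize the range as a list just to look at the values.
labels = list(index)
print(labels)
print(len(labels))
```

The list should run from 101 to 107, one label for each of the seven rows in the data dictionary.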
03:54 We’ll pass in the data and then we’ll specify the index for the rows. That is using the keyword argument index, and the index labels are stored in that index range object that we created, which could have been a list or a tuple.
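Put together, the cells up to this point might look roughly like the sketch below. It assumes pandas is installed, and it already stores the result under the name df, which the lesson does a moment later.

```python
import pandas as pd

# The data dictionary from the lesson description:
# keys become column labels, values become the column data.
data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}

# Row labels 101 to 107 (the stop value 108 is excluded).
index = range(101, 108)

# Build the DataFrame from the dictionary and the row labels.
df = pd.DataFrame(data=data, index=index)
print(df)
```

A list or a tuple of the same seven labels should work for the index argument as well, as mentioned above.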
04:12 Let’s run that, Shift + Enter, and there we get a nice printout of the DataFrame. Now let’s save this DataFrame in, say, the variable df,
04:24 run that again, and let’s take a look at the type of df.
04:30 We’ve got the pandas module, and then we’ve got modules contained within pandas, and then the final type is called the DataFrame.
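A minimal sketch of that type check, using a shortened two-row version of the data so the snippet stays small. The exact module path that gets printed may vary between pandas versions, so treat the printed text as informational only.

```python
import pandas as pd

# A shortened, two-row version of the lesson's data (illustration only).
df = pd.DataFrame(
    {'name': ['Xavier', 'Ann'], 'age': [41, 28]},
    index=range(101, 103),
)

# Show the full type, including the module it lives in.
print(type(df))

# The class is also reachable directly as pd.DataFrame.
print(isinstance(df, pd.DataFrame))
```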
04:40 Now let’s take a look at two important attributes of a DataFrame object. These are the .index, which in this case is a RangeIndex object.
04:50 It starts at 101 and it stops at 108, which isn’t included, with a step of 1. So this is, essentially, like a range object that you’re used to in Python.
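Here is a small sketch of inspecting the .index attribute. It rebuilds the DataFrame with just two of the lesson’s columns to stay short; the .start, .stop, and .step attributes are the RangeIndex counterparts of the same attributes on a Python range.

```python
import pandas as pd

# Two columns from the lesson's data are enough to look at the row labels.
data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'age': [41, 28, 33, 34, 38, 31, 37],
}
df = pd.DataFrame(data, index=range(101, 108))

# Passing a range as the index gives a RangeIndex.
row_labels = df.index
print(row_labels)

# start is included, stop is excluded, just like range().
print(row_labels.start, row_labels.stop, row_labels.step)
print(list(row_labels))
```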
04:59 Then we’ve got the .columns attribute, and this is a pandas Index object. If we take a look at the type,
05:09 this is an Index object in the pandas module, which is also one of the main data structures in pandas. There are two important methods on a DataFrame object, and these are the .head() and the .tail().
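Going back to the .columns attribute for a moment, a sketch of inspecting it could look like this. The two-row data is a shortened stand-in for the lesson’s dictionary, and the printed representation may differ slightly between pandas versions.

```python
import pandas as pd

# A shortened stand-in for the lesson's data, keeping all four column labels.
data = {
    'name': ['Xavier', 'Ann'],
    'city': ['Mexico City', 'Toronto'],
    'age': [41, 28],
    'py-score': [88.0, 79.0],
}
df = pd.DataFrame(data, index=range(101, 103))

# The column labels come from the dictionary keys, in insertion order.
column_labels = df.columns
print(column_labels)
print(type(column_labels))

# Convert to a plain list when ordinary Python strings are needed.
print(list(column_labels))
```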
05:22 You may be familiar with the two Unix commands with the same names, head and tail. What they do is return only the first five rows (that’s for the .head()), and then for the .tail(),
05:38 this returns the last five rows of the DataFrame. So this is useful, for example, when you’ve loaded up or you’ve created a DataFrame that contains many rows and you want to get a quick visual of the DataFrame, so a good thing to do is just take a look at the first five rows or so, or the last rows, and you can also pass in a value for how many rows you want to view.
06:00 So for example, if we just want to view the first three, we can pass in an argument of 3, and so on. All right, so there you go! You created your first pandas DataFrame.
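The .head() and .tail() calls from this part of the lesson might look like the following sketch. The names first_five, last_five, and first_three are only for illustration.

```python
import pandas as pd

data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}
df = pd.DataFrame(data, index=range(101, 108))

# With no argument, both methods return five rows.
first_five = df.head()
last_five = df.tail()

# Pass a number to choose how many rows come back.
first_three = df.head(3)

print(first_five)
print(last_five)
print(first_three)
```

Both methods return a new DataFrame, so the result can be stored or chained like any other DataFrame.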
06:11 You also took a look at a couple of important attributes, which are the index and the columns on a DataFrame, and a couple of useful methods called .head() and .tail(), which allow you to take a little peek in on the DataFrame, either from the top rows or from the bottom rows.
06:26 Let’s continue with this introduction to pandas by going over how you access individual rows and columns in a DataFrame.
Martin Breuss RP Team on Aug. 17, 2021
Well spotted @ymlin and thanks for the heads up. I fixed the missing comma in the description :) Glad you’re enjoying the course!
macro84 on Nov. 8, 2021
How can I remove columns that have NaNs between a certain date range (the date being the index in one dataframe, and the same thing for another dataframe where date is a column)? Thanks
ymlin on Aug. 16, 2021
Great introduction. It looks like a comma is missing at the end of ‘city’ in the Description (it’s correct in the video), corrected as below: