Understanding DataFrame Attributes
00:00 Let’s go over some of the ways that we can access the data in a DataFrame. We already did a little bit of this back in a previous lesson when we did the broad overview of pandas, so let’s quickly go over this.
00:13
Probably the two most important attributes of a DataFrame are the .index and the .columns attributes. To take a look at the index of a DataFrame, we simply type .index.
00:26
And in this case, with our job candidates DataFrame, the index is a RangeIndex that starts at 101 and stops at 108.
00:36
Now, this is a sequence-type object. So, for example, I can access individual elements of this RangeIndex by using list notation. To access the first element, 101, I use the index value of 0, and to access the second element, I use the index value of 1.
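As a rough sketch of the index access just described, assuming a small stand-in DataFrame (the rows below are illustrative and are not taken from the transcript), it might look like this:

```python
import pandas as pd

# Stand-in data (assumed for illustration; the transcript does not list the rows).
data = {
    "name": ["Xavier", "Ann", "Jana", "Yi", "Robin", "Amal", "Nori"],
    "city": ["Mexico City", "Toronto", "Prague", "Shanghai",
             "Manchester", "Cairo", "Osaka"],
    "age": [41, 28, 33, 34, 38, 31, 37],
    "py-score": [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
}
# Row labels run from 101 up to, but not including, 108.
df = pd.DataFrame(data, index=pd.RangeIndex(101, 108))

print(df.index)     # the RangeIndex holding the row labels
print(df.index[0])  # first row label
print(df.index[1])  # second row label
```

The index behaves like any other sequence here, so positions start at 0 even though the labels start at 101.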
00:54
To get the column labels, we use the .columns attribute of a DataFrame.
01:00
This returns an Index object, and it’s also a sequence, so we can access individual elements of this Index object using regular list notation. So, for example, if I want to access the third element, which would give me 'age', I can just use regular list notation. Now, both the .index and .columns attributes return Index objects, and Index objects are immutable.
01:26
So for example, I could access an individual Index object element, and if I wanted to change it to, say, 100,
01:38
I would get a TypeError because an Index object in pandas doesn’t support mutable operations. And I would get a similar error if I wanted to change an individual element of the .columns index.
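Here is a minimal sketch of that behavior, again assuming an illustrative stand-in DataFrame rather than the lesson's actual data. It reads one column label and then tries to assign to single elements of both Index objects:

```python
import pandas as pd

# Stand-in data (assumed for illustration; the transcript does not list the rows).
data = {
    "name": ["Xavier", "Ann", "Jana", "Yi", "Robin", "Amal", "Nori"],
    "city": ["Mexico City", "Toronto", "Prague", "Shanghai",
             "Manchester", "Cairo", "Osaka"],
    "age": [41, 28, 33, 34, 38, 31, 37],
    "py-score": [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
}
df = pd.DataFrame(data, index=pd.RangeIndex(101, 108))

# Reading a single column label works with list notation.
third_label = df.columns[2]
print(third_label)

# Assigning to a single element of either Index raises a TypeError.
errors = []
try:
    df.index[0] = 100
except TypeError as err:
    errors.append(err)
try:
    df.columns[2] = "years"
except TypeError as err:
    errors.append(err)

for err in errors:
    print(type(err).__name__, err)
```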
01:51
However, I can change the entire index. So if we recall, this is a RangeIndex from 101 to 108. If I wanted to change this to, say, a range that starts at 10 and goes to 16, I could assign a NumPy array created with arange() to the .index attribute, with the arguments going from 10 to 17. This is similar to Python’s range() function.
02:18
The NumPy arange() function will create a range, and the stop value is not included in the actual numbers that are generated.
02:30
Now, the .index attribute is an Int64Index object starting at 10 and going all the way up to 16 (in pandas 2.0 and later, this is displayed as a plain Index with an integer dtype).
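Replacing the whole index can be sketched like this, assuming the same kind of illustrative stand-in DataFrame (not the lesson's actual data):

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed for illustration; the transcript does not list the rows).
data = {
    "name": ["Xavier", "Ann", "Jana", "Yi", "Robin", "Amal", "Nori"],
    "age": [41, 28, 33, 34, 38, 31, 37],
}
df = pd.DataFrame(data, index=pd.RangeIndex(101, 108))

# Single elements cannot be changed, but the index as a whole can be replaced.
# np.arange(10, 17) excludes the stop value, so the labels are 10 through 16.
df.index = np.arange(10, 17)

# How the index type is displayed depends on the pandas version.
print(df.index)
print(df.index.tolist())
```

The new index must have exactly as many labels as the DataFrame has rows, which is why the stop value is 17 for seven rows starting at 10.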
02:40 All right, so that’s how you access the index and the columns of a DataFrame. Now, if you remember, there’s a third piece to a DataFrame, and those are the actual values.
02:49
To access the values, you use the .values attribute.
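A minimal sketch of pulling the values out as a NumPy array, assuming an illustrative stand-in DataFrame (not the lesson's actual data). It uses .values and also the .to_numpy() method, along with that method's dtype and copy keyword arguments:

```python
import pandas as pd

# Stand-in data (assumed for illustration; the transcript does not list the rows).
data = {
    "name": ["Xavier", "Ann", "Jana", "Yi", "Robin", "Amal", "Nori"],
    "city": ["Mexico City", "Toronto", "Prague", "Shanghai",
             "Manchester", "Cairo", "Osaka"],
    "age": [41, 28, 33, 34, 38, 31, 37],
    "py-score": [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
}
df = pd.DataFrame(data, index=pd.RangeIndex(101, 108))

values = df.values    # 2D array: one row per DataFrame row
arr = df.to_numpy()   # same contents, obtained through the method
print(values.shape, arr.shape)

# dtype picks the data type of the result; copy=True asks for an
# independent copy, so changing it leaves the DataFrame untouched.
ages = df["age"].to_numpy(dtype="float64", copy=True)
ages[0] = 0.0
print(ages)
print(df["age"].iloc[0])
```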
02:55
This returns a two-dimensional NumPy array where each row of the array is a row of the DataFrame. There’s also a method called .to_numpy(),
03:09
which does the same thing. Now, the pandas documentation suggests that you should use the .to_numpy() method instead because it offers a little more flexibility through a couple of keyword arguments. So read up on that if you want to specify the data type of the resulting NumPy array, or if you want to control copying: pass False to the copy keyword to reuse the original data from the DataFrame where possible, or True to always make a copy of the data. Now, another important attribute of the DataFrame is the .dtypes attribute.
03:44
This returns a Series object with the column names as the labels and the corresponding data types as the values. So in this case, we see that the name and city columns both have a data type of object, whereas age has an int64 data type and py-score has a float64. The object data type is used for strings or for a column with mixed data types.
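To make that concrete, here is a short sketch that inspects .dtypes on an illustrative stand-in DataFrame (the rows are assumed, not taken from the lesson):

```python
import pandas as pd

# Stand-in data (assumed for illustration; the transcript does not list the rows).
data = {
    "name": ["Xavier", "Ann", "Jana", "Yi", "Robin", "Amal", "Nori"],
    "city": ["Mexico City", "Toronto", "Prague", "Shanghai",
             "Manchester", "Cairo", "Osaka"],
    "age": [41, 28, 33, 34, 38, 31, 37],
    "py-score": [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
}
df = pd.DataFrame(data, index=pd.RangeIndex(101, 108))

# A Series: column names as labels, data types as values.
# The string columns usually show as object; newer pandas versions
# may report a dedicated string dtype instead.
dtypes = df.dtypes
print(dtypes)
print(dtypes["age"], dtypes["py-score"])
```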
04:09
Most of the time, you’re going to rely on pandas to specify the data types when you create a DataFrame, but if you did want to change the data types, you could use the .astype() method on a DataFrame. It takes a parameter called dtype, which can be a dictionary: the keys of the dictionary are the columns that you want to change, and the values are the data types that you want to convert to. So, for example, let’s suppose we wanted to change the age column to have a NumPy int32 data type and the py-score column to have a NumPy float32 data type. This would save some memory.
04:59
Now, this would return a new DataFrame, and so let’s just save this in a DataFrame called df_. Let’s run that, and now let’s take a look at the data types for this new DataFrame.
05:15
Well, you see that the age is now int32 and the py-score is float32.
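The conversion described above could be sketched as follows, assuming an illustrative stand-in DataFrame (the rows are not taken from the lesson); the name df_ follows the transcript:

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed for illustration; the transcript does not list the rows).
data = {
    "name": ["Xavier", "Ann", "Jana", "Yi", "Robin", "Amal", "Nori"],
    "city": ["Mexico City", "Toronto", "Prague", "Shanghai",
             "Manchester", "Cairo", "Osaka"],
    "age": [41, 28, 33, 34, 38, 31, 37],
    "py-score": [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
}
df = pd.DataFrame(data, index=pd.RangeIndex(101, 108))

# Keys are the columns to convert; values are the target data types.
# .astype() returns a new DataFrame and leaves df unchanged.
df_ = df.astype(dtype={"age": np.int32, "py-score": np.float32})

print(df_.dtypes)
print(df.dtypes["age"])  # the original column keeps its data type
```

Columns that are not named in the dictionary, such as name and city, keep whatever data type they already had.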
05:21
All right, now let’s take a look at some attributes that give us the dimensions and the size of a pandas DataFrame, and these are going to be similar to the NumPy array attributes: .ndim, .size, and .shape.
05:34
A DataFrame has a .ndim attribute. This is the number of dimensions, which in this case is 2. And then the .size is going to return the total number of elements, so 28. And if we take a look at the .shape, this is a 7 by 4 tabular DataFrame.
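These three attributes can be checked with a short sketch, assuming an illustrative stand-in DataFrame with seven rows and four columns (the rows themselves are not taken from the lesson):

```python
import pandas as pd

# Stand-in data (assumed for illustration; the transcript does not list the rows).
data = {
    "name": ["Xavier", "Ann", "Jana", "Yi", "Robin", "Amal", "Nori"],
    "city": ["Mexico City", "Toronto", "Prague", "Shanghai",
             "Manchester", "Cairo", "Osaka"],
    "age": [41, 28, 33, 34, 38, 31, 37],
    "py-score": [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
}
df = pd.DataFrame(data, index=pd.RangeIndex(101, 108))

print(df.ndim)   # number of dimensions
print(df.size)   # total number of elements
print(df.shape)  # (rows, columns)

# The size is the product of the two entries in the shape.
rows, cols = df.shape
print(rows * cols == df.size)
```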
05:53
We’ve got 7 rows and 4 columns, and that’s why we got a size of 28: 7 times 4 is 28. And the last attribute or method that you might find useful is the amount of memory used by your DataFrame.
06:09
This is obtained by using the .memory_usage() method. This returns a Series object with the column names as the labels of the Series object and the memory usage in bytes as the data values.
06:23
So in this case, the last two columns, age and py-score, each use 28 bytes of memory. That’s because each of the columns has seven values stored in a 32-bit data type (int32 for age and float32 for py-score), which takes up 4 bytes per value, and 7 values times 4 bytes gives us 28 bytes.
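The 28-byte figure can be reproduced with a small sketch, assuming an illustrative stand-in DataFrame with seven rows whose numeric columns have been converted to 32-bit types (the rows are not taken from the lesson):

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed for illustration; the transcript does not list the rows).
data = {
    "name": ["Xavier", "Ann", "Jana", "Yi", "Robin", "Amal", "Nori"],
    "age": [41, 28, 33, 34, 38, 31, 37],
    "py-score": [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
}
df = pd.DataFrame(data, index=pd.RangeIndex(101, 108))
df_ = df.astype(dtype={"age": np.int32, "py-score": np.float32})

# A Series of byte counts, labeled by column name (plus an entry for the index).
usage = df_.memory_usage()
print(usage)

# 7 values * 4 bytes each = 28 bytes for each 32-bit column.
print(usage["age"], usage["py-score"])

# The 64-bit originals take twice as much: 7 values * 8 bytes = 56 bytes.
print(df.memory_usage()["age"])
```

The entries for the index and for string columns vary with the pandas version, so only the numeric columns are worth comparing directly.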
06:43
Let’s do a quick recap of some of these attributes and basic methods that we’ve discussed. We went over the .index and the .columns attributes. These return the row labels of a DataFrame and the column labels.
06:55
The third component of a DataFrame is the values, and these are stored in the .values attribute, which can also be obtained by using the .to_numpy() method on a DataFrame.
07:05
This returns a 2D NumPy array of values. The .dtypes attribute is a Series object containing the data types of each of the columns, and the index for the Series object is made up of the names of the columns.
07:18
And then if we wanted to change the data types of the DataFrame that we’re working with, we can use the .astype() method, and this will return a new DataFrame with the specified data types of the columns that we want to change.
07:31
Then there are three attributes that describe the size of the DataFrame, and these are very similar to the attributes in a NumPy array. These are .ndim, .size, and .shape. .ndim returns the number of dimensions of the DataFrame, .size is the total number of values, and then .shape returns a tuple containing the size of each of the dimensions of the DataFrame.
07:55 All right, so in the next lesson, we’ll talk about accessing and modifying data in a DataFrame.