Understanding DataFrame Attributes

00:00 Let’s go over some of the ways that we can access the data in a DataFrame. We already did a little bit of this back in a previous lesson when we did the broad overview of pandas, so let’s quickly go over this.

00:13 Probably the two most important attributes of a DataFrame are the .index and the .columns attributes. To take a look at the index of a DataFrame, we simply type .index.

00:26 And in this case, with our job candidates DataFrame, the index is RangeIndex that starts at 101 and stops at 108.

00:36 Now, this is a sequence type object. So for example, I can access individual elements of this RangeIndex by using list notation. To access the first element, 101, I use the index value of 0, and to access the second element I use the index value of 1.
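A minimal sketch of the index access just described. The lesson does not list the rows, so the sample values here are made up for illustration; only the column names, the seven-row shape, and the 101 to 107 row labels come from the transcript.

```python
import pandas as pd

# Hypothetical sample rows (assumption); the column names and the
# 101..107 row labels follow the lesson.
data = {
    "name": ["Xavier", "Ann", "Jana", "Yi", "Robin", "Amal", "Nori"],
    "city": ["Mexico City", "Toronto", "Prague", "Shanghai",
             "Manchester", "Cairo", "Osaka"],
    "age": [41, 28, 33, 34, 38, 31, 37],
    "py-score": [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
}
df = pd.DataFrame(data, index=range(101, 108))

print(df.index)        # a RangeIndex with start=101 and stop=108

first = df.index[0]    # positions are zero-based, so 0 is the first label
second = df.index[1]   # and 1 is the second label
print(first, second)   # 101 102
```

Passing a range object as the index is what makes pandas build a RangeIndex here rather than a general Index.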

00:54 To get the column labels, we use the .columns attribute of a DataFrame.

01:00 This returns an Index object, and it’s also a sequence, so we can access individual elements of this Index object just using regular list notation. So for example, if I want to access the third element, which would give me 'age', I can just use regular list notation. Now, both the .index and .columns attributes, they return Index objects, and Index objects are immutable.
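The column lookup above can be sketched the same way. The two rows are hypothetical; only the four column names are taken from the lesson.

```python
import pandas as pd

# Hypothetical two-row frame (assumption); column names follow the lesson.
df = pd.DataFrame(
    {
        "name": ["Xavier", "Ann"],
        "city": ["Mexico City", "Toronto"],
        "age": [41, 28],
        "py-score": [88.0, 79.0],
    },
    index=range(101, 103),
)

print(df.columns)       # an Index holding the four column labels

third = df.columns[2]   # zero-based position 2 is the third label
print(third)            # age
```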

01:26 So for example, I could access an individual Index object element, and if I wanted to change it to, say, 100,

01:38 I would get a TypeError because an Index object in pandas doesn’t support mutable operations. And I would get a similar error if I wanted to change an individual element of the .columns index.
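As a small sketch of that immutability, the snippet below tries to assign to one element of each Index and records the exception. The three-row frame and the replacement label "years" are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical small frame (assumption) with row labels 101..103.
df = pd.DataFrame({"age": [41, 28, 33]}, index=range(101, 104))

try:
    df.index[0] = 100            # item assignment on an Index is rejected
    index_error = None
except TypeError as exc:
    index_error = exc

try:
    df.columns[0] = "years"      # the same applies to the column labels
    columns_error = None
except TypeError as exc:
    columns_error = exc

print(type(index_error).__name__, type(columns_error).__name__)
```

Both attempts fail, and the original labels are left untouched.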

01:51 However, I can change the entire index. If we recall, this is a RangeIndex from 101 to 108. Suppose I wanted to change it to a range that starts at 10 and goes to 16. I can do that by passing in a NumPy arange object that goes from 10 to 17. This is similar to Python’s range() function.

02:18 The NumPy arange() function will create a range, and the stop value is not included in the actual numbers that are generated.

02:30 Now, the .index attribute or object is an Int64Index object starting at 10 all the way up to 16.
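Replacing the whole index, as opposed to one element, might look like the sketch below. The ages are hypothetical sample values. The class name shown when printing the new index depends on the pandas version: older releases report an Int64Index, while pandas 2.0 and later report a plain Index with an integer dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical ages (assumption); seven rows labeled 101..107 as in the lesson.
df = pd.DataFrame({"age": [41, 28, 33, 34, 38, 31, 37]}, index=range(101, 108))

# The stop value 17 is excluded, so this produces 10, 11, ..., 16.
df.index = np.arange(10, 17)

print(df.index)
```

The new index must have the same length as the DataFrame, which is why the range spans seven values.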

02:40 All right, so that’s how you access the index and the columns of a DataFrame. Now, if you remember, there’s a third piece to a DataFrame, and those are the actual values.

02:49 To access the values, you use the .values attribute.

02:55 This returns a two-dimensional NumPy array where each of the rows of the NumPy array are the rows of the DataFrame. There’s also a method called .to_numpy(),

03:09 which does the same thing. Now, the pandas documentation suggests that you should use the .to_numpy() method instead because it offers a little more flexibility through a couple of keyword arguments. So read up on that if you want to specify the data type of the resulting NumPy array with the dtype keyword, or if you want to control copying with the copy keyword: passing False lets pandas reuse the original data from the DataFrame where it can, and passing True forces a copy of the data. Now, another important attribute of the DataFrame is the .dtypes attribute.

03:44 This returns a Series object with the column names as the labels and the corresponding data types as the values. So in this case, we see that the name and the city column, they both have a data type of object, whereas the age has an int64 data type and the py-score has a float64. The object data type is going to be used for strings or if you have a column with mixed data types.
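The value access and the .dtypes lookup described above can be sketched together. The two rows are hypothetical. The dtype and copy keywords are applied only to the numeric columns here, since a frame that also holds strings cannot be converted to a float array.

```python
import pandas as pd

# Hypothetical two-row frame (assumption); column names follow the lesson.
df = pd.DataFrame(
    {
        "name": ["Xavier", "Ann"],
        "city": ["Mexico City", "Toronto"],
        "age": [41, 28],
        "py-score": [88.0, 79.0],
    },
    index=range(101, 103),
)

values = df.to_numpy()   # 2D array with one row per DataFrame row
print(values.shape)      # (2, 4)

# dtype= sets the element type of the result; copy=True forces a copy.
scores = df[["age", "py-score"]].to_numpy(dtype="float32", copy=True)
print(scores.dtype)      # float32

print(df.dtypes)         # Series mapping each column name to its data type
```

The label printed for the name and city columns can differ between pandas versions, so treat the object type mentioned in the lesson as what the instructor's installation showed.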

04:09 Most of the time, you’re going to rely on pandas to specify the data types when you create a DataFrame. But if you did want to change the data types, you could use the .astype() method on a DataFrame. It takes a keyword argument called dtype, which can be a dictionary: the keys of the dictionary are the columns that you want to change, and the values are the data types that you want to convert to. So, for example, let’s suppose we wanted to change the age column to have a NumPy int32 data type and the py-score column to have a NumPy float32 data type. This would save some memory.

04:59 Now, this would return a new DataFrame, and so let’s just save this in a DataFrame called df_. Let’s run that, and now let’s take a look at the data types for this new DataFrame.

05:15 Well, you see that the age is now int32 and the py-score is float32.
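A sketch of that conversion, using hypothetical rows. The name df_ follows the lesson; everything else about the data is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical rows (assumption); column names follow the lesson.
df = pd.DataFrame(
    {"name": ["Xavier", "Ann"], "age": [41, 28], "py-score": [88.0, 79.0]}
)

# Keys are the columns to convert, values are the target data types.
df_ = df.astype(dtype={"age": np.int32, "py-score": np.float32})

print(df_.dtypes)   # age is now int32 and py-score is float32
print(df.dtypes)    # the original DataFrame keeps its 64-bit types
```

Columns that are not named in the dictionary, such as name, keep their existing data type.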

05:21 All right, now let’s take a look at some attributes that give us the dimensions and the size of a pandas DataFrame, and these are going to be similar to the NumPy array attributes: .ndim, .size, and .shape.

05:34 A DataFrame has a .ndim attribute. This is the number of dimensions—in this case, 2. And then the .size is going to return the total number of elements, so 28. And if we take a look at the .shape, this is a 7 by 4 tabular DataFrame.

05:53 We’ve got 7 rows and 4 columns, and that’s why we got a size of 28: 7 times 4 is 28. And the last attribute or method that you might find useful is the amount of memory used by your DataFrame.
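The three dimension attributes can be checked on a seven-row, four-column frame like the one in the lesson. The cell values below are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows (assumption); 7 rows and 4 columns as in the lesson.
data = {
    "name": ["Xavier", "Ann", "Jana", "Yi", "Robin", "Amal", "Nori"],
    "city": ["Mexico City", "Toronto", "Prague", "Shanghai",
             "Manchester", "Cairo", "Osaka"],
    "age": [41, 28, 33, 34, 38, 31, 37],
    "py-score": [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
}
df = pd.DataFrame(data, index=range(101, 108))

print(df.ndim)    # 2
print(df.size)    # 28
print(df.shape)   # (7, 4)

rows, cols = df.shape   # the tuple unpacks into row and column counts
```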

06:09 This is obtained by using the .memory_usage() method. This returns a Series object with the column names as the labels of the Series object and the memory usage in bytes as the data values.

06:23 So in this case, the last two columns, age and py-score, each use 28 bytes of memory. That’s because each column has seven values stored in a 32-bit data type (int32 for age, float32 for py-score), which takes up 4 bytes per value, and 7 values times 4 bytes gives us 28 bytes.
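The 28-byte figure can be reproduced with a sketch like the one below, assuming the two numeric columns have already been converted to 32-bit types as earlier in the lesson. The values themselves are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values (assumption); seven rows, cast to 32-bit types.
df_ = pd.DataFrame(
    {
        "age": [41, 28, 33, 34, 38, 31, 37],
        "py-score": [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
    },
    index=range(101, 108),
).astype({"age": np.int32, "py-score": np.float32})

# Series of byte counts; by default it also includes an "Index" entry.
usage = df_.memory_usage()

print(usage["age"], usage["py-score"])   # 28 28, i.e. 7 values * 4 bytes
```

The byte count for the index entry varies with the kind of index and the pandas version, so it is left out of the comparison here.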

06:43 Let’s do a quick recap of some of these attributes and basic methods that we’ve discussed. We went over the .index and the .columns attributes. These return the row labels of a DataFrame and the column labels.

06:55 The third component of a DataFrame is the values, and these are stored in the .values attribute, which can also be obtained by using the .to_numpy() method on a DataFrame.

07:05 This returns a 2D NumPy array of values. The .dtypes attribute is a Series object containing the data type of each of the columns, and the index of the Series object is made up of the names of the columns.

07:18 And then if we wanted to change the data types of the DataFrame that we’re working with, we can use the .astype() method, and this will return a new DataFrame with the specified data types of the columns that we want to change.

07:31 Then there are three attributes that describe the size of the DataFrame, and these are very similar to the attributes in a NumPy array. These are .ndim, .size, and .shape. .ndim returns the number of dimensions of the DataFrame, .size is the total number of values, and then .shape returns a tuple containing the size of each of the dimensions of the DataFrame.

07:55 All right, so in the next lesson, we’ll talk about accessing and modifying data in a DataFrame.
