Understanding DataFrame Attributes

The pandas DataFrame: Working With Data Efficiently Cesar Aguilar 08:02

00:00 Let’s go over some of the ways that we can access the data in a DataFrame. We already did a little bit of this back in a previous lesson when we did the broad overview of pandas, so let’s quickly go over this.

00:13 Probably the two most important attributes of a DataFrame are the .index and the .columns attributes. To take a look at the index of a DataFrame, we simply type .index.

00:26 And in this case, with our job candidates DataFrame, the index is a RangeIndex that starts at 101 and stops at 108.

00:36 Now, this is a sequence type object. So for example, I can access individual elements of this RangeIndex by using list notation. To access the first element, 101, I use the index value of 0, and to access the second element, I use the index value of 1.
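
As a minimal sketch of the steps above: the data below is a made-up stand-in for the job candidates DataFrame (the transcript does not list the actual values), built with a range from 101 to 108 as the row labels.

```python
import pandas as pd

# Stand-in data (an assumption); only the row labels matter here.
df = pd.DataFrame(
    {
        "name": ["Xavier", "Ann", "Jana", "Yi", "Robin", "Amal", "Nori"],
        "age": [41, 28, 33, 34, 38, 31, 37],
    },
    index=range(101, 108),  # the stop value, 108, is not included
)

print(df.index)  # RangeIndex(start=101, stop=108, step=1)

first = df.index[0]   # first row label
second = df.index[1]  # second row label
print(first, second)  # 101 102
```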

00:54 To get the column labels, we use the .columns attribute of a DataFrame.

01:00 This returns an Index object, and it’s also a sequence, so we can access individual elements of this Index object using regular list notation. So for example, if I want to access the third element, which would give me 'age', I can use the index value of 2. Now, both the .index and .columns attributes return Index objects, and Index objects are immutable.
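
A short sketch of reading the column labels, assuming the four column names name, city, age, and py-score in that order (the values themselves are placeholders):

```python
import pandas as pd

# Placeholder rows (an assumption); the column order is what matters.
df = pd.DataFrame(
    {
        "name": ["Xavier", "Ann"],
        "city": ["Mexico City", "Toronto"],
        "age": [41, 28],
        "py-score": [88.0, 79.0],
    }
)

columns = df.columns      # an Index object holding the column labels
third = df.columns[2]     # positions start at 0, so 2 is the third label
print(third)              # age
```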

01:26 So for example, I could access an individual element of the index, and if I wanted to change it to, say, 100,

01:38 I would get a TypeError because an Index object in pandas doesn’t support mutable operations. And I would get a similar error if I wanted to change an individual element of the .columns index.
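
To see the immutability in action, here is a small sketch with a made-up three-row DataFrame. The flag names are only for illustration.

```python
import pandas as pd

# Made-up data (an assumption), just enough to have an index and a column.
df = pd.DataFrame({"age": [41, 28, 33]}, index=range(101, 104))

index_error = False
try:
    df.index[0] = 100  # item assignment on an Index is not allowed
except TypeError as exc:
    index_error = True
    print(f"TypeError: {exc}")

columns_error = False
try:
    df.columns[0] = "years"  # the same applies to the column labels
except TypeError as exc:
    columns_error = True
    print(f"TypeError: {exc}")

# The labels are unchanged after the failed assignments.
print(df.index[0], df.columns[0])  # 101 age
```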

01:51 However, I can change the entire index. So if we recall, this is a RangeIndex from 101 to 108, and, for example, I could change it to a range that starts at 10 and goes to 16 by assigning the result of the NumPy arange() function called with 10 and 17. This is similar to Python’s range() function.

02:18 The NumPy arange() function will create a range, and the stop value is not included in the actual numbers that are generated.

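02:30 Now, the .index attribute is an Int64Index object starting at 10 and going all the way up to 16. (In pandas 2.0 and later, Int64Index no longer exists, and the same result is displayed as a plain Index with an integer dtype.)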
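
Replacing the whole index could look like the sketch below. The DataFrame is a made-up seven-row stand-in; the exact class name printed for the new index depends on the pandas version.

```python
import numpy as np
import pandas as pd

# Made-up data (an assumption) with seven rows labeled 101 to 107.
df = pd.DataFrame(
    {"age": [41, 28, 33, 34, 38, 31, 37]},
    index=range(101, 108),
)

# Individual labels can't be changed, but the whole index can be reassigned.
# The new index must have the same length as the DataFrame (7 here).
df.index = np.arange(10, 17)  # 10, 11, ..., 16; the stop value 17 is excluded

new_labels = list(df.index)
print(new_labels)  # [10, 11, 12, 13, 14, 15, 16]
```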

02:40 All right, so that’s how you access the index and the columns of a DataFrame. Now, if you remember, there’s a third piece to a DataFrame, and those are the actual values.

02:49 To access the values, you use the .values attribute.

02:55 This returns a two-dimensional NumPy array where each row of the array is a row of the DataFrame. There’s also a method called .to_numpy(),

04:59 Now, this would return a new DataFrame, and so let’s just save this in a DataFrame called df_. Let’s run that, and now let’s take a look at the data types for this new DataFrame.

05:15 Well, you see that the age is now int32 and the py-score is float32.
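
The call that produced df_ is not part of this transcript, so the following is only a sketch of one way to get those data types with .astype(). The dictionary argument and the sample data are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up data (an assumption) with one integer and one float column.
df = pd.DataFrame(
    {
        "name": ["Xavier", "Ann", "Jana"],
        "age": [41, 28, 33],
        "py-score": [88.0, 79.0, 81.0],
    }
)

# .astype() returns a new DataFrame; the original df is left unchanged.
df_ = df.astype({"age": np.int32, "py-score": np.float32})

print(df_.dtypes)
```

Columns that are not named in the dictionary, like name here, keep their original data type.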

05:21 All right, now let’s take a look at some attributes that give us the dimensions and the size of a pandas DataFrame, and these are going to be similar to the NumPy array attributes: .ndim, .size, and .shape.

05:34 A DataFrame has a .ndim attribute. This is the number of dimensions—in this case, 2. And then the .size is going to return the total number of elements, so 28. And if we take a look at the .shape, this is a 7 by 4 tabular DataFrame.

05:53 We’ve got 7 rows and 4 columns, and that’s why we got a size of 28: 7 times 4 is 28. And the last attribute or method that you might find useful is the amount of memory used by your DataFrame.

06:09 This is obtained by using the .memory_usage() method. This returns a Series object with the column names as the labels of the Series object and the memory usage in bytes as the data values.

06:23 So in this case, the last two columns, age and py-score, each use 28 bytes of memory. That’s because each of these columns has seven values of a 32-bit data type (int32 for age and float32 for py-score), and 32 bits is 4 bytes, so 7 values times 4 bytes gives us 28 bytes.
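
The arithmetic above can be checked with a small sketch. The values are made up; what matters is seven rows and 32-bit data types.

```python
import numpy as np
import pandas as pd

# Made-up values (an assumption), stored directly as 32-bit types.
df_ = pd.DataFrame(
    {
        "age": np.array([41, 28, 33, 34, 38, 31, 37], dtype=np.int32),
        "py-score": np.array([88, 79, 81, 80, 68, 61, 84], dtype=np.float32),
    }
)

# Returns a Series: labels are "Index" plus the column names,
# values are the memory used in bytes.
usage = df_.memory_usage()

print(usage["age"])       # 28, that is 7 values * 4 bytes
print(usage["py-score"])  # 28, that is 7 values * 4 bytes
```

The Series also has an entry labeled Index for the row labels. Its size depends on the kind of index and the pandas version, so it isn't shown here.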

06:43 Let’s do a quick recap of some of these attributes and basic methods that we’ve discussed. We went over the .index and the .columns attributes. These return the row labels of a DataFrame and the column labels.

06:55 The third component of a DataFrame is the values, and these are stored in the .values attribute, which can also be obtained by using the .to_numpy() method on a DataFrame.

07:05 This returns a 2D NumPy array of values. The .dtypes attribute is a Series object containing the data types of each of the columns, and the index of that Series object is the names of the columns.
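
As a quick sketch of these three pieces, using a made-up two-column DataFrame:

```python
import numpy as np
import pandas as pd

# Made-up data (an assumption).
df = pd.DataFrame({"age": [41, 28, 33], "py-score": [88.0, 79.0, 81.0]})

values = df.values     # 2D NumPy array, one row per DataFrame row
array = df.to_numpy()  # method that gives the same array
print(values.shape)    # (3, 2)

dtypes = df.dtypes           # Series of data types
print(list(dtypes.index))    # ['age', 'py-score']
```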

07:18 And then if we wanted to change the data types of the DataFrame that we’re working with, we can use the .astype() method, and this will return a new DataFrame with the specified data types of the columns that we want to change.

07:31 Then there are three attributes that describe the size of the DataFrame, and these are very similar to the attributes in a NumPy array. These are .ndim, .size, and .shape. .ndim returns the number of dimensions of the DataFrame, .size is the total number of values, and then .shape returns a tuple containing the size of each of the dimensions of the DataFrame.
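
Here is a minimal sketch of the three attributes on a 7 by 4 DataFrame. The data is a made-up stand-in for the job candidates.

```python
import pandas as pd

# Stand-in data (an assumption): 7 rows and 4 columns.
df = pd.DataFrame(
    {
        "name": ["Xavier", "Ann", "Jana", "Yi", "Robin", "Amal", "Nori"],
        "city": ["Mexico City", "Toronto", "Prague", "Shanghai",
                 "Manchester", "Cairo", "Osaka"],
        "age": [41, 28, 33, 34, 38, 31, 37],
        "py-score": [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
    }
)

print(df.ndim)   # 2
print(df.size)   # 28
print(df.shape)  # (7, 4)

rows, cols = df.shape  # .shape is a tuple, so it can be unpacked
```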

07:55 All right, so in the next lesson, we’ll talk about accessing and modifying data in a DataFrame.
