Creating DataFrames

Cesar Aguilar

The pandas DataFrame: Working With Data Efficiently Cesar Aguilar 08:39

If you’re following along with this lesson and not using the provided Jupyter Notebook from this course’s supporting materials, you can copy-paste the following lst and lst2 lists:

lst = [{"x": 1, "y": 2, "z": 100},
       {"x": 2, "y": 4, "z": 100},
       {"x": 3, "y": 8, "z": 100}]

lst2 = [[1, 2, 100],
        [2, 4, 100],
        [3, 8, 100]]

00:00 All right, so before we keep going, let’s save this Jupyter Notebook. Head on over to View > Toggle Headers if your headers are hidden and click on the Untitled label and then give your Notebook a name.

00:13 Maybe we’ll call this pandas_dataframe.

00:20 Then go ahead and take off the headers again. Let’s now talk more in detail about how to create DataFrames and the different ways that you can do it. Let’s create a new Markdown cell, and maybe what we’ll do is we’ll give it a heading level number two, and then we’ll call this ## Creating a pandas DataFrame.

00:41 All right, so we’ll talk about the different ways to do that. And probably the best way to learn is to take a look at the help documentation that comes with the DataFrame constructor.

00:52 So go ahead and type pd.DataFrame, and within Jupyter, if you put your cursor on a function or a method and press Shift + Tab, you’re going to get a pop-up window, which displays the signature of the function.

01:10 And if you click the plus icon, the window will grow and you’ll get to see the entire documentation for the method. So this is the DataFrame() constructor, and its keyword arguments are data, index, columns, dtype, and copy.
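Outside of Jupyter, the same signature can be inspected with the standard library. This is a minimal sketch, assuming pandas is installed and imported as pd as in the lesson; the name param_names is only for illustration.

```python
import inspect

import pandas as pd

# Collect the parameter names of the DataFrame constructor.
param_names = list(inspect.signature(pd.DataFrame).parameters)
print(param_names)

# help(pd.DataFrame) would print the full documentation that the
# Jupyter pop-up window shows.
```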

01:26 This will create the primary pandas data structure, and it’s a data structure—a 2D data structure—that will have labeled rows and columns. And even here within the documentation, it says that it can be thought of as a dict-like container for Series objects. So again, you can think of a DataFrame as containing Series objects, either in terms of rows or columns.
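To make the "container for Series objects" idea concrete, here is a small sketch with made-up values (the names column and row are illustrative). Selecting a single column or a single row of a DataFrame gives back a Series.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 8]})

column = df["x"]   # one column, selected by its column label
row = df.loc[0]    # one row, selected by its row label

print(type(column).__name__)
print(type(row).__name__)
```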

01:53 Let’s quickly take a look at some of these parameters. The data keyword argument can be a NumPy array, it could be an iterable object, a dictionary as we saw, or it can be some other DataFrame. And the dictionary object, if it’s passed in, can contain other Series objects or arrays or list-like objects.
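The kinds of input listed above can be sketched as follows. The values and the variable names (from_array, from_dict, from_frame) are placeholders chosen for illustration, not taken from the lesson.

```python
import numpy as np
import pandas as pd

# A 2D NumPy array as data.
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]))

# A dictionary whose values are a Series and a list.
from_dict = pd.DataFrame({"x": pd.Series([1, 2, 3]), "y": [2, 4, 8]})

# Another DataFrame as data.
from_frame = pd.DataFrame(from_dict)

print(from_array.shape, from_dict.shape, from_frame.shape)
```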

02:14 And the index keyword argument is going to be any index object, or it can be another array-like object. If none is passed in, then it’s going to default to a RangeIndex, and so basically these are going to be indexed using integers.

02:31 You can also pass in column names, and the names that you pass in will determine the order of the columns. And again, the default here will be a RangeIndex if you don’t pass in column names.
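A short sketch of these defaults, using illustrative data: with a nested list and no labels, both the row index and the column labels fall back to a RangeIndex, and with a dictionary, the columns argument sets the column order.

```python
import pandas as pd

rows = [[1, 2, 100], [2, 4, 100]]

# No index or columns passed: both default to a RangeIndex.
df_default = pd.DataFrame(rows)
print(df_default.index)
print(df_default.columns)

# With dictionary data, columns determines the order of the columns.
df_ordered = pd.DataFrame({"x": [1, 2], "y": [2, 4]}, columns=["y", "x"])
print(list(df_ordered.columns))
```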

02:44 And then you can also specify the data type of the cell values of your object, so if you want to enforce a certain data type, you can go ahead and do that. And then lastly, there’s the copy keyword argument. If you pass in a DataFrame or a NumPy array as your data, then by default any changes that you make to the NumPy array or the DataFrame will affect the DataFrame that you’re constructing, and vice versa.

03:10 So if you want to construct a DataFrame that’s independent from the data that you’re passing in, just change the copy value to True. All right.
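As a small illustration of the dtype parameter mentioned a moment ago, the sketch below builds the same data twice, once with the inferred type and once forcing floating-point values. The names df_inferred and df_float are illustrative.

```python
import pandas as pd

rows = [[1, 2, 100], [2, 4, 100], [3, 8, 100]]

# Let pandas infer the data type from the values.
df_inferred = pd.DataFrame(rows)

# Force every cell to be stored as a 64-bit float.
df_float = pd.DataFrame(rows, dtype="float64")

print(df_inferred.dtypes)
print(df_float.dtypes)
```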

03:20 So let’s close that up, and why don’t we create a DataFrame using a dictionary just like we did in the previous lesson, but pass in an ordering of the columns.

03:33 Let’s first import numpy. We’ll use this to create arrays that we can pass in as data.

03:41 Let’s create a few cells just so we can push everything to the top. And let’s create a dictionary. The keys are going to be 'x', 'y', and 'z'.

03:56 And let’s make a NumPy array, [2, 4, 8].

04:00 And the last column, called 'z', we’ll pass in a single value. And then now let’s create a DataFrame using that data or that dictionary. And if we don’t pass in any values for the keyword arguments index and columns, in this case, because we’re passing in a dictionary, the keys of the dictionary are going to act as the labels for the columns and then the default indices 0, 1, 2 will be used for the row labels.
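The steps so far might look like the following sketch. The transcript does not say which values go in the 'x' column, so the list [1, 2, 3] is an assumption, as are the variable names d and df.

```python
import numpy as np
import pandas as pd

d = {
    "x": [1, 2, 3],            # assumed values; not stated in the transcript
    "y": np.array([2, 4, 8]),  # a NumPy array as a column
    "z": 100,                  # a single value for the whole column
}

# No index or columns passed: the keys become the column labels
# and the rows get the default labels 0, 1, 2.
df = pd.DataFrame(d)
print(df)
```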

04:31 Now let’s go ahead and change this. Why don’t we make the index have row labels 100, 200, and 300. And then for the columns, we can specify an ordering, say, 'z', 'y', and 'x'.

04:49 And so we get this DataFrame.

04:53 Notice that for the z column, the value of 100 was repeated for every cell in that column.
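A sketch of this version with custom row labels and a custom column order follows. As before, the values in the 'x' column are an assumption, since the transcript does not state them.

```python
import numpy as np
import pandas as pd

d = {"x": [1, 2, 3],            # assumed values for 'x'
     "y": np.array([2, 4, 8]),
     "z": 100}

# Custom row labels and a custom column order.
df = pd.DataFrame(d, index=[100, 200, 300], columns=["z", "y", "x"])
print(df)

# The single value 100 is repeated in every row of column 'z'.
print(df["z"].tolist())
```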

05:02 Let’s suppose, instead, that the data that you get is a list of dictionaries. I’m going to paste some of the data, and you can get this data from the video description. So here, what we get is a list of dictionaries, and each of the dictionaries contains the same keys.

05:23 If we use this data to create our DataFrame,

05:32 the keys of the dictionaries will be the column labels. And because we didn’t pass an index, the default values of 0, 1, and 2 will be used.

05:41 If, instead, we wanted the index to be, say, labeled by the letters 'a', 'b', and 'c', we would just pass those in.
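Using the lst list given at the top of this page, the two variants can be sketched like this. The names df_default and df_labeled are illustrative.

```python
import pandas as pd

lst = [{"x": 1, "y": 2, "z": 100},
       {"x": 2, "y": 4, "z": 100},
       {"x": 3, "y": 8, "z": 100}]

# Default row labels 0, 1, 2; the dictionary keys become column labels.
df_default = pd.DataFrame(lst)
print(df_default)

# Row labels 'a', 'b', 'c' passed through the index argument.
df_labeled = pd.DataFrame(lst, index=["a", "b", "c"])
print(df_labeled)
```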

05:54 Now, you can also use a list or a nested list of the data. So let me create another list. And again, you can get this data from the video description or just type this out.

06:07 We can create a DataFrame this way. What will happen here is each of the lists in this nested list is going to supply the values in one row. So if we run that,

06:22 we see that the row 1, 2, 100 is the data that we passed in as the first list, and so on. Again, you can go ahead and maybe give the columns more descriptive labels depending on what the data’s supposed to represent. So, in this case, we’ll just pass in ['x', 'y', 'z']. And again, you can pass in the index if you don’t want the default values from 0 up to one less than the number of rows. Now, let’s suppose instead of a nested list your data is stored in a NumPy array.
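Before moving on to arrays, the nested-list construction just described can be sketched with the lst2 list from the top of this page. The variable name df is illustrative.

```python
import pandas as pd

lst2 = [[1, 2, 100],
        [2, 4, 100],
        [3, 8, 100]]

# Each inner list becomes one row; columns supplies the labels.
df = pd.DataFrame(lst2, columns=["x", "y", "z"])
print(df)
```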

06:56 So let me grab this nested list, copy it, and let’s create a NumPy array.

07:07 Let’s create a DataFrame using this NumPy array. We’ll call it df_

07:15 and we’ll use that array, and maybe we’ll pass in the columns to be 'x', 'y', and 'z'. This will create the DataFrame very similar to what we had before.

07:30 However, because the data that we passed in is a NumPy array, if we change a value in the NumPy array—say, the 1, 1 entry to 33—then when we take a look at the DataFrame, the corresponding entry also changes to 33.
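The experiment described here can be sketched as below. Two assumptions apply: the name df_from_array is hypothetical, and "the 1, 1 entry" is read as the zero-based position [1, 1]. Whether the change shows up in the DataFrame may depend on the pandas version, since newer releases with Copy-on-Write enabled may copy the array by default, so the sketch prints the result rather than assuming it.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 100],
                [2, 4, 100],
                [3, 8, 100]])

# No copy argument: use whatever the installed pandas does by default.
df_from_array = pd.DataFrame(arr, columns=["x", "y", "z"])

arr[1, 1] = 33  # assumed zero-based position [1, 1]

# True means the DataFrame saw the change made to the array.
change_visible = bool(df_from_array.loc[1, "y"] == 33)
print(pd.__version__, change_visible)
```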

07:51 Now, this may be something that you want, but if it isn’t, then you should pass in a value of True to the keyword argument copy.

08:02 So we’ll go ahead and rerun this cell and make sure the DataFrame contains the data that we’re passing in from the original NumPy array. Then let’s run that command that changes the value of the 1, 1 entry. And then if we take a look at the DataFrame, it still has the original values that were obtained when we first created the array and passed that in for the data.
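A minimal sketch of the copy=True variant follows, again with a hypothetical variable name (df_copy) and reading "the 1, 1 entry" as the zero-based position [1, 1].

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 100],
                [2, 4, 100],
                [3, 8, 100]])

# copy=True gives the DataFrame its own copy of the data.
df_copy = pd.DataFrame(arr, columns=["x", "y", "z"], copy=True)

arr[1, 1] = 33  # change the array after the DataFrame was built

print(arr[1, 1])             # the array holds the new value, 33
print(df_copy.loc[1, "y"])   # the DataFrame keeps the original value, 4
```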

08:33 Coming up next, we’ll take a look at how we can create a DataFrame from a CSV file.
