The Pandas DataFrame (Overview)

The pandas DataFrame is a structure that contains two-dimensional data and its corresponding labels. DataFrames are widely used in data science, machine learning, scientific computing, and many other data-intensive fields.

DataFrames are similar to SQL tables or the spreadsheets that you work with in Excel or Calc. In many cases, DataFrames are faster, easier to use, and more powerful than tables or spreadsheets because they’re an integral part of the Python and NumPy ecosystems.

In this course, you’ll learn:

  • What a pandas DataFrame is and how to create one
  • How to access, modify, add, sort, filter, and delete data
  • How to handle missing values
  • How to work with time-series data
  • How to quickly visualize data

00:00 Hey there! Welcome to another Real Python video course. I’m going to be your instructor, Cesar Aguilar, and guide you through the very basics of pandas and its main data structure called a DataFrame. What is pandas?

00:15 pandas is a fast, powerful, and easy-to-use data analysis and manipulation tool. No really. What is pandas? pandas is just a Python library with extensive functionality for working with tabular data. And tabular data is all around us!

00:32 You can get tabular data off the internet, maybe somebody sends you a spreadsheet, or maybe you generate some report from a database. Maybe you’ve been using Python for some time now, and you’re happy to stick with your spreadsheet application and wonder what the fuss is about with this pandas library.

00:51 So, why use pandas? Well, let me give you a quick example here. Suppose your organization wants to fill a position for a Python developer. You did an initial screening of the candidates by giving them a Python test, and now you have a CSV file containing the Python scores.

01:08 The CSV file contains, in the first line, the headers, or the field names, of the CSV data. You’ve got the name of the candidate, the city where they’re from, their age, and then the score on the Python test.

01:22 In addition, each row contains some sort of identifier that uniquely identifies each of the candidates. And what you’d like to do is pull out, or extract, from the CSV file all of the candidates that scored at least an 80 on the Python test, and you want to write those candidates in a new CSV file so you can go ahead and continue the hiring process with those candidates.

01:46 Let’s take a look, then, at what you’d write if you were to do this sort of manually in Python. As you can see, there’s quite a bit of code. Let’s go over the main idea.

01:57 You’re opening up a file to read that contains the job candidates’ information, and then you’re also opening up a file in write mode that will be used to write all of the candidates that have a score of at least 80 on the Python test.

02:11 You can see that it’s quite a bit of code to get something done that’s fairly straightforward. When you do create this file, you get the top candidates that had at least a Python score of 80.

02:24 And you may have noticed that the code didn’t do anything about sorting the candidates, say, from highest Python score to lowest Python score.

02:34 We just loop through each of the candidates and then save their information if their Python score was at least 80.
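
The on-screen code is not reproduced in this transcript. As a rough sketch of what a manual approach along these lines might look like, the example below uses the standard csv module. The column names (id, name, city, age, py-score) and the sample rows are made up for illustration, and in-memory text buffers stand in for the input and output files so the sketch runs without touching the disk.

```python
import csv
import io

# Hypothetical sample data: the column names and rows are assumptions,
# not the actual file used in the course.
SAMPLE_CSV = """id,name,city,age,py-score
101,Ana,Lisbon,41,88.0
102,Ben,Toronto,28,79.0
103,Chloe,Prague,33,81.0
104,Dev,Pune,34,80.0
105,Emi,Osaka,38,68.0
"""


def filter_candidates(infile, outfile, min_score=80.0):
    """Copy the header and every row whose py-score is at least min_score."""
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(
        outfile, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    # Loop through each candidate and keep only the ones that pass.
    for row in reader:
        if float(row["py-score"]) >= min_score:
            writer.writerow(row)


# io.StringIO objects stand in for files opened in read and write mode.
source = io.StringIO(SAMPLE_CSV)
target = io.StringIO()
filter_candidates(source, target)

result_manual = target.getvalue()
print(result_manual)
```

With real files, the two buffers would be replaced by open() calls in read and write mode, passing newline="" as the csv module documentation recommends.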

02:43 Now let me show you how you would do this exact same thing using the pandas module. So, here it is. Here’s the code that you would need if you were going to use the pandas module to accomplish the exact same thing.

02:57 Just a few lines of code, where you’re opening up the file, extracting only the candidates that have at least a score of 80, and then simply writing that information to a CSV file.
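
The pandas code shown on screen is also not part of this transcript. A minimal sketch of the same three steps (read, filter, write) could look like the following, again assuming made-up column names and sample rows, and reading from an in-memory buffer rather than a real file.

```python
import io

import pandas as pd

# Hypothetical sample data: the column names and rows are assumptions,
# not the actual file used in the course.
SAMPLE_CSV = """id,name,city,age,py-score
101,Ana,Lisbon,41,88.0
102,Ben,Toronto,28,79.0
103,Chloe,Prague,33,81.0
104,Dev,Pune,34,80.0
105,Emi,Osaka,38,68.0
"""

# Read the CSV data, using the id column as the row labels.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), index_col="id")

# Keep only the candidates with a score of at least 80.
top = df[df["py-score"] >= 80]

# Calling to_csv() without a path returns the CSV text as a string;
# passing a filename instead would write the file to disk.
csv_text = top.to_csv()
print(csv_text)
```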

03:10 And, in fact, if you wanted to sort the candidates from highest Python score to lowest Python score, you would add one more function call using a method that we’ll talk about called .sort_values().

03:22 And then once you’ve done that, you’ll go ahead and write to a CSV file.
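
As a sketch of the sorting step, .sort_values() can be chained onto the filtered DataFrame before writing. The column name py-score and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical candidates: names and scores are made up for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Chloe", "Dev", "Emi"],
        "py-score": [88.0, 79.0, 81.0, 80.0, 68.0],
    },
    index=[101, 102, 103, 104, 105],
)

# Filter first, then sort from highest to lowest score.
top_sorted = df[df["py-score"] >= 80].sort_values(by="py-score", ascending=False)

# Calling to_csv() without a path returns the CSV text as a string.
sorted_csv = top_sorted.to_csv()
print(sorted_csv)
```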

03:27 And in this case, we actually get a CSV file where the candidates are sorted from highest Python score to lowest Python score. Now, this is a tiny peek at what you can do in pandas. I mean, if you’re only working with a basic CSV file and tabular data and just want to do basic analysis, already maybe you’re tempted to go ahead and learn pandas. Believe me, once you learn pandas, you’ll wonder why you never took the time to learn it earlier.

03:58 Let’s talk about what we’ll do in this course.

04:02 This is going to be an introduction to pandas and its main data structure, the DataFrame. The course assumes no prior knowledge of pandas. We’ll start at the very beginning. But of course, if you know a little bit of NumPy, that’s a plus but certainly not necessary.

04:19 And the only real thing that you actually need from NumPy is knowing that you can do fancy indexing and that you can perform operations on columns instead of just individual cells.
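
For readers new to NumPy, here is a small sketch of the two ideas mentioned: operating on a whole column at once, and fancy indexing with an integer list or a Boolean mask. The array values are made up for illustration.

```python
import numpy as np

# Hypothetical data: each row is a candidate, columns are (py-score, age).
data = np.array(
    [
        [88.0, 41.0],
        [79.0, 28.0],
        [81.0, 33.0],
    ]
)

# Operate on a whole column at once instead of looping over cells.
py_scores = data[:, 0]
as_fraction = py_scores / 100

# Fancy indexing with a list of row positions.
first_and_last = data[[0, 2]]

# Fancy indexing with a Boolean mask: rows whose score is at least 80.
passed = data[py_scores >= 80]

print(as_fraction)
print(first_and_last)
print(passed)
```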

04:31 I’m going to be using a Jupyter Notebook in this course, and so if you know the basics of Jupyter, then that’s a plus, but certainly not necessary. And if you’ve never used it before, after this course, you’ll know all the very basic functionality in a Jupyter Notebook. Here’s a broad overview of what we’ll do in this course.

04:51 Of course, we’ll talk about the pandas DataFrame: how to create one and what its basic attributes and methods are. We’ll talk about how to access, modify, add, sort, filter, and delete data in the DataFrame.

05:04 We’ll talk about how to handle missing values. In real-world applications, the data often comes in with some values missing.
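
As a small preview, and only a sketch on made-up data, pandas represents a missing number as NaN and offers methods such as .isna(), .dropna(), and .fillna() to detect, drop, or fill such values.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Chloe"],
        "py-score": [88.0, np.nan, 81.0],
    }
)

# Count the missing scores.
missing_count = int(df["py-score"].isna().sum())

# Option 1: drop the rows that contain a missing value.
dropped = df.dropna()

# Option 2: fill the gap, here with the mean of the known scores.
filled = df.fillna({"py-score": df["py-score"].mean()})

print(missing_count)
print(dropped)
print(filled)
```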

05:13 One of the main reasons pandas was created was to provide a flexible tool for quantitative analysis of financial data, and so we’ll certainly talk about time-series data.
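
To give a rough idea of what working with time-series data can look like, the sketch below builds a DataFrame indexed by dates, selects a date range, and averages over two-day windows. The column name temp_c and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical daily readings indexed by date.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
temps = pd.DataFrame(
    {"temp_c": [8.0, 9.5, 7.0, 6.5, 10.0, 11.0]},
    index=dates,
)

# Select a range of dates by label; both endpoints are included.
first_three = temps.loc["2024-01-01":"2024-01-03"]

# Group the readings into two-day windows and average each window.
two_day_mean = temps.resample("2D").mean()

print(first_three)
print(two_day_mean)
```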

05:25 And then we’ll briefly touch on some quick ways that you can visualize data.

05:31 All right! So, I’ve given you only a little taste of what pandas is about, but hopefully you have enough motivation to want to learn this awesome tool and work with tabular data in ways that you didn’t think were possible.
