Refreshing pandas Knowledge

pandas GroupBy: Grouping Real World Data in Python Christopher Trudeau 07:38

00:00 In the previous lesson, I gave an overview of the course. This lesson is a refresher on the parts of pandas you need to know for the rest of the course. The focus of this course is the pandas groupby() method, and I’m going to be making some assumptions about your pandas knowledge.

00:15 If you’ve never done pandas before, it might be better for you to start with a more in-depth course like this one. The pandas concepts that I’m going to assume you are familiar with are creating and accessing DataFrame and Series objects and understanding the difference between them.

00:30 When I say accessing, I mean using dot access, square brackets, slicing, the .loc and .iloc indexers, as well as .pop().
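As a quick reminder of what those access styles look like, here is a minimal sketch. The DataFrame, its column names, and its values are all made up for illustration and are not from the course.

```python
import pandas as pd

# Hypothetical data, invented only to demonstrate the access styles.
df = pd.DataFrame(
    {"title": ["A", "B", "C"], "score": [3.5, 4.0, 4.5], "pages": [100, 200, 300]},
    index=["a", "b", "c"],
)

by_dot = df.score                 # dot access returns the column as a Series
by_bracket = df["score"]          # square brackets return the same column
first_two = df[0:2]               # slicing selects rows by position
by_label = df.loc["b", "title"]   # .loc uses the index label and column name
by_position = df.iloc[1, 0]       # .iloc uses integer positions (row, column)
pages = df.pop("pages")           # .pop() removes a column and returns it
```

Note that .pop() changes the DataFrame itself: after the last line, df no longer has a "pages" column.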

00:39 I won’t be doing a lot with data types in the course, but you should understand that a pandas column has type information associated with it, and that type could be a Python object, a variety of integer and float sizes, or dates, and there’s support for NumPy types in pandas columns as well.
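One way to see the type information attached to each column is the dtypes attribute. This is a small sketch with invented columns. How text columns are reported can vary between pandas versions, so treat the printed output as indicative only.

```python
import pandas as pd

# Hypothetical columns, one per kind of type mentioned above.
df = pd.DataFrame(
    {
        "name": ["x", "y"],        # text
        "count": [1, 2],           # integers
        "ratio": [0.5, 1.5],       # floats
        "when": pd.to_datetime(["2020-01-01", "2021-06-15"]),  # dates
    }
)

# Each column carries its own dtype.
print(df.dtypes)
```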

00:55 I’ll be using four specific pieces of pandas functionality: the CSV reader, the date-time column parser, the column renamer, and the set index call, which changes which column is the index.
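The transcript doesn't name the calls, so the sketch below assumes they are pd.read_csv(), pd.to_datetime(), DataFrame.rename(), and DataFrame.set_index(). The CSV text and column names are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; a real file path would work the same way.
csv_text = "id,Title,published\n1,First,2001-02-03\n2,Second,2010-11-12\n"

df = pd.read_csv(io.StringIO(csv_text))             # the CSV reader
df["published"] = pd.to_datetime(df["published"])   # parse a column into dates
df = df.rename(columns={"Title": "title"})          # rename a column
df = df.set_index("id")                             # make the "id" column the index
```

After set_index(), "id" is no longer a regular column; it labels the rows instead.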

01:09 I’ll also be using the inplace argument to several of these, so you should understand the difference between the default for many methods, which is to return a modified copy of the DataFrame, and doing in-place modification.
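The difference can be sketched with rename() on a tiny, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2]})

# Default behavior: a new DataFrame comes back and the original is untouched.
renamed = df.rename(columns={"A": "a"})
columns_after_copy = list(df.columns)

# With inplace=True the original is modified and the call returns None.
result = df.rename(columns={"A": "alpha"}, inplace=True)
columns_after_inplace = list(df.columns)
```

Because the in-place form returns None, assigning its result back to df would throw the data away.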

01:20 If you need to review, or one or two of these concepts is new to you, don’t worry, I’ll cover that first. If everything I’ve mentioned here is old hat for you, feel free to skip forward to the actual GroupBy lesson.

01:33 Okay, let’s start the review. pandas is all about playing with data, and its main model for doing this is very similar to a spreadsheet. In memory you store data as a series of rows and columns.

01:45 This storage mechanism is based on a class that is the central idea in pandas: the DataFrame. In Excel and other spreadsheets, you can identify a single cell by referencing a column and a row using a letter and a number.

01:59 For example, B3 is the second column and the third row. You can tell it’s a non-programmer’s tool because it doesn’t start at zero. A DataFrame does something a little different from that.

02:11 Instead of using the alphabet to identify a column, you give each column a name. This is typically done at creation time, but can also be modified. Most CSV files have a header row, which names the columns in the file.

02:25 A DataFrame does the same thing. You can use this column name to access the column in the DataFrame in order to manipulate it. There are also ways of doing it by number, but I’ll come back to that later.

02:37 Instead of using a row number, like a spreadsheet, each row in a DataFrame gets a label. This set of labels is known as an index. You can provide your own index based on the data you are using or, like with Excel, you can have it automatically created as an incrementing number. pandas also has its own idea of a sequence, which it calls a Series.

03:00 A Series is like a list in that it’s an ordered one-dimensional storage, but what makes a Series different is that it has an index.
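A minimal sketch of that difference, using invented values: the same three numbers stored with an automatically generated index and with custom labels.

```python
import pandas as pd

# Without an index argument, pandas numbers the items 0, 1, 2.
auto = pd.Series([10, 20, 30])

# With an index argument, each item gets the label you provide.
labeled = pd.Series([10, 20, 30], index=["x", "y", "z"])

value = labeled["y"]  # look up an item by its label
```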

03:09 A DataFrame is actually just a collection of Series objects, where each Series is a column in the DataFrame, and the DataFrame and every Series within it share the same index.
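This relationship can be checked directly. In the sketch below (data invented for illustration), pulling a column out of a DataFrame gives a Series whose index matches the DataFrame's.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

column = df["a"]

is_series = isinstance(column, pd.Series)     # each column is a Series
same_index = column.index.equals(df.index)    # with the same row labels
```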

03:21 The shape of a DataFrame is the size of each of its dimensions. In this course, I’ll be sticking to two-dimensional tables, so the shape is the number of rows and the number of columns.

03:32 The shape of a Series is the number of items in the Series. If you have a Series that has the same number of items in it as the rows in a DataFrame, it can be glued into the DataFrame.
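As a small illustration with made-up data, the shape attribute reports these sizes, and a Series with a matching index can be attached as a new column.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=["r1", "r2", "r3"])
shape_before = df.shape          # (rows, columns)

# A Series with one item per row, sharing the DataFrame's index.
extra = pd.Series([10.0, 20.0, 30.0], index=df.index)
series_shape = extra.shape       # a one-item tuple: the number of items

df["b"] = extra                  # glue the Series in as a new column
shape_after = df.shape
```

pandas lines the values up by index label during the assignment, which is why the common index matters and not just the length.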

03:43 A shared dimension size and common indices are really important later in the course. When you start slicing and dicing the data using groupby(), the selection of what you’re grouping can impact the performance.

03:55 Okay, grab your bamboo. It’s time to head to the REPL and play with some pandas.

04:01 pandas is all about manipulating data so to start off, I’m gonna need some data. Let’s create a little dataset about books.

04:13 I’ll use a dict here, and inside the dict, I’m going to have a key for each column, with the corresponding value being a list of content. The titles here make up a column in the DataFrame.

04:29 I copied this data from LibraryThing, a good tool, even if the UI looks like somebody’s nephew’s MySpace page from 2010, and these were the ratings on the site for those three titles.

04:42 The LibraryThing community appears to be some tough markers.

04:49 Next up is the authors’ last names,

04:56 their first names,

05:04 and finally the authors’ birthdays.

05:09 This dictionary contains the data for five columns, title, score, last and first name, and birthdate, and three rows, Metamorphosis, Time Machine, and Mockingbird.

05:19 Don’t forget though, pandas needs an index for each row. I could let it auto-generate it, but then I wouldn’t be able to easily map it back to the LibraryThing data.

05:29 So instead, I’ll create a list of index numbers.

05:35 These numbers correspond to the IDs on LibraryThing for the three titles. Time to create a DataFrame. First, I’ll import pandas.

05:46 It’s a very common convention when using the pandas library to alias the module as pd like I’ve done here. Saves some typing. And now for the DataFrame.

06:02 The DataFrame constructor accepts several different formats of data. I’ve used the dictionary above, but you can also do a list of dictionaries where each dict acts like a record.

06:11 That format is actually pretty common in the real world as you’re more likely to access data as a grouping of books rather than chopped up as columns like I’ve done here.

06:20 But the dictionary style that I’ve used takes up less space on the screen. The second argument to the DataFrame constructor is the data to use as the index, which here is my list of book IDs.
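The transcript doesn't include what was typed on screen, so the following is only a reconstruction of the idea. The column names, scores, and ID numbers are placeholders and not the real LibraryThing values. The same table is built both ways: from a dict of columns and from a list of record dicts.

```python
import pandas as pd

# Column-oriented dict: one key per column, one list item per row.
data = {
    "title": ["Metamorphosis", "Time Machine", "Mockingbird"],
    "score": [3.9, 3.7, 4.4],  # placeholder ratings
    "last_name": ["Kafka", "Wells", "Lee"],
    "first_name": ["Franz", "H.G.", "Harper"],
    "birthdate": ["1883-07-03", "1866-09-21", "1926-04-28"],
}
book_ids = [101, 102, 103]  # placeholder IDs used as the index

books = pd.DataFrame(data, index=book_ids)

# Record-oriented alternative: a list with one dict per book.
records = [dict(zip(data.keys(), row)) for row in zip(*data.values())]
books_from_records = pd.DataFrame(records, index=book_ids)

print(books)
```

Both constructors produce the same three-row, five-column table. The only difference is how the input is organized.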

06:31 Let’s look at the contents

06:34 and you can see that pandas creates a table. Each of our five keys from the dictionary becomes a column header. The three indexes are there on the left, and each row contains all the data for each book.

06:47 I’ve only got three books in my data, so it isn’t hard to see it all on the screen, but in the real world, you’re not likely to be using pandas just to play with three things.

06:55 You’re probably going to have hundreds or thousands of rows. With longer tables, what pandas will display in the REPL is the first few rows, then dot-dot-dot, then the last few rows.

07:06 You can get at the first few rows yourself by using the head() method.

07:12 By default head() returns the first five rows, but here I’ve given it an argument, so it only shows the first two

07:19 and to go along with head(), of course there is tail(),

07:24 which shows the last five rows. Unless, like I’ve done, you give it an argument.

07:31 There’s still some more review to come. Next up: how to access things in a DataFrame or Series.
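To see both defaults and the effect of the argument, here is a short sketch on a hypothetical ten-row DataFrame instead of the three-book table:

```python
import pandas as pd

# Ten made-up rows so the default of five is visible.
df = pd.DataFrame({"n": range(10)})

first_five = df.head()    # default: first five rows
first_two = df.head(2)    # argument limits it to the first two
last_five = df.tail()     # default: last five rows
last_two = df.tail(2)     # argument limits it to the last two
```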
