Accessing pandas Data

pandas GroupBy: Grouping Real World Data in Python Christopher Trudeau 09:59

Transcript

00:00 In the previous lesson, I did a review of how you construct a DataFrame. In this lesson, I’ll review how to get at the information inside of one.

00:09 pandas is all about the data and as such there are lots of different ways of getting at the info inside a DataFrame. Later in the course, you’ll be using these to slice and dice to group your data together.

00:20 Just how many different ways are there of getting at your stuff? Well, for starters, you can reference a column in a DataFrame using dot notation. In the book example I used in the previous lesson, you can get at that title column using dot title.

00:34 This only works if the name of the column is a valid Python identifier, so you have to stick to alphanumerics and underscores for this case, but that doesn’t mean you have to restrict how you name your columns.

00:46 If you’ve got a funkier name, you can reference the column like accessing the key in a dictionary, giving square brackets and the string name of the column. So much for the columns.
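As a rough sketch of the two column access styles just described, here is a small stand-in DataFrame. The titles, index values, and the "page count" column are invented for illustration and are not the lesson's actual data.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's book data; all values are invented.
df = pd.DataFrame(
    {
        "title": ["Frankenstein", "The Trial", "Dracula"],
        "last_name": ["Shelley", "Kafka", "Stoker"],
        "page count": [280, 255, 418],  # contains a space: not a valid identifier
    },
    index=[1023, 5473, 8841],
)

by_dot = df.title          # dot notation works for identifier-style names
by_key = df["title"]       # square brackets work for any column name
pages = df["page count"]   # dot notation cannot express this name

print(by_dot.equals(by_key))  # True
print(pages.tolist())
```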

00:55 How about the rows? Well, remember that each row in a DataFrame has an index value. This can be a little different if you’re just thinking of it as an array and some accessing can be counterintuitive.

01:08 For example, you can’t just use square brackets and a number, as the square brackets mechanism is overloaded for the column names, which I just described. But oddly enough, you can use a numeric slice.

01:20 I don’t recommend it as it’s confusing, but it is there. The DataFrame is just a big two-dimensional thing, so it might not be surprising that you can get at a cell by using square brackets on a column reference.

01:33 So that can either be dot column name followed by square brackets, or two sets of square brackets for column and row. This can also be a source of confusion, as the access order is column first, like in a spreadsheet, rather than row first, like in nested lists or with .loc[].
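The bracket behaviors described here can be sketched as follows. The DataFrame is a made-up stand-in (only the index value 5473 comes from the lesson), so treat it as an illustration rather than the lesson's code.

```python
import pandas as pd

# Hypothetical stand-in data; titles and most index values are invented.
df = pd.DataFrame(
    {
        "title": ["Frankenstein", "The Trial", "Dracula"],
        "last_name": ["Shelley", "Kafka", "Stoker"],
    },
    index=[1023, 5473, 8841],
)

try:
    df[0]  # treated as a column name, and there is no column called 0
except KeyError:
    print("df[0] raises KeyError")

first_two = df[0:2]                # a slice, however, selects rows by position
cell = df["title"][5473]           # column first, then the row's index value
same_cell = df.loc[5473, "title"]  # .loc[] takes the row first instead

print(cell == same_cell)  # True
```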

01:53 So yeah, consistency isn’t this library’s best feature! What’s .loc[]? Well, it’s an attribute on the DataFrame, which you can use with square brackets to get at different parts of it.

02:04 Using .loc[] you can get at a row, a column, or individual cells. To go with .loc[] is .iloc[], which works just like .loc[], but instead of using column names and index values, it uses Python-style numeric indices.

02:20 So if you want to use zero to mean the first of something, .iloc[] is the way to do it. The final accessor I’ll be using in this course is .pop(), which, like its namesake on a stack, removes the thing being accessed.
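A minimal sketch of how .pop() behaves, assuming a small invented DataFrame (the scores are placeholders, not the lesson's values):

```python
import pandas as pd

# Hypothetical stand-in data; the scores are invented.
df = pd.DataFrame(
    {
        "title": ["Frankenstein", "The Trial", "Dracula"],
        "score": [4.2, 4.5, 4.0],
    },
    index=[1023, 5473, 8841],
)

scores = df.pop("score")  # returns the column as a Series and removes it from df

print(list(df.columns))   # ['title']
print(scores.tolist())    # [4.2, 4.5, 4.0]
```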

02:34 Let’s go off to the REPL and see all of these in action.

02:39 On the screen here, I’ve got the same creation code as in the previous lesson, giving a DataFrame containing three books and here it is. Let’s start by getting at a column.

02:51 Using the square brackets always works, in this case returning the title column. Note that what is coming back is a pandas Series object. You can tell that by two things.

03:01 First, the index for each item is present, and second, it has a little data type spec sheet printed at the bottom. The spec sheet tells you the name and type of the data inside the Series. Square brackets are extra characters and as I mentioned before, you can use a shortcut. If your column name is a valid Python identifier you can use it directly with dot notation.

03:24 And to prove my point, the type of the column is Series. The column is the first dimension. If you want to get at a cell, you can use square brackets to get at that as well.

03:38 Note here that 5473 is the value of the index. In this case, it’s a number, but indexes don’t have to be numeric. This can be a little confusing. If I tried df.title bracket zero, I’d get an error.

03:52 The value in the square brackets is the row’s index value, not the traditional Python counting index. You can see the DataFrame’s index using the index property.
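The difference between an index value and a position can be sketched like this. The data is an invented stand-in that reuses the 5473 index value mentioned above.

```python
import pandas as pd

# Hypothetical stand-in data; only the index value 5473 comes from the lesson.
df = pd.DataFrame(
    {"title": ["Frankenstein", "The Trial", "Dracula"]},
    index=[1023, 5473, 8841],
)

print(df.title[5473])  # looked up by index value

try:
    df.title[0]        # 0 is not one of the index values
except KeyError:
    print("df.title[0] raises KeyError")

print(df.index)        # the index values plus the type used to store them
```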

04:05 Two things of note here. An index is a pandas object itself, and when you show it in the REPL, it prints the index values for the DataFrame as well as the type used to store it. Like the index, there is also a columns attribute.

04:19 And the columns attribute itself is an index object. This time the values in the index are the column names and the data type is object. Note that the default storage for a string is a Python object, and since all of my column titles were strings, they’re Python objects to pandas. You can get at the parts of an index using square brackets.

04:45 The column name in position two, that’s the third one, is last_name. You can see the different types of the columns using the dtypes attribute.

04:56 Almost everything here is a Python object with the exception of the score, which it recognized as a float. Note that the birthdays were input as strings, so they’re stored as general Python objects.
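As an illustration of the columns and dtypes attributes, here is a sketch on an invented DataFrame whose column layout is only a guess at the lesson's. Note that newer pandas releases may report a dedicated string type rather than object for text columns.

```python
import pandas as pd

# Hypothetical stand-in data; column order and all values are assumptions.
df = pd.DataFrame(
    {
        "title": ["Frankenstein", "The Trial", "Dracula"],
        "first_name": ["Mary", "Franz", "Bram"],
        "last_name": ["Shelley", "Kafka", "Stoker"],
        "birth_date": ["1797-08-30", "1883-07-03", "1847-11-08"],
        "score": [4.2, 4.5, 4.0],
    },
    index=[1023, 5473, 8841],
)

print(df.columns)     # an Index object holding the column names
print(df.columns[2])  # last_name
print(df.dtypes)      # one type per column; score is a float
```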

05:09 This may not be ideal. You might want to be able to do date math on them. So pandas provides a conversion method.

05:21 The to_datetime() function in the pandas module can take a string or a Series and will attempt to convert it to a pandas Timestamp object.

05:30 This is pandas’ own type of date-time storage. If you give it a Series, it converts each value in the Series into a timestamp giving you a new Series. Lots can go wrong with this.

05:42 It attempts to auto-detect the format of the date, which can go awry. You can provide a format argument to be explicit about what is being parsed, or there are other arguments like dayfirst that hint that this is a date starting with the day as the first value, as is done in some cultures.

05:59 Secondly, if you’re operating on a Series, everything in the Series needs to be consistent. If you’ve got dirty data, problems can ensue.

06:08 You can specify different ways of handling errors with extra arguments to this call, but generally, you’re gonna have to clean the data. And the one that bit me in the butt when creating this course is the timestamp only supports a range of values.
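The arguments mentioned above might be used along these lines. This is a sketch with invented sample strings, and errors="coerce" is just one of the error-handling options.

```python
import pandas as pd

# Invented sample strings for illustration.
births = pd.Series(["1797-08-30", "1883-07-03", "1847-11-08"])

# Being explicit about the format avoids relying on auto-detection.
explicit = pd.to_datetime(births, format="%Y-%m-%d")

# dayfirst=True hints that the first number is the day: 3 July, not 7 March.
day_first = pd.to_datetime("03/07/1883", dayfirst=True)

# With dirty data, errors="coerce" turns unparseable values into NaT
# instead of raising an exception.
dirty = pd.Series(["1883-07-03", "not a date"])
coerced = pd.to_datetime(dirty, format="%Y-%m-%d", errors="coerce")

print(explicit.dt.year.tolist())  # [1797, 1883, 1847]
print(day_first.day, day_first.month)  # 3 7
print(coerced.isna().tolist())  # [False, True]
```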

06:22 Originally, I included Robinson Crusoe in my list of books, but Defoe’s birthday year is 1660, which is before 1677, which is the lower limit of a pandas Timestamp.

06:35 This doesn’t mean you can’t store a date like that, it just means it has to stay as a Python object. Converting it to a timestamp gains you lots of speed, but there are restrictions imposed to get there.
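You can check the limits of the nanosecond-based Timestamp yourself. The sketch below also shows an early date kept as a plain Python object; the 1 January date is a placeholder, not Defoe's actual birthday. Newer pandas versions support coarser resolutions that can represent a wider range, so the limit applies specifically to nanosecond storage.

```python
import datetime

import pandas as pd

print(pd.Timestamp.min)  # earliest nanosecond-resolution timestamp, in 1677
print(pd.Timestamp.max)  # latest nanosecond-resolution timestamp, in 2262

# A date outside that range can still be stored as a generic Python object.
# The month and day here are placeholders for illustration only.
early = pd.Series([datetime.date(1660, 1, 1)])
print(early.dtype)       # object
```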

06:47 Note that the to_datetime function is returning a new Series. It isn’t operating on the existing column. So to actually convert our column in the DataFrame, you need to reassign it.

07:05 All my caveats of dealing with dates notwithstanding, that’s pretty simple, right? You just assign the column a new Series and pandas takes care of it for you.
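Reassigning the column might look like the sketch below, using an invented stand-in DataFrame and an assumed column name of birth_date.

```python
import pandas as pd

# Hypothetical stand-in data; the birth_date column name is an assumption.
df = pd.DataFrame(
    {
        "last_name": ["Shelley", "Kafka", "Stoker"],
        "birth_date": ["1797-08-30", "1883-07-03", "1847-11-08"],
    },
    index=[1023, 5473, 8841],
)

# to_datetime() returns a new Series, so assign it back to the column.
df["birth_date"] = pd.to_datetime(df["birth_date"], format="%Y-%m-%d")

print(df.dtypes)                          # birth_date is now a datetime64 type
print(df["birth_date"].dt.year.tolist())  # [1797, 1883, 1847]
```

Once the column holds timestamps, date math such as the .dt.year lookup above becomes available.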

07:14 The resulting DataFrame looks the same, but when you look at the types,

07:21 the birth date is a datetime64 in nanosecond resolution instead of the generic object that it was before. Okay, I mentioned .loc[], let’s play with that for a moment.

07:35 .loc[] with an index value returns a single row. Passing an index value and a column to .loc[] gives a cell.

07:45 The second value to the brackets in .loc[] is the column. What if you want a whole column instead of just a cell? Well, to do that, you pass a colon as the value for the row.

07:56 Essentially it’s like an empty slice, and here you get the entire column as a Series. You can also use a colon for the column as well,

08:12 but that gives you the same thing as doing it without the colon altogether. At least that part’s a little consistent. The .loc[] attribute supports slicing.
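The basic .loc[] lookups described above can be sketched as follows, again on an invented stand-in DataFrame:

```python
import pandas as pd

# Hypothetical stand-in data; only the index value 5473 comes from the lesson.
df = pd.DataFrame(
    {
        "title": ["Frankenstein", "The Trial", "Dracula"],
        "last_name": ["Shelley", "Kafka", "Stoker"],
    },
    index=[1023, 5473, 8841],
)

row = df.loc[5473]             # a single row, returned as a Series
cell = df.loc[5473, "title"]   # row index value first, then the column
column = df.loc[:, "title"]    # a colon for the row gives the whole column
row_again = df.loc[5473, :]    # a colon for the column is the same as omitting it

print(cell)                    # The Trial
print(row.equals(row_again))   # True
```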

08:26 Remember though, that indices for the first value are the index values of the DataFrame. They may not even be numeric. Here I’ve used a slice containing two rows and a single column giving me a Series in return.

08:42 Without the column specifier, you get the whole set of values for the two rows that were sliced. The .loc[] mechanism even supports comparison operations.

08:57 This is all the rows where the last_name column is equal to Kafka. Of course, I’ve only got one in my data, but you get the idea. If instead of using the DataFrame’s index value, you wanna use a traditional Python index, you can do that with .iloc[].
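Slicing and comparisons with .loc[] might look like this. The stand-in data and its sorted index values are invented; note that a .loc[] slice uses index values and includes both endpoints.

```python
import pandas as pd

# Hypothetical stand-in data; only 5473 and the Kafka row echo the lesson.
df = pd.DataFrame(
    {
        "title": ["Frankenstein", "The Trial", "Dracula"],
        "last_name": ["Shelley", "Kafka", "Stoker"],
    },
    index=[1023, 5473, 8841],
)

titles = df.loc[1023:5473, "title"]         # two rows, one column: a Series
rows = df.loc[1023:5473]                    # no column given: whole rows
kafka = df.loc[df["last_name"] == "Kafka"]  # rows where the comparison is True

print(titles.tolist())       # ['Frankenstein', 'The Trial']
print(rows.shape)            # (2, 2)
print(kafka.index.tolist())  # [5473]
```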

09:15 .iloc[] works the same way as .loc[], except you give it Python style indices to specify a row or a row and column, or with slices.

09:31 Keep in mind when using .iloc[] that slices are inclusive-exclusive, so slicing zero to one selects the same row as just using zero.

09:44 That’s because a zero one slice is from zero to one, excluding the one.
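A short sketch of positional access with .iloc[], using invented stand-in data. One detail worth noticing: a zero-to-one slice returns a one-row DataFrame, while a plain zero returns that same row as a Series.

```python
import pandas as pd

# Hypothetical stand-in data; all values are invented.
df = pd.DataFrame(
    {
        "title": ["Frankenstein", "The Trial", "Dracula"],
        "last_name": ["Shelley", "Kafka", "Stoker"],
    },
    index=[1023, 5473, 8841],
)

first_row = df.iloc[0]        # first row by position, as a Series
first_cell = df.iloc[0, 0]    # first row, first column
one_row = df.iloc[0:1]        # slice excludes position 1: a one-row DataFrame
two_titles = df.iloc[0:2, 0]  # first two rows of the first column

print(first_cell)             # Frankenstein
print(len(one_row))           # 1
print(two_titles.tolist())    # ['Frankenstein', 'The Trial']
```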

09:52 Wow, that’s a lot of stuff. Next up, what you actually came here for. Grouping data in pandas.
