Basic Pandas Data Structures

In this lesson you’ll get an introduction to Pandas’ basic data structures: Series and DataFrame. This video focuses on the Series data structure.

00:01 We need to create a new Notebook. Since we’re going to be using stock data here, we’re going to call it Stocks. I’m going to simply do something like this.

00:11 So, we’re going to simply start off by calling our first portion here, we’ll call it # Pandas. We’ll just have it be a playground for the Pandas data points here.

00:21 Let’s start by importing a few of the data structures that Pandas includes. So, from pandas import DataFrame, Series. We’re going to first begin with a Series object, which is pretty awesome.

00:39 It’s particularly well suited to handling time-indexed data points. So let’s say, yesterday you had five apples, today you have four, tomorrow you have two.

00:48 That’s the type of thing that the Series object is really good at handling. I’m going to start off with the “Hello, World!” version of Series objects.

00:55 What you need to do is pass it some values, so these would be your apples. Okay, so let’s say you had [1, 2, 3, 4], and then you also could pass an optional index, which would be something like this, which you’d use to label each value.
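A minimal sketch of the step just described, written for Python 3. The index labels are placeholders chosen for illustration, since the transcript does not record the ones typed in the video.

```python
from pandas import Series

# The values are the data points (the "apples").
# The index labels are hypothetical; any list of the same length works.
s = Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

print(s)       # each label next to its value, followed by the dtype
print(s["b"])  # look up a single value by its label
```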

01:09 As I said before, these are most useful when they are dates or timestamps or something that’s happening over a period of time, which is the real power of Series.

01:19 Something like stock data, as you’ll see later in another example. But what you can see here is that…

01:28 Oh, I missed a comma there. What you’ll see here is that

01:34 Pandas takes that and prints a representation of the data: each index label next to its value, followed by the dtype, which here is int64.

01:48 A Series only has one set of values, and here those values are int64. But if I were to go on ahead and make these all floating points,

02:02 I believe they’d be float64. So, another thing you can do here is you can go s.index, and that’ll give you the index labels. As you can see, their dtype is object: they’re strings saying what the index is.

02:18 You can do all kinds of other things, which you can find in the documentation. So you can take the mean of that, which ends up being the sum of all of them divided by the length.
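The dtype behavior described here can be checked with a short sketch. The labels are again hypothetical stand-ins for whatever was typed in the video.

```python
from pandas import Series

labels = ["a", "b", "c", "d"]  # hypothetical index labels

ints = Series([1, 2, 3, 4], index=labels)
floats = Series([1.0, 2.0, 3.0, 4.0], index=labels)

print(ints.dtype)    # dtype inferred from integer values
print(floats.dtype)  # dtype inferred from floating-point values
print(ints.index)    # the index labels of the Series
```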
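As a quick check of that statement, the sketch below compares `.mean()` with the sum divided by the length for the same four values.

```python
from pandas import Series

s = Series([1, 2, 3, 4])

# The mean computed by hand: sum of the values divided by their count.
manual_mean = s.sum() / len(s)

print(s.mean(), manual_mean)
```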

02:28 And there’s a bunch of other options that you can do with Series data with Pandas. Next, I’m going to show you an example with some time stamps over time and other things you can do with Series.

02:38 The very first thing we’ll need is some random data, so we’re going to go import random. Then we’re going to go do data = [random.randint(0, 10000) for x in xrange(10000)], that is, a random integer between 0 and 10000 for each of 10000 samples.

03:04 We’re then going to go provide an index. That index will be a DatetimeIndex. It starts on January 1st, 2013. The periods, that’s the number of samples we take, is equal to the length of data. And how frequently they’re sampled is given by freq (frequency).

03:32 We can then go something like this. We’re going to go minutely. We’re then going to go s = Series(data, index=index). So, what we really did was first,

03:46 create some data.

03:50 Then what we did was create a DatetimeIndex

03:59 and provide start

04:03 and freq. So, this is a minutely frequency, so when we look at our object here, what we should see is the first minute in January 1st, 2013, second minute in January 1st, 2013, and so on and so forth.

04:19 So as you can see here, we have 10,000 things. Frequency is minutely, of type int64. So, that’s how the Series objects look. You can do a bunch of things like .tail() once you’re dealing with a lot of data.
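For readers on Python 3 and a recent pandas, the same construction can be sketched as follows. This is not the code from the video: it uses `range` instead of `xrange`, and `date_range` to build the index, as the comments below suggest. The `"min"` frequency alias is used here for a minutely frequency.

```python
import random

from pandas import Series, date_range

# Create some random data: 10,000 integers between 0 and 10000.
data = [random.randint(0, 10000) for _ in range(10000)]

# Create a minutely datetime index starting on January 1st, 2013,
# with one timestamp per data point.
index = date_range(start="2013-01-01", periods=len(data), freq="min")

s = Series(data, index=index)
print(s)
```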

04:34 By default it shows the last five rows, but you can provide a number here, like 10. .head() is the opposite: it’ll give you the first ten, like so. The really cool thing that you can do, though, is seeing that we have s now here… We’ll call that s, we’ll evaluate that out.

04:52 All right. I believe that’s the case. Next, we’ll go s_daily = s.resample(), resample that at a daily frequency. So what that ends up doing, it ends up resampling all the data points in the Series and gives you one value for each of the days the data spans.
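A small sketch of `.head()` and `.tail()` on a minutely Series. The values here are a simple counter rather than random numbers, an assumption made only so the output is easy to read.

```python
from pandas import Series, date_range

index = date_range(start="2013-01-01", periods=10000, freq="min")
s = Series(range(10000), index=index)  # stand-in values 0..9999

print(s.head())    # first five rows by default
print(s.tail(10))  # last ten rows
```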

05:16 So according to this, it takes over 10,000 samplings minutely, it gives us about seven days of data. And as you can see here, the totals for each of those days, they’re added together.

05:27 So that gives you an easy way of going from a very high frequency to a much lower frequency. If you go the other way, from a low frequency to a much higher frequency, you can fill forward or fill back to carry values across the gaps. That’s generally how you’d use Series objects, and that’s where really their power lies. Next, let’s go into DataFrames.

05:47 In the previous example, I said this was calculating the sums. This is incorrect. It is actually calculating the means. In order to calculate the sums, you need to pass a how argument to .resample(), which will then resample and sum the daily values.

06:02 So, it’ll sum all the values that are in the particular day here. So when we run this again, the numbers are much larger. It makes much more sense.
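Going the other way, from a lower frequency to a higher one, can be sketched like this. The three values and the 10-minute spacing are made up for illustration; the sketch assumes a recent pandas where the fill is chosen by calling `.ffill()` or `.bfill()` on the result of `.resample()`.

```python
from pandas import Series, date_range

# A coarse Series: one hypothetical reading every 10 minutes.
index = date_range(start="2013-01-01", periods=3, freq="10min")
coarse = Series([5, 4, 2], index=index)

# Upsample to one row per minute.
forward = coarse.resample("min").ffill()   # carry the last known value forward
backward = coarse.resample("min").bfill()  # fill each gap with the next known value

print(forward)
print(backward)
```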
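In recent pandas releases, `.resample()` on its own returns an intermediate resampler object, and the aggregation is chosen by a method call rather than a `how` argument. A sketch of the daily mean and daily sum under that assumption (Python 3, recent pandas, not the code from the video):

```python
import random

from pandas import Series, date_range

data = [random.randint(0, 10000) for _ in range(10000)]
index = date_range(start="2013-01-01", periods=len(data), freq="min")
s = Series(data, index=index)

# Downsample the minutely data to one row per day.
daily_mean = s.resample("D").mean()  # average value within each day
daily_sum = s.resample("D").sum()    # total of all values within each day

print(daily_mean)
print(daily_sum)
```

Since 10,000 minutes is just under seven days, both results have seven rows, and the last day is only partly filled.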

Bill Sewell on March 14, 2019

xrange threw me an error (name xrange is not defined), but range worked. Why would it work in your example but not mine?

Dan Bader RP Team on March 16, 2019

@Bill: That’s because this video series uses Python 2.x, and xrange is no longer available in Python 3.x, where you’d use the range function. Some more info here.

Sciencificity on March 16, 2019

Hi Dan, With Python 2.x not being supported from 2020 shouldn’t all the video tutorials on RP be for Python 3.x now? That would be really appreciated - being someone new to python I googled this to figure out why mine did not work, but my expectation going in was that I would be watching an up-to-date pandas tutorial, and then I got disappointed when I figured out it was 2.x being used, hence the error. I had similar issues with the stocks data pull and vincent (the last exercise in this tutorial) and that led me to give up and not bother googling further - and I would have really loved to complete that exercise! Thanks.

Dan Bader RP Team on March 16, 2019

shouldn’t all the video tutorials on RP be for Python 3.x now?

I agree and our upcoming tutorials will all use Python 3 :) That said, I think there’s still a benefit to having some Python 2.x specific content available, but it needs a better disclaimer at the start of the course. I’ll work on adding those!

Pucho on Dec. 7, 2019

Hi there,

Just for other people using python3.

As mentioned above, replace xrange with range. DatetimeIndex has been deprecated in favor of date_range.

import random
from pandas import Series, date_range

# Create some random data
data = [random.randint(0,10000) for x in range (10000)]
# Create datetime Index, providing start and freq
index = date_range(start='01-01-2013', periods=len(data), freq='T')
s = Series(data, index=index)


fd on Jan. 3, 2020

as s_daily i receive just: <pandas.core.resample.DatetimeIndexResampler object at 0x11d521160> ..please advise thanks ;-)

fd on Jan. 3, 2020

problem solved, thanks

Richard Obermeier on April 26, 2020

First off here is a working snippet that works with current pieces of SW (python3, pandas):

import pandas as pd
import numpy as np
import random
from pandas import DataFrame, Series
from pandas import date_range

data = [random.randint(0,10000) for x in range(10000)]
index = date_range(start='01-01-2013', periods=len(data), freq='T')
s = Series(data, index=index)

Some more points I wanted to make

  • I doubt that the value add of a not-working tutorial compensates for the frustration and lost time of searching why it is not working
  • the expectation when coming from the other high-quality Real Python tutorials (up until now I was a big fan of it) is different. You are damaging this reputation.
  • you should at least consider putting a warning at the beginning of the tutorial and put a simple transcript there that works with the current version

dg73 on May 11, 2020

agree with Richard. Am new to Python and making basic mistakes at the best of times. To have to figure out what the 3 equiv of 2 is adds to the frustration. And for it not to be made clear in advance that this is for 2 is even worse. I paid the membership to save me time, not add to the aggro. Still trying to figure out pandas …
