Doing Rolling-Window Analysis
Note: Though you can’t see it onscreen, calling temp.rolling(window=3, center=True).mean() at this point in the video will actually return two NaN values, with the second one being the final value. The final value won’t have another value following it, and so there’s not enough data to compute the mean for that time.
00:00 Another type of operation that you may want to do with time-series data is called rolling-window analysis. Now, one of the reasons why you may want to do something like this is when you’ve got data that varies greatly in very small time intervals and you want a way to smooth out the data.
00:17 So, a common application of this is when you’ve got, say, stock prices. The data, as you know, varies quite a bit, even in very small time intervals, and if you want to get sort of a smoothed-out version of the data you may want to do what’s called rolling-window analysis.
00:34 So, the method on the DataFrame for this is called .rolling(), and we need to specify the width of the window that we want to perform the aggregate function on. In this case, the aggregate function is going to be the mean.
00:52 Let me set the keyword argument window to 3, and then the aggregate function that we’re going to use is called .mean().
01:00 Or again, you could use .min() or .max(), just depending on what makes more sense for your application. So let me run that and then let me explain what we get.
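As a minimal sketch of the call described above, the snippet below builds a small hourly temperature DataFrame and applies .rolling() with each of the three aggregate functions. The column name temp_c, the dates, and the temperature values are assumptions made for illustration (chosen so the first full window averages to 7.3); they are not taken from the video's dataset.

```python
import pandas as pd

# Hypothetical hourly temperatures in degrees Celsius (illustrative values only).
index = pd.date_range("2019-10-27 00:00", periods=6, freq=pd.Timedelta(hours=1))
temp = pd.DataFrame({"temp_c": [8.0, 7.1, 6.8, 6.2, 5.9, 5.4]}, index=index)

# Each result row aggregates the current row and the two rows before it.
rolling_mean = temp.rolling(window=3).mean()
rolling_min = temp.rolling(window=3).min()
rolling_max = temp.rolling(window=3).max()

print(rolling_mean)
```

The first two rows of each result are NaN because a full window of three values isn't available yet.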
01:10 Let’s focus on the 2:00 value that we get of 7.3. So, we specified a window of 3. What happens here is that the 2:00 value, the value of 7.3, is computed from the temperature at 2:00 and at the two previous times, and so the 2:00 value is the right end point of the window.
01:35 We take those first three values, compute the mean, and we get 7.3. If you want to see this explicitly, let me just comment this out, and let me call .head() and say just the first 3.
01:50 If we average out the temperatures at 12:00 in the morning, 1:00 in the morning, and 2:00 in the morning, we get the 7.3
02:00 that we had over here. And then to calculate the value at 3:00, we take from the original time-series data the temperature at 1:00 in the morning, 2:00 in the morning, and 3:00 in the morning, compute the mean, and in that case, we get 6.7.
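The manual check described here can be sketched as follows: average the first three rows by hand with .head(3), then average rows one through three, and compare both against what .rolling() produces. The temperature values and the temp_c column name are illustrative assumptions, picked so the two means match the 7.3 and 6.7 mentioned in the video.

```python
import pandas as pd

# Hypothetical hourly temperatures (illustrative values only).
temp = pd.DataFrame(
    {"temp_c": [8.0, 7.1, 6.8, 6.2]},
    index=pd.date_range("2019-10-27 00:00", periods=4, freq=pd.Timedelta(hours=1)),
)

rolled = temp["temp_c"].rolling(window=3).mean()

# The window ending at 2:00 covers 12:00, 1:00, and 2:00.
manual_2am = temp["temp_c"].head(3).mean()
# The window ending at 3:00 covers 1:00, 2:00, and 3:00.
manual_3am = temp["temp_c"].iloc[1:4].mean()

print(round(manual_2am, 1), round(rolled.iloc[2], 1))
print(round(manual_3am, 1), round(rolled.iloc[3], 1))
```

Each printed pair should agree, which shows that every rolling value is the mean of the window that ends at that row.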
02:17 So, the value at any given time is computed with that time as the right end point of, in this case, a window of size 3. Now, maybe it makes sense why we’re going to get NaN values for the window value at 12:00 in the morning and at 1:00 in the morning: at 12:00 in the morning, there aren’t two values before 12:00 in the morning in the data, and so there is no computation to be done.
02:41 And then likewise at 1:00 in the morning, if that’s the right end point of the window, we’ve only got one value before it, and so we don’t have enough data points. Now, an alternative is the keyword argument called center. The default value is False, and you can pass in a value of True.
03:02 What this will do is instead of taking the data point where we’re going to compute a value as the right end point of the window, it’s going to be the center of the window. And so in this case, if we run this code,
03:17 we get only one NaN value. For the value at 1:00 in the morning, the 1:00 value of the original data is the middle value, and so the 12:00, 1:00, and 2:00 values are used to compute the mean at 1:00 in the morning, which gives us 7.3. And then the only NaN value in this case is going to be at 12:00 in the morning, because even though we do have a value at 12:00 and at 1:00, we don’t have a value before, and so in that case, we get a NaN value.
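A short sketch of the centered window follows, again using assumed temperature values and an assumed temp_c column name rather than the video's actual data. As the note at the top of this transcript points out, the last row also ends up as NaN, because a centered window of 3 needs one value after the current row as well as one before it.

```python
import pandas as pd

# Hypothetical hourly temperatures (illustrative values only).
index = pd.date_range("2019-10-27 00:00", periods=6, freq=pd.Timedelta(hours=1))
temp = pd.DataFrame({"temp_c": [8.0, 7.1, 6.8, 6.2, 5.9, 5.4]}, index=index)

# With center=True, each row is the middle of its window:
# one row before, the row itself, and one row after.
centered = temp.rolling(window=3, center=True).mean()

# Flags which rows could not be computed for lack of a neighbor.
missing = centered["temp_c"].isna()

print(centered)
print(missing.tolist())
```

With this sample data, the 7.3 that appeared at 2:00 with the default trailing window now appears at 1:00, and the missing values sit at both ends of the series.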
03:48 So again, a reason why you may want to do rolling-window analysis and, for example, use the .mean() function as the aggregate function is to smooth out the data.
03:58 So if you’ve got stock prices that vary wildly in a very short amount of time, a way to smooth out the data, or even, say, with a frequency of days… In a given day, from one day to the other, stock prices vary wildly.
04:12 If you want to smooth this out, you can do a rolling-window analysis where the aggregate function is the .mean() function. All right! So those are just a couple of the methods that you can use on time-series data.
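To make the smoothing effect concrete, here is a small sketch with made-up daily prices: a slow upward trend with a day-to-day zigzag layered on top. The prices, the dates, and the window size of 4 are all assumptions for illustration, not real market data. Comparing the largest day-to-day change before and after the rolling mean gives a rough sense of how much the zigzag is damped.

```python
import pandas as pd

# Hypothetical daily closing prices: a gentle trend plus an alternating swing.
prices = pd.Series(
    [100.0 + 0.5 * day + (3.0 if day % 2 else -3.0) for day in range(10)],
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)

# A window of 4 spans two up days and two down days, so the swings cancel out.
smoothed = prices.rolling(window=4).mean()

# Largest absolute day-to-day change in the raw and smoothed series.
raw_swing = prices.diff().abs().max()
smooth_swing = smoothed.diff().abs().max()

print(raw_swing, smooth_swing)
```

In this contrived case the smoothed series moves by far less from one day to the next than the raw prices do. Real data won't cancel this cleanly, but the same idea applies.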
04:25 Now let’s take a look at the many ways that we can visualize data in a pandas DataFrame.