Python binning?

Perhaps someone can provide guidance on the best way to implement the binning described in "3.1 Dimensionality Reduction Via PAA" in the paper below. For example, starting with minute-level data, I'd like to create 15 minute wide bins, compute the average for each bin, and store the result in a vector. Additionally, it would be nice to store a corresponding datetime stamp centered on each bin, but this is not absolutely necessary. As an initial go, it would be fine to ignore the datetime stamps (as an approximation, ignore that the market is closed evenings/weekends/holidays).

Does Pandas have a convenient way of doing this, so that I could keep the data returned by the batch transform in its native format? Numpy/scipy? Other?

I'm interested in pattern recognition in time series, and as discussed in the paper, the binning is the first step in coding the time series.

Lin, Jessica, et al. "A symbolic representation of time series, with implications for streaming algorithms." Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery. ACM, 2003.
www.cs.ucr.edu/~stelo/papers/DMKD03.pdf
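For reference, the PAA step from section 3.1 of the paper can be sketched in plain NumPy (a minimal illustration with made-up data, not tied to Quantopian's API):

```python
import numpy as np

def paa(series, bin_width):
    """Piecewise Aggregate Approximation: mean of consecutive
    fixed-width bins. Trailing values that do not fill a whole
    bin are dropped in this sketch."""
    n_bins = len(series) // bin_width
    trimmed = np.asarray(series[: n_bins * bin_width], dtype=float)
    # Reshape to (n_bins, bin_width) and average across each bin.
    return trimmed.reshape(n_bins, bin_width).mean(axis=1)

# 30 minutes of fake minute data -> two 15-minute bin means
prices = np.arange(30, dtype=float)
print(paa(prices, 15))  # [ 7. 22.]
```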

10 responses

Might be some guidance here:

http://pandas.pydata.org/pandas-docs/dev/computation.html

Rolling statistics with a window could do the trick. I'll give it a try when I get the chance.

This can be done using pandas up- and downsampling

df.resample('15T', how='mean')  
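(In recent pandas the how= argument has been removed; the equivalent spelling is .resample('15min').mean(). A self-contained illustration with synthetic minute data, standing in for the prices a Quantopian algorithm would receive:)

```python
import numpy as np
import pandas as pd

# Synthetic minute-level prices; any Series/DataFrame with a
# DatetimeIndex resamples the same way.
idx = pd.date_range('2010-01-04 14:31', periods=30, freq='min')
prices = pd.Series(100 + 0.01 * np.arange(30), index=idx)

# 15-minute bin means; .resample(...).mean() replaces the old
# resample('15T', how='mean') spelling.
binned = prices.resample('15min').mean()
print(binned)
```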

Thanks Aidan,

To use the up- and down-sampling, it appears that I'd need to get the Quantopian data into a Pandas time series structure, correct? If you know how to whip together an example, I'd appreciate it. Otherwise, I'll fiddle around with it at some point.

Grant

The data passed to the batch transform is a pandas Series so you can simply call the resample method. Here's a quick backtest to demonstrate, be sure to run it on minutely data!

Thanks Aidan,

The resample method sorta works. For the attached backtest, I obtained this log output (I switched to a 5-minute period):

2010-01-04PRINT----- First datetime stamp -----
2010-01-04PRINT2010-01-04 14:31:00+00:00
2010-01-04PRINT----- 1 Minute Data -----
2010-01-04PRINT2010-01-04 14:31:00+00:00    112.415
2010-01-04 14:32:00+00:00    112.450
2010-01-04 14:33:00+00:00    112.451
2010-01-04 14:34:00+00:00    112.410
2010-01-04 14:35:00+00:00    112.430
Name: 8554, dtype: float64
2010-01-04PRINT----- 5 Minute Data -----
2010-01-04PRINT2010-01-04 14:30:00+00:00    112.4315
Freq: 5T, dtype: float64
2010-01-04PRINT----- Current datetime -----
2010-01-04PRINT2010-01-04 21:00:00+00:00
2010-01-04PRINT----- Downsampled data -----
2010-01-04PRINT2010-01-04 14:30:00+00:00    112.4315

First, when I manually take the mean of the first five minutes, I obtain 112.4312. The resample method gives 112.4315. Why the difference?

Also, the time stamp for the first entry in the downsampled data is 14:30. Shouldn't it be 14:31?

Grant

First, when I manually take the mean of the first five minutes, I obtain 112.4312. The resample method gives 112.4315. Why the difference?

This is due to how the resample method closes the bin interval; you can set the closed keyword to 'left' or 'right'. The difference is shown in the output of the attached backtest:

2010-01-04PRINT----- 5 Minute Data Closed left -----  
2010-01-04PRINT2010-01-04 14:30:00+00:00    112.4315  
Freq: 5T, dtype: float64  
2010-01-04PRINT----- 5 Minute Data NumPy Mean [0:4] -----  
2010-01-04PRINT112.4315  
2010-01-04PRINT----- 5 Minute Data Close right -----  
2010-01-04PRINT2010-01-04 14:30:00+00:00    112.4312  
Freq: 5T, dtype: float64  
2010-01-04PRINT----- 5 Minute Data NumPy Mean [0:5] -----  
2010-01-04PRINT112.4312  
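The same comparison can be reproduced in modern pandas on a toy series (a sketch using the five minute-bars from the log above; closed= controls which bin edge is inclusive, and .mean() replaces the old how='mean' spelling):

```python
import pandas as pd

idx = pd.date_range('2010-01-04 14:31', periods=5, freq='min')
s = pd.Series([112.415, 112.450, 112.451, 112.410, 112.430], index=idx)

# closed='left': bin [14:30, 14:35) -> only the first four minutes
left = s.resample('5min', closed='left').mean()
# closed='right': bin (14:30, 14:35] -> all five minutes
right = s.resample('5min', closed='right').mean()
print(left.iloc[0], right.iloc[0])
```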

Also, the time stamp for the first entry in the downsampled data is 14:30. Shouldn't it be 14:31?

For frequencies that evenly subdivide one day, resample by default aligns the bin edges to midnight. This behaviour can be changed with the base keyword; for a '5min' frequency, base can be 0 through 4. In your case, setting base=1 will start the first bin at 14:31.

2010-01-04PRINT----- 5 Minute Data with Base Offset -----  
2010-01-04PRINT2010-01-04 14:31:00+00:00    112.4312  
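(Note that in recent pandas the base= keyword has been removed; the same alignment is now expressed with offset=. A sketch on the same five minute-bars:)

```python
import pandas as pd

idx = pd.date_range('2010-01-04 14:31', periods=5, freq='min')
s = pd.Series([112.415, 112.450, 112.451, 112.410, 112.430], index=idx)

# Shift the bin edges by one minute so the first bin starts at 14:31.
# offset='1min' replaces the deprecated base=1 argument.
shifted = s.resample('5min', offset='1min').mean()
print(shifted)
```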

Thanks...Grant

Hi Aidan,

I added a line:

print '----- 5 Minute Rolling Mean -----'  
print pd.rolling_mean(data.price[sid],5)  

This may be what I'm looking for, rather than resample. Here's the output:

2010-01-04PRINT----- 5 Minute Rolling Mean -----  
2010-01-04PRINT2010-01-04 14:31:00+00:00         NaN
2010-01-04 14:32:00+00:00         NaN
2010-01-04 14:33:00+00:00         NaN
2010-01-04 14:34:00+00:00         NaN
2010-01-04 14:35:00+00:00    112.4312
2010-01-04 14:36:00+00:00    112.4642
2010-01-04 14:37:00+00:00    112.4982
2010-01-04 14:38:00+00:00    112.5400  

Seems more appropriate for a streaming algorithm, since the datetime stamps automatically correspond to the instant the information is available.
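(A side note for anyone reading this later: pd.rolling_mean was removed in newer pandas; the same computation is spelled with the .rolling() method. A sketch with made-up prices whose first window matches the log above:)

```python
import pandas as pd

idx = pd.date_range('2010-01-04 14:31', periods=8, freq='min')
s = pd.Series([112.415, 112.450, 112.451, 112.410, 112.430,
               112.490, 112.560, 112.770], index=idx)

# Modern replacement for pd.rolling_mean(s, 5): NaN until the
# window fills, then the trailing 5-minute mean at each minute.
roll = s.rolling(window=5).mean()
print(roll)
```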

Grant

You might need to be a little careful here, as rolling_mean and resample give different answers. rolling_mean with window=5 takes the mean of the last 5 values at each timestamp. resample slices the data into 5-period chunks and calculates the mean of each chunk. From the application description in the first post, I thought resample was more appropriate :)

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(range(10), index=pd.date_range('2013-1-1', periods=10), columns=['A'])

In [3]: df  
Out[3]:  
            A  
2013-01-01  0  
2013-01-02  1  
2013-01-03  2  
2013-01-04  3  
2013-01-05  4  
2013-01-06  5  
2013-01-07  6  
2013-01-08  7  
2013-01-09  8  
2013-01-10  9

In [4]: df.resample('2D', how='mean')  
Out[4]:  
              A  
2013-01-01  0.5  
2013-01-03  2.5  
2013-01-05  4.5  
2013-01-07  6.5  
2013-01-09  8.5

In [5]: pd.rolling_mean(df, 2)  
Out[5]:  
              A  
2013-01-01  NaN  
2013-01-02  0.5  
2013-01-03  1.5  
2013-01-04  2.5  
2013-01-05  3.5  
2013-01-06  4.5  
2013-01-07  5.5  
2013-01-08  6.5  
2013-01-09  7.5  
2013-01-10  8.5  

Thanks Aidan,

You're correct that the resample method is directly appropriate for the application I describe above. As your example shows, rolling_mean provides the same binned data as resample (0.5, 2.5, 4.5, ...), interspersed with intermediate values (1.5, 3.5, 5.5, ...). I need to do a little homework, but I think that the statistics should work out the same for the algorithm I have in mind.
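One way to check that the statistics line up: taking every k-th value of the rolling mean, starting where the first window fills, reproduces the non-overlapping resample bins exactly. A small sketch using the same toy data as the example above:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2013-01-01', periods=10, freq='D')
df = pd.DataFrame({'A': range(10)}, index=idx)

binned = df['A'].resample('2D').mean()   # 0.5, 2.5, 4.5, ...
rolled = df['A'].rolling(2).mean()       # NaN, 0.5, 1.5, 2.5, ...

# Every 2nd rolling value, from the first full window onward,
# equals the non-overlapping bin means from resample.
subsampled = rolled.iloc[1::2]
print(np.allclose(subsampled.values, binned.values))  # True
```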

Grant