Missing Data?

I know not every stock has data for every minute of every day but this is starting to look like a gap in the data. Is there any reason for this of which I'm unaware?

SID(24)

2013-01-03PRINT2013-01-03 18:36:00+00:00 , 547.69  
2013-01-03PRINT2013-01-03 18:37:00+00:00 , 547.64  
2013-01-03PRINT2013-01-03 18:52:00+00:00 , 547.72  
2013-01-03PRINT2013-01-03 18:53:00+00:00 , 547.6  

SID(114)

2013-01-03PRINT2013-01-03 18:36:00+00:00 , 38.06  
2013-01-03PRINT2013-01-03 18:37:00+00:00 , 38.05  
2013-01-03PRINT2013-01-03 18:52:00+00:00 , 38.06  
2013-01-03PRINT2013-01-03 18:53:00+00:00 , 38.05  

My aggregating minute prices into 30-minute bars was coming along quite well until this happened....
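
For reference, this kind of aggregation can be sketched in plain pandas (not the Quantopian API). The sketch below is only an illustration under stated assumptions: the prices are synthetic, the helper name `to_30min_bars` is hypothetical, the session is assumed to run from 14:31 to 21:00 UTC, and each minute bar is assumed to be stamped at the end of its minute. It also counts how many minutes went into each bar, which makes a gap like the one above visible.

```python
import numpy as np
import pandas as pd

def to_30min_bars(minute_prices):
    """Aggregate a minute-indexed price Series into 30-minute OHLC bars.

    Hypothetical helper for illustration; not part of any Quantopian API.
    """
    # Minute bars are assumed to be stamped at the END of the minute,
    # so each bin is closed and labelled on its right edge.
    grouped = minute_prices.resample("30min", closed="right", label="right")
    bars = grouped.ohlc()
    # Record how many minute prices fed each bar; a full bar has 30.
    bars["minutes"] = grouped.count()
    # Drop bins that received no data at all.
    return bars[bars["minutes"] > 0]

# Synthetic session (assumed 14:31 to 21:00 UTC, 390 minutes).
session = pd.date_range("2013-01-03 14:31", periods=390, freq="min", tz="UTC")
prices = pd.Series(np.linspace(547.0, 542.0, 390), index=session)

# Remove 18:38 to 18:51 to mimic the 14 missing minutes shown above.
gap = pd.date_range("2013-01-03 18:38", "2013-01-03 18:51", freq="min", tz="UTC")
prices = prices.drop(gap)

bars = to_30min_bars(prices)
print(bars[["close", "minutes"]])
```

With these assumptions the session still yields 13 bars, but the bar labelled 19:00 is built from 16 minutes instead of 30.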

P.

5 responses

Hello Quantopian,

I've been using batch_transform in a minutely algo and I've wasted HOURS debugging it. The output below may hint at the problem. It's....err....proper wack, as the kiddies say. I believe it is caused by the missing 14 minutes detailed in the post above, combined with poor handling of missing data in batch_transform.

P.

Code:

def initialize(context):  
    context.test_sid=sid(24)  
    context.error = False

def handle_data(context, data):  
    prices=get_data(data, context)  
    if prices is None:  
        return  
    if len(prices) < 780 and not context.error:  
        print "In hamdle_data length of prices is " + str(len(prices))  
        print prices.ix[0]  
        print prices.ix[389]  
        print prices.ix[765]    # Any higher is out of bounds  
        context.error = True  
    elif len(prices) == 780:  
        print "In hamdle_data length of prices is " + str(len(prices))  
        print prices.ix[0]  
        print prices.ix[389]  
        print prices.ix[779]  

@batch_transform(window_length=2, refresh_period=0)  
def get_data(datapanel, context):  
    prices = datapanel['price']  
    if not context.error:  
        print "In batch_transform length of prices is " + str(len(prices))  
    return prices  

Output:

2013-01-03PRINTIn batch_transform length of prices is 766  
2013-01-03PRINTIn hamdle_data length of prices is 766  
2013-01-03PRINT24 552.97 Name: 2013-01-02 14:31:00+00:00, dtype: float64  
2013-01-03PRINT24 548.71 Name: 2013-01-02 21:00:00+00:00, dtype: float64  
2013-01-03PRINT24 542.38 Name: 2013-01-03 21:00:00+00:00, dtype: float64  
REM ** The three lines above are quite different from the three lines below. I think this is **  
REM ** because, over the 14 minutes in between, the DataFrame returned by batch_transform **  
REM ** grows from 766 to 780 rows. This is why the first two lines are the same but the **  
REM ** third one changes from 21:00:00 to 14:44:00, i.e. 14 data minutes later. **  
2013-01-04PRINTIn hamdle_data length of prices is 780  
2013-01-04PRINT24 552.97 Name: 2013-01-02 14:31:00+00:00, dtype: float64  
2013-01-04PRINT24 548.71 Name: 2013-01-02 21:00:00+00:00, dtype: float64  
2013-01-04PRINT24 536.75 Name: 2013-01-04 14:44:00+00:00, dtype: float64  
2013-01-04PRINTIn hamdle_data length of prices is 780  
2013-01-04PRINT24 551.09 Name: 2013-01-02 14:32:00+00:00, dtype: float64  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT24 537.14 Name: 2013-01-04 14:45:00+00:00, dtype: float64  
2013-01-04PRINTIn hamdle_data length of prices is 780  
2013-01-04PRINT24 551.6 Name: 2013-01-02 14:33:00+00:00, dtype: float64  
2013-01-04PRINT24 545.33 Name: 2013-01-03 14:32:00+00:00, dtype: float64  
2013-01-04PRINT24 536.87 Name: 2013-01-04 14:46:00+00:00, dtype: float64  
2013-01-04PRINTIn hamdle_data length of prices is 780  
2013-01-04PRINT24 552.07 Name: 2013-01-02 14:34:00+00:00, dtype: float64  
2013-01-04PRINT24 545.56 Name: 2013-01-03 14:33:00+00:00, dtype: float64  
2013-01-04PRINT24 536.96 Name: 2013-01-04 14:47:00+00:00, dtype: float64  
2013-01-04PRINTIn hamdle_data length of prices is 780  
2013-01-04PRINT24 553.5 Name: 2013-01-02 14:35:00+00:00, dtype: float64  
2013-01-04PRINT24 544.7401 Name: 2013-01-03 14:34:00+00:00, dtype: float64  
2013-01-04PRINT24 537 Name: 2013-01-04 14:48:00+00:00, dtype: float64  
2013-01-04PRINTIn hamdle_data length of prices is 780  
2013-01-04PRINT24 554.09 Name: 2013-01-02 14:36:00+00:00, dtype: float64  
2013-01-04PRINT24 545.11 Name: 2013-01-03 14:35:00+00:00, dtype: float64  
2013-01-04PRINT24 536.22 Name: 2013-01-04 14:49:00+00:00, dtype: float64  
2013-01-04PRINTIn hamdle_data length of prices is 780  
2013-01-04PRINT24 553.61 Name: 2013-01-02 14:37:00+00:00, dtype: float64  
2013-01-04PRINT24 543.85 Name: 2013-01-03 14:36:00+00:00, dtype: float64  
2013-01-04PRINT24 536.322 Name: 2013-01-04 14:50:00+00:00, dtype: float64  
2013-01-04PRINTIn hamdle_data length of prices is 780  
2013-01-04PRINT24 553.26 Name: 2013-01-02 14:38:00+00:00, dtype: float64  
2013-01-04PRINT24 544.29 Name: 2013-01-03 14:37:00+00:00, dtype: float64  
2013-01-04PRINT24 535.79 Name: 2013-01-04 14:51:00+00:00, dtype: float64  
2013-01-04PRINTIn hamdle_data length of prices is 780  
2013-01-04PRINT24 552.76 Name: 2013-01-02 14:39:00+00:00, dtype: float64  
2013-01-04PRINT24 544.87 Name: 2013-01-03 14:38:00+00:00, dtype: float64  
2013-01-04PRINT24 536.19 Name: 2013-01-04 14:52:00+00:00, dtype: float64  
2013-01-04PRINTIn hamdle_data length of prices is 780  
2013-01-04PRINT24 553 Name: 2013-01-02 14:40:00+00:00, dtype: float64  
2013-01-04PRINT24 544.625 Name: 2013-01-03 14:39:00+00:00, dtype: float64  
2013-01-04PRINT24 535.37 Name: 2013-01-04 14:53:00+00:00, dtype: float64  
2013-01-04PRINTIn hamdle_data length of prices is 780  
2013-01-04PRINT24 552.59 Name: 2013-01-02 14:41:00+00:00, dtype: float64  
2013-01-04WARNLogging limit exceeded; some messages discarded  

Obviously that is a bit confusing if you haven't - like me - spent about 12 hours investigating the data. It should look like this, taken from a run over a period with no gap:

P.

2013-02-04PRINTIn batch_transform length of prices is 780  
2013-02-04PRINTIn hamdle_data length of prices is 780  
2013-02-04PRINT24 458.508 Name: 2013-02-01 14:31:00+00:00, dtype: float64  
2013-02-04PRINT24 453.76 Name: 2013-02-01 21:00:00+00:00, dtype: float64  
2013-02-04PRINT24 442.22 Name: 2013-02-04 21:00:00+00:00, dtype: float64  
2013-02-05PRINTIn batch_transform length of prices is 780  
2013-02-05PRINTIn hamdle_data length of prices is 780  
2013-02-05PRINT24 457.12 Name: 2013-02-01 14:32:00+00:00, dtype: float64  
2013-02-05PRINT24 454.2 Name: 2013-02-04 14:31:00+00:00, dtype: float64  
2013-02-05PRINT24 444.03 Name: 2013-02-05 14:31:00+00:00, dtype: float64  
2013-02-05PRINTIn batch_transform length of prices is 780  
2013-02-05PRINTIn hamdle_data length of prices is 780  
2013-02-05PRINT24 458.39 Name: 2013-02-01 14:33:00+00:00, dtype: float64  
2013-02-05PRINT24 455.36 Name: 2013-02-04 14:32:00+00:00, dtype: float64  
2013-02-05PRINT24 442.78 Name: 2013-02-05 14:32:00+00:00, dtype: float64  
2013-02-05PRINTIn batch_transform length of prices is 780  
2013-02-05PRINTIn hamdle_data length of prices is 780  
2013-02-05PRINT24 457.74 Name: 2013-02-01 14:34:00+00:00, dtype: float64  
2013-02-05PRINT24 454.8 Name: 2013-02-04 14:33:00+00:00, dtype: float64  
2013-02-05PRINT24 443.37 Name: 2013-02-05 14:33:00+00:00, dtype: float64  
2013-02-05PRINTIn batch_transform length of prices is 780  
2013-02-05PRINTIn hamdle_data length of prices is 780  
2013-02-05PRINT24 456.87 Name: 2013-02-01 14:35:00+00:00, dtype: float64  
2013-02-05PRINT24 454.95 Name: 2013-02-04 14:34:00+00:00, dtype: float64  
2013-02-05PRINT24 445.06 Name: 2013-02-05 14:34:00+00:00, dtype: float64  
2013-02-05PRINTIn batch_transform length of prices is 780  
2013-02-05PRINTIn hamdle_data length of prices is 780  
2013-02-05PRINT24 456.91 Name: 2013-02-01 14:36:00+00:00, dtype: float64  
2013-02-05PRINT24 455.3 Name: 2013-02-04 14:35:00+00:00, dtype: float64  
2013-02-05PRINT24 446.05 Name: 2013-02-05 14:35:00+00:00, dtype: float64  
2013-02-05PRINTIn batch_transform length of prices is 780  
2013-02-05PRINTIn hamdle_data length of prices is 780  
2013-02-05PRINT24 455.77 Name: 2013-02-01 14:37:00+00:00, dtype: float64  
2013-02-05PRINT24 453.9 Name: 2013-02-04 14:36:00+00:00, dtype: float64  
2013-02-05WARNLogging limit exceeded; some messages discarded  

With the problem day included in the algo, the DataFrame returned by batch_transform keeps growing, one row per minute, for the first 14 minutes of the next day:

2013-01-03PRINTIn batch_transform length of prices is 766  
2013-01-04PRINTIn batch_transform length of prices is 767  
2013-01-04PRINTIn batch_transform length of prices is 768  
2013-01-04PRINTIn batch_transform length of prices is 769  
2013-01-04PRINTIn batch_transform length of prices is 770  
2013-01-04PRINTIn batch_transform length of prices is 771  
2013-01-04PRINTIn batch_transform length of prices is 772  
2013-01-04PRINTIn batch_transform length of prices is 773  
2013-01-04PRINTIn batch_transform length of prices is 774  
2013-01-04PRINTIn batch_transform length of prices is 775  
2013-01-04PRINTIn batch_transform length of prices is 776  
2013-01-04PRINTIn batch_transform length of prices is 777  
2013-01-04PRINTIn batch_transform length of prices is 778  
2013-01-04PRINTIn batch_transform length of prices is 779  
2013-01-04PRINTIn batch_transform length of prices is 780  
2013-01-04PRINTIn batch_transform length of prices is 780  
2013-01-04PRINTIn batch_transform length of prices is 780  

My feeling is that this is a problem, but I seem to be talking to myself....maybe no one is using batch_transform with minute data?
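
The row counts above (766 instead of 780 for a 2-day window) can be checked outside the backtester. What follows is a rough sketch in plain pandas, not Quantopian code. It assumes a regular session of 390 minutes stamped 14:31 to 21:00 UTC (true for these January dates only, because of daylight saving), and the helper names `session_minutes` and `missing_minutes` are hypothetical.

```python
import pandas as pd

MINUTES_PER_SESSION = 390  # assumed regular session length

def session_minutes(day):
    """Minute stamps of one session, assumed to run 14:31 to 21:00 UTC."""
    return pd.date_range(day + " 14:31", periods=MINUTES_PER_SESSION,
                         freq="min", tz="UTC")

def missing_minutes(observed_index, days):
    """Return the expected minute stamps that are absent from observed_index."""
    expected = session_minutes(days[0])
    for day in days[1:]:
        expected = expected.union(session_minutes(day))
    return expected.difference(observed_index)

days = ["2013-01-02", "2013-01-03"]
full = session_minutes(days[0]).union(session_minutes(days[1]))

# Simulate the outage: no prices between 18:37 and 18:52 on 2013-01-03.
gap = pd.date_range("2013-01-03 18:38", "2013-01-03 18:51", freq="min", tz="UTC")
observed = full.difference(gap)

missing = missing_minutes(observed, days)
print(len(full), len(observed), len(missing))  # 780 766 14
```

Running a check like this on the index returned by the transform would show whether a short DataFrame is explained by missing minutes or by something else.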

I worried this morning that the 'clean_nans' parameter was the cause and to be honest I had forgotten all about this option. I've tried setting it to both True (default) and False i.e.

@batch_transform(window_length=1, refresh_period=0, clean_nans=False)  

but it makes no difference. What I now see is that, when batch_transform returns a short DataFrame, the first price repeats in the algo (1 + number of missing data items) times. So with my 14 missing minutes the same price (and the same timestamp) is processed in the algo 15 times:

2013-01-03PRINT Error - Missing Data!376  
2013-01-03PRINT0  
2013-01-03PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT1  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT2  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT3  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT4  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT5  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT6  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT7  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT8  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT9  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT10  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT11  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT12  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT13  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT14  
2013-01-04PRINT24 547.158 Name: 2013-01-03 14:31:00+00:00, dtype: float64  
2013-01-04PRINT15  
2013-01-04PRINT24 545.33 Name: 2013-01-03 14:32:00+00:00, dtype: float64  
2013-01-04PRINT16  
2013-01-04PRINT24 545.56 Name: 2013-01-03 14:33:00+00:00, dtype: float64  

I don't understand why this hasn't been noticed before.

P.

Hi Peter - the missing gap in prices is quite real. There was a big outage that day.

That said - it sounds like you're getting some unexpected behavior within that gap. Can you explain that for me?

Hello Dan,

(i) The DataFrame returned by batch_transform is short in a minutely algo if there is missing data. It then grows by one row per minute until it reaches 390 * window_length rows. I think a better solution would be to insert the missing timestamps in the index with NaNs as the prices.

(ii) As detailed in a post above, the first time/price is processed in the algo multiple times, i.e. (1 + number of missing minutes) times.

(iii) Actions such as DataFrame.resample() on the short DataFrame produce unexpected results, such as generating an extra time period, e.g. 261 30-minute periods from a 20-day window instead of the expected 260 (20 days x 13 bars per day).

(EDIT: without thinking it through too much, I would say that (i) and (ii) occur if the missing data is in the first 390 * window_length minutes of the backtest period. I can't predict what happens if it occurs later.)
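
The suggestion in (i) can be sketched with a plain pandas reindex. This is only an illustration of the idea, not how batch_transform works internally: the prices are synthetic, the helper name `pad_missing_minutes` is hypothetical, and the expected index assumes two sessions of 390 minutes stamped 14:31 to 21:00 UTC.

```python
import numpy as np
import pandas as pd

def pad_missing_minutes(prices, expected_index):
    """Give every expected minute a row; minutes with no data become NaN.

    Hypothetical helper for illustration only.
    """
    return prices.reindex(expected_index)

# Expected index for a 2-day window (assumed session: 14:31 to 21:00 UTC).
day1 = pd.date_range("2013-01-02 14:31", periods=390, freq="min", tz="UTC")
day2 = pd.date_range("2013-01-03 14:31", periods=390, freq="min", tz="UTC")
expected = day1.append(day2)

# Synthetic prices for one column (24), then drop the 14 outage minutes.
full = pd.DataFrame({24: np.linspace(553.0, 542.0, len(expected))}, index=expected)
gap = pd.date_range("2013-01-03 18:38", "2013-01-03 18:51", freq="min", tz="UTC")
short = full.drop(gap)

padded = pad_missing_minutes(short, expected)
print(len(short), len(padded), int(padded[24].isna().sum()))  # 766 780 14
```

With a fixed-length frame, positional lookups such as row 389 or row 779 always refer to the same minute of the window. Whether the NaN rows should then be left as they are, dropped, or forward-filled is a separate choice that depends on the calculation.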

P.