fetch_csv() date details: Live Trading vs Back Testing

The "Working With Multiple Data Frequencies" section of the docs notes:

When pulling in external data, you need to be careful about the data
frequency to prevent look-ahead bias. If you are fetching daily data,
but are running a minutely-based algorithm, the daily row will be
fetched at the beginning of the day, instead of the end of day. To
guard against the bias, you need to use the post_func function.

I'm trying to accurately back test an algo which fetches a CSV with signal data. The CSV has several rows per date (not time-granular), including historical data. When live trading, another system will add a few rows of data just before midnight Eastern Time. This means that the next day, the algo will grab the CSV and only have data up through the previous day. For example, when the CSV gets updated by the external system on 7/28/2015 at 11:58pm EST, it will make rows with a datestamp "7/28/2015" in the CSV rows. Tomorrow, 7/**29**/2015, the algo will get the CSV and only see signal data up to the 28th.

When back testing, the CSV already contains signal data for every trading day, including the current one, not just the days before it. My question is: when back testing (minutely, not daily), does each trading day use CSV data up to and including the rows with the current trading day's date? Or only up to but not including that date?

The "Using Fetcher to create a custom universe" section of the docs implies each day's custom universe will be autofilled from past data if no data is available for the trading day. In live trading, this will be the case every day (only yesterday's data is available). In back testing, there are no missing days of data, so I'm not sure how exactly look-ahead bias is prevented.

5 responses

It is, unfortunately, up to and including. If you want to do the right thing, you need to shift the data. Furthermore, in live trading, you need to populate the "today" date from within the csv pre/postfunc, which is nontrivial to discover.

I used the Quantopian trading calendar functionality to find the trading day following the last date of the csv, then did a .shift(1) to bring everything forward. I believe this produces correct results in both back testing and live trading.

Also, I am not sure what they are talking about with the back filling. In my experience, if the csv did not have data for precisely today in live trading, it would throw an exception.

That's good information. Would you be willing to share that pre/postfunc code?

Thanks in advance.

Thanks @Simon. That's unfortunate, as the freshest data one can provide an algo for a trading day is 9.5 hours old! (The CSV is due at midnight, and trading starts at 9:30am.)

@Klon, the time-shifting happens by setting the post_func of fetch_csv to a function like:

def post_func(df):
    # Move every row's timestamp forward one calendar day; values are unchanged.
    df = df.tshift(1, freq='D')
    return df

The D means it shifts the timestamps forward by one calendar day. The Quantopian documentation shows using b, which shifts to the next business day instead. For more details, see the pandas documentation on offset frequencies for tshift.
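To see the difference between the two frequencies, here is a small sketch on made-up data (the dates and the `signal` column are hypothetical). It uses `shift(1, freq=...)`, which moves the index the same way `tshift` does and is what newer pandas versions provide now that `tshift` has been removed.

```python
import pandas as pd

# Hypothetical signal rows, stamped with the date they were written:
# Thursday, Friday, and Monday.
df = pd.DataFrame(
    {"signal": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2015-07-23", "2015-07-24", "2015-07-27"]),
)


def post_func(df, freq="B"):
    # Shifting with a freq moves the index labels and leaves the values alone.
    return df.shift(1, freq=freq)


by_calendar_day = post_func(df, "D")
by_business_day = post_func(df, "B")

# Friday's row lands on a Saturday with 'D', and on Monday with 'B'.
print(by_calendar_day.index[1].date(), by_business_day.index[1].date())
```

Note that a business-day shift only skips weekends. Market holidays would still need a real trading calendar.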

I can post what I do in a couple of weeks!
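In the meantime, the trading-calendar idea described earlier in the thread could be sketched roughly as follows. This is not Simon's code. It assumes `sessions` is a sorted DatetimeIndex of trading days (a hypothetical stand-in for the Quantopian trading calendar, approximated here with plain business days), and `restamp_to_next_session` is a made-up helper name. Instead of a positional `.shift(1)`, it re-stamps each row with the first session strictly after its own date, which also works when the CSV has several rows per date.

```python
import pandas as pd


def restamp_to_next_session(df, sessions):
    """Label each row with the first trading session after its own date."""
    # For each row date, find the position of the first session strictly later.
    pos = sessions.values.searchsorted(df.index.values, side="right")
    if (pos >= len(sessions)).any():
        raise ValueError("calendar must extend past the last date in the data")
    out = df.copy()
    out.index = sessions[pos]
    return out


# Assumption: business days stand in for the real trading calendar.
sessions = pd.bdate_range("2015-07-01", "2015-08-31")

# Hypothetical CSV contents: two rows for Friday, one for Monday.
df = pd.DataFrame(
    {"signal": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2015-07-24", "2015-07-24", "2015-07-27"]),
)

shifted = restamp_to_next_session(df, sessions)
print(shifted)
```

With this kind of shift, a row written just before midnight on a given date only becomes visible to the algo on the following session, which matches what live trading would see.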