Draft Proposal of New Data History API

Sep 27, 2013

Thanks Eddie,

I'll take a look as time allows.

Grant

Eddie,

It appears that you'll be addressing the "warm-up" issue with backtesting:

Daily History and Back Test Start

Unlike batch transforms, daily history is available on the first day of the backtest, data availability permitting.

Unlike batch transforms, the data history does not require the algorithm to run for the number of days in the window before returning a value.

The data is backfilled so that calculations can be done more immediately.

Will this also apply to minute-level backtesting (since you only refer to "daily history")?

Grant

Grant, yes the proposal for back filling of data will apply to both minute-bar and day-bar history.

(Just edited the document to remove the prefix of 'daily', which had been an artifact of a previous organization of the document.)

Disclaimer

Thanks Eddie,

Regarding your forward filling, it seems that the data could be structured differently, which would avoid the problem of empty bars for Security A relative to Security B. As an option (or an example using pandas functionality), wouldn't it be possible to provide the data as independent pandas time series? This would be more consistent with the idea that the backtester is "event-driven", with each security having its own stream of events. By forward filling, you are creating events where there were none historically.

I'm not sure if you made it clear in your document, but as I understand, the filling is only done for the multiple security case. If the dataframe only contains one security, there is no way to know if bars are missing, correct? Or are you keeping track of wall clock time, to determine if minutes are skipped (or days, in the case of daily trading)?

Also, what is the remedy if the first bar is empty? The filling will fail, right? Should there be a warning?

Grant

It makes sense to settle on pandas dataframes for time series data. "5m" for five minutes of data seems a bit strange; anyone who's used any other data platform may wonder, "but how much five-minute bar data is there?" I.e., that notation is usually used for periodicity, not for backfill.

Also re forward filling, are you forward filling a nil tick to produce a degenerate bar with 0 volume? Or are you actually forward filling whatever the last bar was, causing any code that has exponential decay, volume accumulation or really any calculation at all to change...?

My first thought was the same as Simon's. It might be better for

data.history.minutes('15m', 'price')

to be

data.history.minutes(15, '1m', 'price')

which also makes

data.history.minutes(26, '15m', 'price')

possible in the future.

I don't agree with Grant on the 'independent' time series. I think a DataFrame of one SID or of many SIDs should have all missing timestamps inserted.

I see Simon's point on forward filling, and I would suggest that the default be to fill with NaNs. To answer Grant: a missing first bar can be filled from the day before, so maybe the earliest backtest start date becomes 01/04/2002 instead of 01/03/2002.

It all looks very promising.

Hello Peter,

We'll see what the response is, but my guess is that the proposed code will not fill forward missing bars if only one sid is specified.

Regarding the independent time series, by sid, I think that so long as the filling can be turned off, then one can always generate the time series that I'm talking about, with no filling (but with irregularly spaced datetime stamps).

Grant

Sep 29, 2013

That's interesting. I had also thought of implementing a kind of history object myself for a Python backtester.

My idea was to use a kind of rolling DataFrame (like a circular buffer) to store OHLCV prices and volumes for different timeframes.

I have some difficulty understanding how to access the price history of a given sid.

I also don't understand what happens (in terms of memory usage) for a strategy that uses several sids:

prices_a = data.history.minutes('15m', 'price')

Does this mean that you are building an internal 15-minute OHLCV DataFrame for each sid?
What is the length of this DataFrame? (Does its size grow? If not, how can the DataFrame size be set?)

I also wonder how to access "latest_price" and "latest_volume" for a given sid in handle_data.

Is there a way to feed this object so as to have access to both bid price and ask price?
(And also the spread?)

How do I access "last_datetime"? (It could be useful if, for example, your strategy only works on Fridays.)
If you want to take your strategy live, you will need to know the current datetime in handle_data.

A good way to test this history object could be to write a Python / pandas / zipline script that builds the history object in memory for a given sid from a (live) tick data feed.
Another script could display candlesticks for a given timeframe using Matplotlib.
These two scripts could communicate with each other using IPC or sockets or, much better,
a messaging system (like ZMQ, RabbitMQ...).
See for example
http://www.youtube.com/watch?v=BXJelQyIYnY
which uses IPC.

About TALib - https://github.com/mrjbq7/ta-lib/tree/master/talib
As of 12 days ago, you can feed a pandas DataFrame directly to the TA-Lib abstract functions:
import talib
from talib import abstract
abstract.SMA(df)

A last comment:
Maybe you should also design your history object so that it can easily be extended to handle other data (not only OHLCV), for example RangeBar or Renko charts.

PS: please excuse my poor english

Sep 29, 2013

Perhaps you've thought of it, but I'd have a look at a couple cases:

1. What happens when you feed asynchronous data (e.g. datetime stamps that don't fall on the whole minute)?
2. You've talked about adding other types of securities/futures/etc. (e.g. Forex). In general, it seems that you'll need to deal with 24/7/365 asynchronous data feeds, right?
3. I agree with Simon's sentiment above that you need to provide exact details on how the filling will be done. Also, if users want to do their own filling, you might provide details.
4. Are you planning to provide guidance on how to generate a length N trailing window of OHLCV bars (e.g. 15 minute) on a rolling basis, updated every minute? This would basically be a specialized moving (rolling) statistic like the ones described on http://pandas.pydata.org/pandas-docs/dev/computation.html.

Grant

Sep 29, 2013

Hello Grant,

I assume the answer to 4. would be:

N = 15    # window length  
strNminutes = str(N) + 'm'

def handle_data(context, data):  
    price_history = data.history.minutes(strNminutes, 'price')

Elegant! But only the closing price. Repeat for O, H, L and V.

Peter,

Perhaps I wasn't clear on my #4 above. If you have a look at the moving (rolling) statistics available in pandas (http://pandas.pydata.org/pandas-docs/dev/computation.html), they actually return a new pandas time series, with statistics applied over a rolling window. I think that the proposed new functionality would just return the data over the last N minutes. For example, this would yield 15 minutes of price data:

minute_prices = data.history.minutes('15m', 'price')

I'm thinking of a rolling_bar function, in analogy, for example, to the rolling_mean function described on http://pandas.pydata.org/pandas-docs/dev/computation.html under "Moving (rolling) statistics / moments." Basically, there ought to be a way to generate OHLCV data on a rolling basis from the original minute-level data. Note that this is different from a trailing window of 15 OHLCV minute-level bars.

Are you perhaps suggesting the same thing above with:

data.history.minutes(26, '15m', 'price')

Would this return a pandas time series, with 26 rows, each containing a 15-minute OHLCV re-sampled bar?

The Quantopian folks provide some guidance in their draft under the "Resampling" section, but it seems that the whole approach could be re-cast into the rolling_bar statistic that I'm suggesting to yield the OHLCV data with one call.

Grant
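For illustration, the rolling_bar idea described above could be sketched in plain pandas roughly as follows. The function name rolling_bar comes from the post, but its signature, the column names, and the sample data are assumptions made for this sketch; nothing here is part of the proposed API.

```python
import numpy as np
import pandas as pd


def rolling_bar(bars, window):
    """Trailing `window`-row OHLCV bar, recomputed at every row of `bars`."""
    out = pd.DataFrame({
        # Open of the oldest minute in each window.
        "open": bars["open"].shift(window - 1),
        "high": bars["high"].rolling(window).max(),
        "low": bars["low"].rolling(window).min(),
        # Close of the newest minute in each window.
        "close": bars["close"],
        "volume": bars["volume"].rolling(window).sum(),
    })
    # Rows without a full window behind them are left undefined.
    out.iloc[: window - 1] = np.nan
    return out


# Hypothetical minute bars: 20 minutes of steadily rising prices.
minutes = pd.date_range("2013-09-27 09:31", periods=20, freq="min")
level = np.arange(20, dtype=float)
minute_bars = pd.DataFrame(
    {"open": level, "high": level + 1, "low": level - 1,
     "close": level, "volume": 10.0},
    index=minutes,
)

rolling_15 = rolling_bar(minute_bars, 15)
```

The result has one row per minute, each row summarising the 15 minutes ending at that minute, which is different from a trailing window of 15 separate one-minute bars.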

Hi all, First off – thank you for your questions and suggestions, your feedback is important to us and we definitely want to get this feature right! Let me speak to the two major themes you have all gravitated towards first. Then, I have several individual replies. We will continue to update this thread throughout the week, as well as updating the documentation to reflect changes as they are made.

(1) Filling in for missing data – you will have the following options to handle missing data points:

'ffill=False', would have the same DatetimeIndex as if the stock were fully liquid (i.e. the minutes and days would be all of the market open minutes and days), with the empty minutes or days having a 'np.nan' for all of the OHLCV values.

'ffill=True' - fill the pricing information forward with the last known value (with the exception of inserting a "0" for volume)

In the case that there is no last known value the time series will be padded with leading nans, until a price for that stock occurs, after which the behavior will be defined by the options above.
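A minimal pandas sketch of the two filling options as described above, using made-up data for one thinly traded stock. How volume is treated before the first trade is not stated in the post; leaving it as NaN is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical sparse minute bars: the stock traded in only two minutes.
traded = pd.DataFrame(
    {"price": [10.0, 10.2], "volume": [100.0, 300.0]},
    index=pd.to_datetime(["2013-09-27 09:32", "2013-09-27 09:34"]),
)

# Every market minute in the window, whether or not the stock traded.
minutes = pd.date_range("2013-09-27 09:31", "2013-09-27 09:35", freq="min")

# ffill=False: same index as a fully liquid stock, NaN where nothing traded.
no_fill = traded.reindex(minutes)

# ffill=True: carry the last known price forward and report zero volume
# for minutes with no trades. Leading NaNs remain until the first trade.
filled = no_fill.copy()
filled["price"] = filled["price"].ffill()
no_trade = no_fill["volume"].isna() & filled["price"].notna()
filled.loc[no_trade, "volume"] = 0.0
```

Here the 09:31 row stays NaN in both frames because no earlier price exists, while 09:33 and 09:35 carry the previous price with zero volume in the filled frame.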

(2) Syntax for specifying frequency vs. duration of historical data – there was much internal debate on the right syntax for a user to separately specify both the frequency and the duration (or window length) of the ‘history’ object – and it’s very possible we’re still not quite there – so suggestions on that front are very welcome.

Our goal is not to recreate or supersede pandas' elegant resampling functionality – but rather to give you the ability to define upfront how much data your algo will need – and let you code from there. The simplest thing to do would be to always dump the highest frequency data we allow (minutely) and only ask you how far back you want to look (15 minutes, 2 days etc.). The problem we saw there was that many users have algos that operate on a much longer timeframe and across a large universe of securities, and saddling those folks with minutely data to constantly be discarding seemed onerous. So we opted to provide the ability for your algorithm to ask for EITHER daily or minutely data (that’s the frequency piece) and then separately to define how far back you want to look.

Now for the one-offs:
@WorkingForCoins with your questions about the most recent bar, there is a section in the document about getting the most recent bar and those previous to it here, https://github.com/quantopian/quantopian-drafts/blob/master/data-history-draft.md#current-and-previous-bar
In addition, there is a method, currently available on the live site, called "get_datetime" which returns the minute that "handle_data" is handling.
Is that what you were looking for?

Also, WorkingForCoins that new feature of TALib with direct handling of DataFrames is good news. I just opened this issue in Zipline to keep track of it, https://github.com/quantopian/zipline/issues/222

@Grant – in the interest of robustness we have decided not to handle each security’s data as an independent time series – but rather to pad the historical data window specified so that each minute of data is represented regardless of whether there are 1, 10 or 100 securities in the universe.

One way to get your desired stream of an individual stock's history would be to use the pd.Series method "valid()", which returns all non-nan values.

history = data.history.minutes("50m", 'price')

for s in data:  
    sid_history = history[s].valid()
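Note for readers on current pandas: Series.valid() was later deprecated and removed (pandas 1.0), and dropna() gives the same result. A small sketch with a stand-in frame (the frame and its column names are invented here; it only imitates the shape a history call might return):

```python
import numpy as np
import pandas as pd

minutes = pd.date_range("2013-09-27 09:31", periods=5, freq="min")

# Stand-in for a NaN-padded history frame: one column per security.
history = pd.DataFrame(
    {"A": [1.0, 1.1, 1.2, 1.3, 1.4],
     "B": [np.nan, 5.0, np.nan, np.nan, 5.5]},
    index=minutes,
)

# Per-security series with the padding removed, so each keeps only the
# (irregularly spaced) timestamps at which it actually has a value.
per_sid = {sid: history[sid].dropna() for sid in history.columns}
```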


Thanks Eddie,

With regard to your "(1) Filling in for missing data..." above, it sounds like even for the case of one security, you'll be filling empty minutes with NaNs, correct?

Grant

Grant, it is correct that there will be values for every market minute, irrespective of the number of securities being used by the algorithm.
One distinction is that the "ffill" flag will determine whether that filling is done with "np.nan"s or forward-filled with the previously seen price.


A few corner cases to consider:

Trading in a given security is suspended, but the market in which it trades is still open. Presumably, the filling rules still apply, correct?
What if the market shuts down during a normal trading day (e.g. because of a computer glitch)? Would all securities get OHLCV data filled during the shutdown? Or would the market minutes get skipped for all securities?

Grant

Hello Eddie,

As I said, it all looks good. But as the '15m' in

minute_prices = data.history.minutes('15m', 'price')

is very different to the '15Min' in

converted = ts.asfreq('15Min', method='pad')

please consider something like

minute_prices = data.history.minutes(15, '1Min', 'price')

Under "TALib Port" it is not clear how the TA-Lib parameters are set.

def initialize(context):  
    set_universe(DollarVolumeUniverse(95, 95.1))

def handle_data(context, data):  
    price_history = data.history.days('30d', 'price')  
    macd_result = talib.MACD(price_history)

If the defaults are not used, do the parameters get set within the call to MACD in handle_data?

Grant

I don't know a lot about TA-Lib port into Zipline (which will be probably improved)
But what I can say is that with latest TA-Lib Python port you can do

abstract.MA(df, timeperiod=3, matype=0)

But abstract.MA(df, {'timeperiod': 3, 'matype': 0}) doesn't work

If parameters are stored in a dict you need to unpack this dict like

abstract.MA(df, **{'timeperiod': 3, 'matype': 0})

Thomas Wiecki

As to getting 15 minutes OHLC, this should be fairly easy:

data.history.minutes('60m', 'price').resample('15m', how='ohlc')

Should give you the OHLC of the last hour for each 15m bin (so a dataframe with 4 rows). Is that an acceptable way?

I think we should try to leverage pandas as much as possible here, as it comes with great resampling capabilities, instead of replicating it in the Quantopian API.
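For readers trying this on a current pandas release: the how= argument to resample was later removed, and the aggregation is now a method call on the resampler. A sketch on stand-in data (the price series is invented; the default bin edges and labels pandas uses may not match how the platform timestamps its bars):

```python
import numpy as np
import pandas as pd

minutes = pd.date_range("2013-09-27 10:00", periods=60, freq="min")

# Stand-in for 60 minutes of prices for a single security.
prices = pd.Series(np.arange(60, dtype=float), index=minutes)

# Older pandas spelling: prices.resample('15Min', how='ohlc')
bars = prices.resample("15min").ohlc()
```

This yields four rows, one per 15-minute bin, with open, high, low and close columns.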


Again, when I read that, I instinctively think, "How can they resample 60m bars into 15m bars?", before realizing again that you've used "60m" to refer to a length of time, rather than a period.

data.history.minutes(60, 'price').resample('15m', how='ohlc')

Hello Thomas,

That makes sense, but it has the potential to confuse, i.e. 'm' and 'Min' have to be used in the same line:

data.history.minutes('60m', 'price').resample('15Min', how='ohlc')

which I think is the correct syntax. And how do we parameterise these? This

lookBackDays = 10  
strPeriod = str(lookBackDays * 390)  
strBarSize = '15Min'  
data.history.minutes(strPeriod, 'price').resample(strBarSize, how='ohlc')

is messy. I don't know enough pandas - is this a pandas trait?
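On the pandas side, at least, the bar size does not have to be assembled as a string: resample also accepts an offset object built from an integer. A sketch under that assumption, with invented variable names and a stand-in price series (whether the history call itself would accept an integer window was still open at this point in the thread):

```python
import numpy as np
import pandas as pd

MINUTES_PER_SESSION = 390          # 6.5 hours, regular U.S. equity session
look_back_days = 10
bar_minutes = 15

bar_count = look_back_days * MINUTES_PER_SESSION   # plain integer
bar_size = pd.offsets.Minute(bar_minutes)          # offset object, no '15Min' string

# Stand-in for one session of minute prices.
minutes = pd.date_range("2013-09-27 09:31", periods=MINUTES_PER_SESSION, freq="min")
prices = pd.Series(np.linspace(100.0, 101.0, len(minutes)), index=minutes)

bars = prices.resample(bar_size).ohlc()
```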

Jessica Stauth

Simon - definitely agree it's not perfection, but we did bat around a number of alternatives which all had problems. How would you expect it to look?

Would you prefer something like:
data.history(frequency='minutes', duration='60', value='price')
where frequency can be 'minutes' or 'days', duration can be any integer, and value can be price, volume, etc?

or do you want to ask for any arbitrary 'bar size' upfront like this:
data.history(barsize="15m", count=30, value='price')

And my only issue with option #2 is that it feels like we're rolling a lot of duplicative resampling functionality inside the API, where it might be more flexible to leave it up to the user to do this in pandas. But I might be throwing the baby out with the bathwater on that concern - would love your thoughts.

Jess


Well, I guess the answer to that question depends on, "Who are your users?" :)

Another question I would pose, rhetorically, is, "Who really wants daily data in a minutely backtest/live trading anyway?" That is to say, if I have a daily strategy, ought I expect implicitly that it be evaluated exactly once a day?

If I were a TA man, coding a 50day > 200day moving average crossover, would I ever want that inequality tested every minute?

If the answers to those questions are "nobody", "yes" and "no", then your API requirements are simplified:

data.history(length=60, frequency='1m')

Just a thought.

EDIT - the above isn't "simplified", my mistake. What I meant was that the bar frequency would drive the simulation or vice versa, so that it would be implicit.

Thomas Wiecki

@Peter: Yes, you are correct. The syntax should be the same in both cases.

I think Simon has a point. I am not sure we really need the frequency argument, unless I'm misunderstanding. Would frequency='5m' only give you every fifth minute or do OHLC over 5 min bins?

Also, when one wants day OHLC in minute mode one could do:

data.history(length=10*6.5*60).resample('1D', how='ohlc')

Simon's example would become:

data.history(length=60).resample('5Min')

to keep only every fifth minute.


I think you very rarely want every fifth minute, one expects it will be the OHLC of the five minutes.

Thomas Wiecki

Simon: So are you proposing to integrate .resample(freq, how='ohlc') into data.history()? I.e.

data.history(length=60).resample('5Min', how='ohlc') == data.history(length=60, frequency='5Min')


If you are going to provide 5 minute bars in a simulation, they should be a bar for the entire previous five minutes, not the fifth minute bar with the previous four ignored, that was my point there.
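The difference between the two readings of a five-minute frequency can be seen on a small made-up series: sampling keeps one observation per bin, while aggregation summarises all five. This only illustrates pandas behaviour and says nothing about what the history call would return.

```python
import pandas as pd

minutes = pd.date_range("2013-09-27 10:00", periods=10, freq="min")
prices = pd.Series(
    [5.0, 9.0, 1.0, 4.0, 6.0, 7.0, 2.0, 8.0, 3.0, 6.5], index=minutes
)

# Sampling: keep one observation every five minutes, ignore the other four.
sampled = prices.asfreq("5min")

# Aggregation: summarise all five minutes that fall in each bin.
bars = prices.resample("5min").ohlc()
```

The sampled series never sees the 9.0 high or the 1.0 low inside the first bin, whereas the OHLC bar records both.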

More generally, I think you should consider whether the flexibility to request data of different frequencies than that in which the algorithm is being evaluated is worthwhile. If not, then you can just fix a strategy as "Daily" or "Minutely", feed it daily or minute bars once per period and forget about the frequency argument entirely, with the only parameter being how far back the data has to go.

Hello all,

Some feedback (perhaps re-iterating some comments above):

1. The fields/flags and their syntax should be consistent across Quantopian and pandas.
2. Would it be possible to just keep things simple, and provide data.history(length=N), and then provide examples of how to apply pandas? N would correspond to the number of market ticks when the security could have traded (N days or N minutes, depending on the base frequency of the backtest). The trailing window of data would be available at the start of the backtest (rather than after a "warm-up" period, as presently required). By the way, I'm not sure that this would be any different than the current batch transform with refresh_period = 0, so I've been wondering (aside from eliminating the warm-up issue) why you are adding the history functionality.
3. It is still kinda murky what happens when Quantopian includes markets other than U.S. equities (e.g. foreign). It seems that the data structure won't work, since the datetime stamps won't align (e.g. Asian, European, and U.S. markets all have different open/close times). It seems that the database should have another flag, besides the OHLCV data, to indicate whether the market was open for a given security, right?

Grant

Hello all too,

I don't understand the need for data alignment, or why you don't want to have a "rolling OHLCV DataFrame" for each timeframe / symbol.
I may be wrong, but this is how I would approach the problem.

First, in the initialize method, I would create variables to manage the history.
Let's say that each rolling DataFrame has a history depth of 65535 points and that my strategy needs to store a 1-minute timeframe and a daily timeframe for both EURUSD and EURCHF.

Internal history object could be a dict like this:

{
  'EURUSD': {
      'bid': value,
      'ask': value,
      'last': value,
      'last_vol': value,
      'last_type': ['buy'|'sell'],
      'tick': <rolling_tick_data>,
      'OHLCV': {
          '1Min': {
              'bid': <rolling_ohlc_data>,
              'ask': <rolling_ohlc_data>
          },
          '5Min': {
              ...
          },
          ...
          '15Min': {
              ...
          }
      }
  },

  'EURCHF': {
      'bid': value,
      'ask': value,
      'last': value,
      'last_vol': value,
      'last_type': ['buy'|'sell'],
      'tick': <rolling_tick_data>,
      'OHLCV': {
          '1Min': {
              'bid': <rolling_ohlc_data>,
              'ask': <rolling_ohlc_data>
          },
          '5Min': {
              ...
          },
          ...
          '15Min': {
              ...
          }
      }
  }
}

The initialize function would contain lines like:

N = 65535  
data.history.store('EURUSD') # will create 'EURUSD' key into history dict and every other sub-keys  
data.history.create_tick_buffer('EURUSD', N)  
data.history.create_ohlcv_buffer('EURUSD', '1Min', N)  
data.history.create_ohlcv_buffer('EURUSD', '1D', N)

data.history.store('EURCHF')  
data.history.create_tick_buffer('EURCHF', N)  
data.history.create_ohlcv_buffer('EURCHF', '1Min', N)  
data.history.create_ohlcv_buffer('EURCHF', '1D', N)

I really think it's necessary to store both uncompressed data (such as tick data) and compressed data (like OHLCV),
because you can't rely on resampling a large amount of tick data in the 'handle_data' method.
Would you resample tick data, or even 1-minute data, to get the daily OHLCV bar from 7 days ago?
If you do, it will probably be very slow and also very memory-consuming!

An implementation of this RollingOHLCV could be:
(it should probably be improved)

import numpy as np
import pandas as pd


class RollingOHLCVData(object):
    """Fixed-size rolling buffer of OHLCV bars; label 0 is the current bar, k is k bars back."""

    def __init__(self, N):
        self.size = N

        # Index labels run N-1 .. 0 so that label 0 is always the newest bar.
        self.df = pd.DataFrame(
            np.nan,
            index=np.arange(N - 1, -1, -1),
            columns=['open', 'high', 'low', 'close', 'volume'],
        )

        self.flag_first_append = True

    def append(self, price, volume, new_candle=False):
        if new_candle or self.flag_first_append:
            print("new_candle")
            self.flag_first_append = False

            # Age every bar by one slot; the oldest bar drops off and slot 0 is freed.
            self.df = self.df.shift(-1)

            # .loc avoids chained assignment, which may silently write to a copy.
            self.df.loc[0, ['open', 'high', 'low', 'close']] = price
            self.df.loc[0, 'volume'] = volume

        else:
            # Update the bar in progress.
            self.df.loc[0, 'close'] = price
            if price > self.df.loc[0, 'high']:
                self.df.loc[0, 'high'] = price

            if price < self.df.loc[0, 'low']:
                self.df.loc[0, 'low'] = price

            self.df.loc[0, 'volume'] += volume

    def __repr__(self):
        return self.df.to_string()
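As a point of comparison, the same circular-buffer idea can be sketched with a bounded deque from the standard library, which drops the oldest bar automatically. This is an alternative illustration, not the poster's implementation; the class name RollingBars and its methods are invented for the sketch.

```python
from collections import deque


class RollingBars:
    """Keep at most `size` OHLCV bars; index 0 is the bar in progress."""

    def __init__(self, size):
        # appendleft on a full bounded deque discards the oldest (rightmost) bar.
        self._bars = deque(maxlen=size)

    def append(self, price, volume, new_candle=False):
        if new_candle or not self._bars:
            self._bars.appendleft(
                {"open": price, "high": price, "low": price,
                 "close": price, "volume": volume}
            )
        else:
            bar = self._bars[0]
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price
            bar["volume"] += volume

    def __getitem__(self, k):
        # k = 0 is the current bar, k = 1 the previous one, and so on.
        return self._bars[k]

    def __len__(self):
        return len(self._bars)


buf = RollingBars(size=2)
buf.append(10.0, 100)                    # first bar
buf.append(10.5, 50, new_candle=True)    # second bar opens
buf.append(9.8, 25)                      # tick updates the second bar
buf.append(11.0, 10, new_candle=True)    # third bar; the first bar is dropped
```

Appending a bar is constant-time here, whereas shifting a whole DataFrame on every new candle copies all N rows.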

In handle_data, we could access the price

like data.history['EURUSD']['OHLCV']['1Min']['close'][0]

and you could also access the bar k bars back with data.history['EURUSD']['OHLCV']['1Min']['close'][k]

What should happen if no tick occurs during a minute?

I think we should keep the previous bar's close price and set volume to 0
(we could also add a flag column to the rolling dataframe for that case).

That's just an idea...

Another idea (but maybe after implementing what I wrote above) could be to
use a kind of "lazy" structure that is created depending on what you need.

For example, if a tick arrives for EURUSD and a line of code in
handle_data of your strategy requests the open price from 2 bars before the current bar,
zipline could say "Oh my dear!!! I haven't stored this data... Let's build a structure
able to store what they are requesting! It will be ready for the next event!"

In such a case, maybe accessing data using '[' ']' is not the best idea and we should provide
an accessor for that... maybe we could also overload the [] operator by defining __getitem__

W4C

Hello W4C,

I cannot speak for Quantopian but I assume the first paragraph of their Business Plan specifies an event model based on 1-minute data (from which is derived daily data) implemented in, and leveraging, Python/pandas.

From this I believe data alignment follows as this is the nature of Data Frames and Data Panels. Should Forex data ever be available - other than via 'fetch' - I have no doubt that this will be aggregated into 1 minute OHLCV bars. Additionally, I doubt that Bid/Ask prices will ever be offered within the Quantopian data.

Spread is an important problem in a Forex strategy (or when trading CFDs). I don't understand why Quantopian would ignore this.

When trading CFDs, you can buy/sell contracts (whose underlying is shares) at night (market closed), so the spread is different (higher) than the daytime spread (market open).
Moreover, 1-minute OHLCV bars are not enough if you want to trade in real time (at least for some scalping strategies); you need to feed your strategy with tick data.
Some Forex tick data is available here http://www.truefx.com/ but you can probably find other tick data elsewhere for backtesting.

About live trading: you can get data not only by "fetching" it but also through a broker API such as Interactive Brokers (or probably some others).

here is a sample
https://code.google.com/p/trading-with-python/source/browse/#svn%2Ftrunk%2Flib%2FinteractiveBrokers
(from http://www.tradingwithpython.com/?page_id=504 )
https://code.google.com/p/ibpy/

Hello W4C,

I'm not disagreeing with you. I'm just relaying an impression I get from https://www.quantopian.com/posts/timeframes-discussion

"@Peter - good question - but totally unrelated to the historical data timeframes question I was trying to answer here :)

Quantopian's backtesting and live trading are all built around minute-level pricing data and that is the fastest access that Quantopian-IB link will give you to the market. For the equity markets this is pretty standard for any type of trading that falls outside of market making or what is often referred to as 'high frequency trading' or HFT. As you described, you are exposed to intra-minute price risk."

Hello Peter,

sorry for being so rude...

If Zipline can't provide what I'm looking for (tick data storage, and also different rolling buffers to store OHLCV data for different timeframes), I will probably build something myself, but I prefer to first have a look at what exists (in order not to reinvent a square wheel).

Moreover, there is also another problem with storing volume (is it an integer?).
Quantopian / Zipline seems to be primarily "shares oriented", so volume is an integer, and that's a problem for both Forex and CFDs, where volumes are decimal (with a fixed number of digits).

I see two ideas for this:
- use the Python decimal type http://docs.python.org/2/library/decimal.html
- store all data as integers, and when you read a price or volume, divide it by 10^digits and apply a function to "normalize" it to a given number of digits
(that's what Metatrader 4 does http://docs.mql4.com/convert/NormalizeDouble ), but that's maybe not the best example.

So there is a need for a variable like Digits http://docs.mql4.com/predefined/variables/Digits or, more generally, a function like MarketInfo
http://docs.mql4.com/common/MarketInfo
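The two ideas combine naturally: store integers internally and use Decimal only at the boundary. A minimal sketch, where DIGITS, to_units and from_units are hypothetical names invented here and the precision of 2 is an arbitrary choice:

```python
from decimal import Decimal

DIGITS = 2  # hypothetical per-instrument precision, in the spirit of MetaTrader's Digits


def to_units(value, digits=DIGITS):
    """Convert a decimal string (or Decimal) to an integer count of 10**-digits units."""
    scaled = Decimal(str(value)).scaleb(digits)
    if scaled != scaled.to_integral_value():
        raise ValueError("more than %d decimal places: %s" % (digits, value))
    return int(scaled)


def from_units(units, digits=DIGITS):
    """Convert stored integer units back to an exact Decimal."""
    return Decimal(units).scaleb(-digits)


# Accumulating volume as integers avoids binary floating-point drift.
total = to_units("0.10") + to_units("0.20")
```

With floats, 0.1 + 0.2 is not exactly 0.3; with the integer representation the sum is exactly 30 units, which converts back to 0.30.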

Again, that's my own opinion, which is tied to the markets I'm interested in.

@Simon

"If I were a TA man, coding a 50day > 200day moving average crossover, would I ever want that inequality tested every minute?"

I also think that managing this in handle_data is not a good idea.

Maybe a kind of signal/slot pattern could be useful:

a signal is sent when there is a new candle on a daily candlestick chart,
and it is linked to a slot.
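A bare-bones sketch of what such a signal/slot arrangement might look like in Python. NewDaySignal, connect and on_event are names made up for this illustration; in an algorithm, on_event would presumably be called from handle_data with the current simulation datetime.

```python
import datetime as dt


class NewDaySignal:
    """Call the connected slots once, on the first event of each new date."""

    def __init__(self):
        self._slots = []
        self._last_date = None

    def connect(self, slot):
        self._slots.append(slot)

    def on_event(self, now):
        # Intended to be called on every bar with the current datetime.
        if now.date() != self._last_date:
            self._last_date = now.date()
            for slot in self._slots:
                slot(now)


fired = []
signal = NewDaySignal()
signal.connect(lambda now: fired.append(now.date()))

for stamp in [dt.datetime(2013, 10, 3, 9, 31),
              dt.datetime(2013, 10, 3, 9, 32),
              dt.datetime(2013, 10, 4, 9, 31)]:
    signal.on_event(stamp)
```

The slot runs twice for the three events, once per trading date, so a daily rule such as a moving-average crossover would be evaluated once a day even in a minutely simulation.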

Oct 4, 2013

In order for data.history to accumulate data, does it need to be called for every call to handle_data? Or could it be in a conditional statement (if condition met, then call data.history)?

Also, does data.history have to be in handle_data? Or can it also be in other functions?

For the warm-up, you should provide more detail on how it will be handled, including when the benchmark will start (see my comment on https://www.quantopian.com/posts/benchmark-has-returns-for-day-1-why). Presently, it seems that there is a problem with the start day/minute of the benchmark, since it starts when an order can be submitted (the start of the backtest). Shouldn't it start on the earliest day/minute a gain/loss could be realized?

I suggest adding a "Technical Implementation" section to https://github.com/quantopian/quantopian-drafts/blob/master/data-history-draft.md. It is not clear how data.history will work "under the hood." At the start of a backtest, I gather that you will somehow find all references to sids, and then start accumulating the data for all of them, prior to the start of the backtest. What happens if the list of sids changes during the backtest? Or what if the user wants to delay the accumulation? Also, there must be a memory limitation? You might want to provide guidance on how to manage it, in the limit of a large number of sids and a large window size.

Richard Diehl

Eddie,
So, if I understand it correctly, I can use either data.history.minutes('3d', 'price').resample("1D") or data.history.days('3d', 'price')
to get a daily price series, where the last day uses the most recent partial values. So I can trade at, for example, an hour prior to close, and examine the daily charts (just as I would with stockcharts.com). And I can at the same time look at minute data or other data to fine-tune entries and exits. Sounds perfect. It's also a bit more transparent than batch transform.
Rich

Hello Eddie (and Richard),

I'm kinda unclear, as well. It seems that to apply the pandas resampling to obtain daily bars in a minutely backtest, it would be (as Thomas W. shows above for different timeframes, data.history.minutes('60m', 'price').resample('15m', how='ohlc')):

data.history.minutes('5d', 'price').resample('D', how='ohlc')

Would this return, every minute, the prior 4 days of daily OHLCV data (i.e. a pandas timeseries with 4 rows)? And at the closing minute, would it return 5 days of daily OHLCV data (since at closing, the current day's OHLCV would be available)?

Or would it return 4 full days of OHLCV and a partial day of OHLCV (up to the current minute), i.e. a pandas timeseries with 5 rows, with the 5th row changing every minute?

Grant
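As far as pandas itself is concerned, resampling minute data that ends partway through a session produces a final row built from the minutes seen so far. The sketch below shows only that pandas behaviour on invented data; what window the history call would actually hand to resample is the open question in the post. The helper session_minutes is hypothetical.

```python
import numpy as np
import pandas as pd


def session_minutes(day, count=390):
    """`count` one-minute stamps starting at 09:31 on `day`."""
    return pd.date_range(day + " 09:31", periods=count, freq="min")


# Four full sessions (Mon-Thu) plus the first 60 minutes of Friday.
index = session_minutes("2013-09-30")
for day in ["2013-10-01", "2013-10-02", "2013-10-03"]:
    index = index.append(session_minutes(day))
index = index.append(session_minutes("2013-10-04", count=60))

prices = pd.Series(np.arange(len(index), dtype=float), index=index)

# One row per calendar day; the last row covers only the partial session.
daily = prices.resample("D").ohlc()
```

The result has five rows, and the fifth would change with every additional minute of Friday data.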

Oct 7, 2013

Thank you so much for your thoughtful and passionate reviews of the data.history specification. Several of your suggestions have been incorporated into our latest revision, and we believe it is a much stronger design as a result.

You can see the latest spec here: https://github.com/quantopian/quantopian-drafts/blob/master/data-history-draft.md

We have started implementing the data.history method as specified. We will keep you updated on our progress, and we welcome further feedback.


Oct 7, 2013

Nice. re: the field, does "price" lead to a frame that contains all of open/high/low/close ? Or if one wants to build OHLC bars, one must make four history requests ?

Oct 7, 2013

Hello Eddie,

Could you elaborate on:

Frequency of one day and above (i.e., '1d', '2d', '1W','1M', etc.) have a limit of being within the date range of the backtest. e.g. if the simulation date range was from 2011-01-01 to 2011-12-31, the limit for bar_count at a frequency of '1D' would be 252, the limit for a frequency of '2D' would be 176, for '1M' would be 12, etc.

If the backtest dates were 2011-01-01 to 2011-06-30 what would be returned by:

data.history(bar_count=200, frequency='1d', field='price')

i.e. 200 bars or approx. 126 bars? (I'm guessing '176' above is a typo, not bad arithmetic!) I'm thinking in terms of a 200-day MA in a six-month backtest.

Are multiple history requests allowed? If so is the 2000 data point limit for all or for each?

Simon:

With the current spec, yes, four history requests would have to be made.
But, we had discussed it while working on the spec, and while adding the ability to return multiple at once is out of scope for the first release of the new API, it's something that we may enable in the future via an additional parameter.

Peter:

Starting with the first handle_data call of the backtest, i.e. at bar 0, the history will return with the 199 preceding business days in 2010 and the first business day in 2011.

Multiple history requests are allowed; one goal is to enable multi-factor algorithms.
Some of the limits will be tuned and re-calibrated during development, but the current plan is that the 2000 cap applies only to minute bars, and that daily pricing data can reach all available bars within the universe.

Thanks Eddie,

Regarding the 2000 minute bar limit ("In minute simulations, the number of data points in a history is limited to 2000..."), this sorta rules out the multiple time frames some folks were looking for, since it is only ~ 5 days. I'd assumed you'd provide a much longer window (e.g. 30-90 days of minute data), that could then be re-sampled down to daily OHLCV bars (using pandas).

Or perhaps one can actually obtain 2000 days, in minute mode, with:

data.history(bar_count=2000, frequency='1d', field='price')

Presumably a > 2000 bar minutely window will still be available with the batch transform? Or will it be limited, too?
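
For reference, the pandas re-sampling mentioned above can be sketched as follows. The minute data here is synthetic (two 390-minute sessions with made-up prices and volumes), so this only shows the mechanics and says nothing about what history will actually return.

```python
import numpy as np
import pandas as pd

# Synthetic minute bars for two 390-minute sessions (09:31 through 16:00).
day1 = pd.date_range("2013-10-07 09:31", "2013-10-07 16:00", freq="min")
day2 = pd.date_range("2013-10-08 09:31", "2013-10-08 16:00", freq="min")
index = day1.append(day2)

prices = pd.Series(np.arange(len(index), dtype=float) + 100.0, index=index)
volume = pd.Series(100, index=index)

# Collapse minute prices into daily open/high/low/close, then add summed volume.
daily = prices.resample("1D").ohlc()
daily["volume"] = volume.resample("1D").sum()
print(daily)
```

At 390 minutes per session, a 2000-bar minute window covers only a little over five sessions, which is why the window length matters for this kind of re-sampling.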

Grant

Grant:

Your example would work in both daily and minute modes, and you've zeroed in on one of the main driving features of the new API, i.e., opening up longer time frames to minute backtests.

We may need to improve the language of the spec as it moves from spec to documentation, but in the section that talks about the limits, the units were meant to be bars at the frequency selected in the history call, not bars at the frequency of the backtest.

In the "Illiquidity and Forward Filling" section, I suggest adding a bit more detail:

Presumably, prices for missing bars will be forward-filled, but volumes will be set to zero.
As with the batch transform, if there is no prior bar (i.e. the first bar in the window is missing), there will be no filling.
In the case of all empty bars over the entire window, will the security be listed? Or will the column be dropped?
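
The three cases above can be sketched in pandas as follows. This is only an illustration on made-up data of the behavior being suggested (forward-filled prices, zero volumes, no fill without a prior bar); it is not a statement of how history is implemented.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2013-10-07 09:31", periods=4, freq="min")

prices = pd.DataFrame({
    "sid_a": [np.nan, 10.0, np.nan, 10.2],  # first bar missing: nothing to fill from
    "sid_b": [20.0, np.nan, np.nan, 20.5],  # gaps in the middle of the window
    "sid_c": [np.nan] * 4,                  # empty over the entire window
}, index=index)

volumes = pd.DataFrame({
    "sid_a": [np.nan, 500.0, np.nan, 300.0],
    "sid_b": [200.0, np.nan, np.nan, 100.0],
    "sid_c": [np.nan] * 4,
}, index=index)

filled_prices = prices.ffill()       # carry the last known price forward
filled_volumes = volumes.fillna(0)   # no trades in a missing bar

# Securities with no bars at all in the window (listed or dropped is the open question).
all_empty = list(prices.columns[prices.isna().all()])
print(filled_prices)
print(all_empty)
```

Note that the first bar of sid_a stays NaN after filling, and sid_c stays NaN throughout, so an algorithm would still need to handle both cases.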

Grant

Oct 18, 2013

Hi Eddie,

I'm a bit confused about the relationship between the data that will be available via history, and the stream of events that the backtester uses. It appears that the event stream is already forward-filled, prior to the backtest start (e.g. https://www.quantopian.com/posts/thinly-traded-stocks-why-no-gaps), with no way to turn off the filling. Presumably, the history method won't be operating on the filled data, but rather the unmodified event stream, right? But will handle_data still be running on the forward-filled event stream (with non-zero volumes for non-events and lagging datetime stamps)? If so, I think that this could lead to confusion. For example, if I turn off the forward filling on the history method, then the data provided by history won't match the event stream data (assuming that you continue to forward fill the event stream data with no way to turn off the filling).

Sorry, I'm having trouble articulating the problem... perhaps you or someone else can clarify my point (or set me straight that there is no problem). I think it comes down to laying out how history, handle_data, order, slippage, etc. will manage the multi-sid case with some (or all) of the securities not trading every minute/day.

Grant

Dec 6, 2013

Hello Eddie,

How is the order of the dataframe columns determined (see https://www.quantopian.com/posts/history-api-how-to-order-sids)? Could the columns be ordered by increasing sid number? Presently, it appears that this is not the case.

Also, you never replied to my Oct. 18 post immediately above. If you get the chance, please jot down a response.

Thanks,

Grant

Jessica Stauth

Dec 6, 2013

Hi Grant,

We are getting very close to releasing a first cut of the history() API, so I think your questions will largely be answered then, or at least we'll have a version of the docs we can iterate on.

For your question on ordering, would you mind sharing a use case so that I can get a better handle on how to advise you? In my experience, Python code that relies implicitly on ordering, as opposed to explicitly on indexing, tends to be brittle.

Best, Jess

Dec 7, 2013

Thanks Jess,

One use case is to get the sid data into a numpy ndarray for analysis, in a specific order. Presently, if I understand correctly, the history API does not order the securities in the dataframe output (or the order is not obvious). One approach is to sort the list of sids (e.g. by sid number), and then apply the same ordering to the dataframe (as I show in https://www.quantopian.com/posts/history-api-how-to-order-sids). It would be preferable to be able to apply an arbitrary order to the dataframe. If I poke around in pandas I can probably sort it out. If you or anyone there knows how, just let me know.
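
For what it's worth, here is a minimal pandas sketch of both orderings. The frame is made up for the example: it assumes the columns are plain integer sid numbers, which may not match what history actually returns.

```python
import pandas as pd

# Hypothetical history output: columns are integer sids in no particular order.
prices = pd.DataFrame(
    [[3.0, 1.0, 2.0],
     [3.1, 1.1, 2.1]],
    columns=[24, 2, 8554],
)

# Order the columns by increasing sid number.
by_sid = prices.sort_index(axis=1)

# Or apply an arbitrary, explicit order.
custom_order = [8554, 24, 2]
custom = prices.reindex(columns=custom_order)

# Plain ndarray whose columns follow custom_order.
matrix = custom.values
print(list(by_sid.columns), matrix.shape)
```

Keeping `custom_order` around alongside `matrix` makes the column-to-sid mapping explicit, rather than relying on whatever order the frame arrives in.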

Grant

Jan 17, 2014

Quantopian folks,

How's the full release of the history API coming along? It would be nice to have access to the trailing data down to the minute level (other than via the legacy batch transform or by writing a custom accumulator), with an algorithm "warm up" period.

Grant