How much data should I fetch for history() when resampling to a larger timeframe?

Hi,

In my algo I try to resample minute data into 60-minute bars to calculate a 60-minute RSI. I first fetch the historical data set. Theoretically, the more data I fetch, the more exact the calculated indicator value should be, but in practice it looks different. Here is the code line, for example:

... prices_1m = data.history(context.stocks, 'close', context.rsi_lookback*context.timeframe, '1m').dropna()
...

I've also compared the RSI values with those from Google Finance and other charting tools; they are somewhat different.

Can someone tell me whether this is correct?

Cheers

5 responses

When using history with a list of stocks and '1m', dropna drops the time-series entry for every stock for every minute where any single stock had a missing price (nan, not-a-number). It is the scorched-earth approach, and may explain the unexpected results you're seeing: no stock ends up any better populated than the worst, most thinly traded one. If one stock in the list didn't trade for 40 minutes, those 40 minutes suddenly disappear for all stocks in the list.
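A toy illustration of that scorched-earth behavior, in plain pandas (the tickers and prices here are made up):

    import numpy as np
    import pandas as pd

    # Minute closes for two hypothetical stocks; XYZ missed two minutes.
    idx = pd.date_range('2017-01-03 09:31', periods=5, freq='T')
    prices = pd.DataFrame({
        'AAPL': [100.0, 100.1, 100.2, 100.3, 100.4],
        'XYZ':  [ 50.0, np.nan, np.nan,  50.2,  50.3],
    }, index=idx)

    print(prices.dropna())   # 5 rows shrink to 3 -- AAPL loses those minutes too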

For that history call, try forward and back fill, .ffill().bfill(), instead. Forward fill says, basically: if a minute has nan (no trades), fill it with the previous minute's price, on the assumption that the price didn't change; it operates on each stock individually. Back fill says: if the window for a security starts out with nan's, fill them with the first known price. Forward fill must be done first, otherwise prices from the future would be placed in the past, which is look-ahead bias. bfill is used at all only as an act of desperation, at the start of look-back windows, to avoid even worse results for thinly traded stocks.

prices_1m = data.history(context.stocks, 'close', context.rsi_lookback*context.timeframe, '1m').ffill().bfill()  
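Reusing the toy prices frame from the sketch above, the same gaps handled per column instead:

    print(prices.ffill().bfill())
    # XYZ's two missing minutes become 50.0 (carried forward);
    # AAPL is untouched, and the rows stay aligned across stocks.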

On the other hand, dropna is not always bad; it depends on how the pandas object is structured. For a single stock, dropna is fine. Coming from pipeline, with its different structure, dropna drops only a security whose own values in the pipe's added columns are nan, so it can be beneficial there. Someone will correct me if I'm wrong. This is fine for pipeline:
context.stocks = pipeline_output('pipe_name').dropna()
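For a sense of why that's safe, here is the rough shape of a pipeline_output frame, one row per security and one column per pipe factor (the factor names below are made up), and what dropna keeps:

    import numpy as np
    import pandas as pd

    pipe = pd.DataFrame({
        'momentum': [1.2, np.nan, 0.7],
        'value':    [0.3, 0.5,    0.9],
    }, index=['AAPL', 'XYZ', 'MSFT'])

    print(pipe.dropna())   # only XYZ is dropped; the other rows survive intact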

With history '1d' it isn't an issue, unless I'm mistaken, since it would be odd for any stock to have ever had a day where it never traded at all, with every minute nan and thus the '1d' price not populated.

For the wider audience: why all this concern about nan's?
NaN values slip through comparison checks, for example equal/not-equal or greater-than, and survive math without raising an error; they simply produce wrong results silently. ta-lib usually (or always) will not complain if there aren't too many, but the more there are, the less accurate the resulting figure. It seems to me Quantopian might have the best data available, so ffill and bfill would be the closest one can come for RSI and the like. One might ask oneself: if, using ffill and bfill, backtest RSI or other ta-lib figures don't match some site out there, are those sites the gold standard, or is the data here actually better and these RSI values more accurate?
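A quick demonstration of that silence, in plain numpy:

    import numpy as np

    x = np.nan
    print(x > 0, x < 0, x == np.nan)   # False False False -- no error raised
    print(x + 1, x * 0)                # nan nan -- math carries on silently

    # One nan contaminates anything computed over the window,
    # again with no warning:
    closes = np.array([10.0, 10.1, np.nan, 10.2])
    print(closes.mean())               # nan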

Hi Blue,

Thanks for the reply!

I've tried your code, but the results are the same. Have you tried it? I've also used 'price' instead of 'close'; same result.

Cheers

Also replace the second instance of dropna(). I was so certain about this principle that I hadn't even looked at your code. Having now made those changes and seen the new RSI numbers, the difference is striking. I'd like to know whether the RSI lines up better with other sources now.

Second instance of dropna()? You mean here?
... prices = prices_1m.resample('60T', closed='left').last().ffill().bfill()
...

The results are now indeed different, but very, very far off. :-/

Besides, what I want to know is how much history data I should take for the calculation. For example:

    prices_1m = data.history(context.stocks, 'price', context.rsi_lookback*context.timeframe, '1m').ffill().bfill()  

Here I use context.rsi_lookback*context.timeframe (14*60). But when I take 14*60*10 or 14*60*20, I get different results.
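That is expected: talib's RSI uses Wilder's recursive smoothing, so the value on the last bar depends on every bar in the window, not just the most recent 14. Short windows are dominated by the initial seed average and disagree with each other; as the window grows, the values converge. A minimal sketch of that convergence (a random walk stands in for real resampled prices, and talib is assumed available):

    import numpy as np
    import talib

    # Hypothetical 60-minute closes.
    np.random.seed(0)
    closes = 100 + np.cumsum(np.random.randn(2000))

    # RSI of the same final bar, computed from windows of different
    # lengths; longer windows converge toward a stable value.
    for bars in (15, 140, 280, 1400):
        rsi = talib.RSI(closes[-bars:], timeperiod=14)
        print(bars, round(rsi[-1], 4))

So more history should give a more exact figure, with diminishing returns; the differences between 14*60, 14*60*10 and 14*60*20 should shrink as the multiplier grows.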