Epic fail for stat-arb

Kaboom!

Looking at this fantastic blow up, it seems like I should be able to just invert all my orders and have a great winner of an algorithm. Yet, I'm finding it surprisingly difficult to do that! Still, I keep thinking that reliably bad is just an inverted winner, and there must be a way to take this behavior and profit from it, especially given the steady decline. Any advice?

What I did to create this

After I tried the arbitrage between the gold commodity ETF and the gold mining ETF in this shared algo, I did more research on the underlying fundamental investment. I read this article on Motley Fool, which described how miners will hedge their commodity exposure, so their earnings do not exactly follow the fluctuations in gold. I thought this could cause the GDX ETF (miners) to be less correlated with gold than a single gold miner would be. I settled on Royal Gold, primarily due to its relatively high trading volume.

I also modified the algorithm to reduce its positions whenever the R squared fit of the model fell below 0.9.
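A minimal sketch of that guard in plain Python (`target_positions` is a hypothetical helper invented for illustration, not the original algorithm's code):

```python
# Hedged sketch of the position-sizing guard described above.
# `target_positions` is a hypothetical helper, not Quantopian API.
R2_MIN = 0.9   # minimum acceptable model fit before trading the spread

def target_positions(rsquared, zscore, base_size=100):
    """Return target (shares_stock1, shares_stock2) for one bar."""
    if rsquared < R2_MIN:
        return 0, 0                    # fit too weak: reduce / stand aside
    if zscore > 2.0:                   # spread unusually wide: short it
        return -base_size, base_size
    if zscore < -2.0:                  # spread unusually narrow: buy it
        return base_size, -base_size
    return 0, 0                        # inside the band: no new bets
```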

backtest results

(the original backtest has been deprecated and replaced with an image of the performance)

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

23 responses

I tried to just naively switch buy/sell orders but it did just as badly.


@Thomas, that was my first guess too. That amounts to betting that after the spread is more than 2 standard deviations wide, it will get wider still, so I would intuitively expect that to fare poorly too.

@fawce, I wonder how much of your losses are derived from depreciation of your portfolio and how much are simply from frictional costs associated with opening and closing your positions. If I understand your algo correctly, I think if the spread fluctuated around 2.0 or -2.0 standard deviations, you would end up buying and re-selling the same chunks of stock over and over. That would also explain why inverting the bets doesn't change your performance, since you're doing the same fluctuation except inverted.


@Scott
I was worried about the same thing; it seems like I could test it by varying the standard deviation range for buying/selling. My worry is that there is a trend for the spread to continuously widen, which would tend to put the spread at a large delta from the trailing mean. If that is the case, I'm not sure how to make this profitable without just betting that the spread will widen.

@fawce
.9 also seems like a really high minimum on your r^2 value for reducing your position. I wouldn't be surprised if you're reducing your position almost every time you buy or sell the spread (which would have the same effect as the price fluctuating around the stddev window).

Also, if your r^2 is less than .9 but the absolute value of your zscore is greater than 2.0, I think you'll end up placing opposite orders within the same call to handle_data, which will further your losses to frictional costs.
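The two behaviors can be reproduced in a few lines (hypothetical function names, not the actual algo):

```python
# Minimal reproduction of the if/if vs if/elif bug: with two independent
# `if`s, a bar with low r^2 AND |zscore| > 2 fires both the unwind and a
# new spread trade in the same handle_data call.
def orders_buggy(rsquared, zscore):
    orders = []
    if rsquared < 0.9:
        orders.append("reduce")
    if abs(zscore) > 2.0:
        orders.append("trade_spread")   # fires even when the fit is poor
    return orders

def orders_fixed(rsquared, zscore):
    orders = []
    if rsquared < 0.9:
        orders.append("reduce")
    elif abs(zscore) > 2.0:
        orders.append("trade_spread")   # only when the fit is acceptable
    return orders
```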

@Scott, great catch, that should have been an elif - I just fixed it and I'm re-running...

I threw some print statements into your algo, and it looks like your r^2 values are actually coming out negative. While applied statistics is not my area of expertise, I'm pretty sure this is bad news for the predictive power of your model. Quoting from Wikipedia:

"Important cases where the computational definition of R2 can yield negative values, depending on the definition used, arise where the predictions which are being compared to the corresponding outcomes have not been derived from a model-fitting procedure using those data, and where linear regression is conducted without including an intercept. Additionally, negative values of R2 may occur when fitting non-linear trends to data. In these instances, the mean of the data provides a fit to the data that is superior to that of the trend under this goodness of fit analysis."

@Scott
Ah, now this makes a lot of sense. I fixed the if statement on line 170 to be elif, and the algorithm never trades! The r^2 guard in the algorithm was meant to protect against the condition you're describing above - where the assumed relationship between the two instruments proves to be too weak to trade. But with the bug, it was trading both sides of the guard and steadily draining capital in transaction costs.

Oh well, I guess that's what backtesting is for...

@fawce
I think the issue was (at least in part) with your use of statsmodels to build the linear regression. The OLS model is really meant for more generalized (i.e., multidimensional) correlations, and I think it assumes you've done some amount of normalization on your data so that it doesn't have to account for constant factors. (This is all a bit speculative. I went and hunted through the statsmodels API reference a bit and found it pretty impenetrable. I think I'd have to sit down with an applied stats book for a while to properly figure out what's going on.) At any rate, since all we're doing is regressing two different series, a much simpler tool is scipy.stats.linregress. I ran that against the same data and got much more sensible results:

2012-01-17 14:31:00 PRINT rsquared 0.018384, beta -0.095647

2012-01-30 14:31:00 PRINT rsquared 0.307240, beta 0.339751

hmmm...so I just ran a long backtest with the same strategy, but using the following to calculate the relevant stats metrics:

from scipy import stats

# regress the two price series in the rolling window against each other
p1 = [x.price1 for x in self.ticks]
p2 = [x.price2 for x in self.ticks]
gradient, intercept, r_value, p_value, std_err = stats.linregress(p1, p2)
self.rsquared = r_value ** 2
self.beta = gradient

This should calculate a linear regression between the price of p1 and p2, such that in the model:

p2 = gradient * p1 + intercept

It seems like the rsquared of the model fluctuates wildly as you recalculate. There are periods in 2006 where you have decent correlation (r^2 hanging out around .6 or higher) and other periods where it's less than .01, which would mean that the two sids are moving essentially at random relative to one another. I think improving on this algo would require figuring out a more reliable way to determine when the cointegration between the sids breaks down.

Instead of looking at the R^2 you can also look at the p-value, which tells you how likely it would be to get this coefficient if the two series were totally uncorrelated. Often p < .05 is considered significant (i.e., the probability of obtaining these results as a fluke is less than 5%).
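For example, with scipy.stats.linregress on made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=500)

# Genuinely related series: tiny p-value, substantial r^2.
y_related = 2.0 * x + 0.1 * rng.normal(size=500)
rel = stats.linregress(x, y_related)

# Independent series: r^2 near zero, p-value typically large.
y_noise = rng.normal(size=500)
indep = stats.linregress(x, y_noise)
```

Here `rel.pvalue` is effectively zero while `indep.rvalue ** 2` is near zero, which is the kind of separation the significance test gives you.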

@Thomas you wouldn't happen to know how to coax the p-value out of statsmodels, would you?

@Scott
I noticed another difference between this example and the gld/gdx version - here I am using a bet size of 100 shares, and in the gld/gdx version I am using 5 shares. I switched to 5 shares here and got very different results - still not good, but not the grinding pit of despair pictured above. I think the larger bet size increases the slippage and transaction costs to a level that overwhelms the value of the arbitrage.

@fawce: I think you might want to add

p2 = sm.add_constant(p2)  

to add a constant factor of 1, otherwise you are not estimating an intercept (which is why @Scott's code should have produced something else, I think scipy is estimating an intercept by default). The reason is that statsmodels is estimating the general Y = X*beta + eps model. You want to estimate Y = X*beta + intercept + eps. Thus if you have a column of 1's in X it ends up estimating what you want.
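The column-of-ones trick can be shown with plain numpy least squares on made-up data; this mirrors what sm.add_constant sets up for statsmodels:

```python
import numpy as np

rng = np.random.default_rng(1)
p1 = rng.normal(size=200)
p2 = 3.0 + 2.0 * p1 + 0.01 * rng.normal(size=200)   # true alpha=3, beta=2

# Without a constant column the fit is forced through the origin:
slope_no_const = np.linalg.lstsq(p1[:, None], p2, rcond=None)[0][0]

# Appending a column of ones (the add_constant idea) also estimates
# the intercept, recovering alpha ~ 3 and beta ~ 2:
X = np.column_stack([np.ones_like(p1), p1])
alpha, beta = np.linalg.lstsq(X, p2, rcond=None)[0]
```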

To get the p-value you have to compute a Student t-test. After line 103 you can add:

pvalue = results.t_test([0,1]).pvalue  
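In the two-series case the same two-sided Student t-test can be reproduced by hand with scipy; as far as I know, this is exactly how linregress computes its p-value (t = slope / stderr, with n - 2 degrees of freedom):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

res = stats.linregress(x, y)

# Two-sided t-test on the slope: t = beta / SE(beta), df = n - 2
t_stat = res.slope / res.stderr
p_manual = 2.0 * stats.t.sf(abs(t_stat), df=len(x) - 2)
```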

@fawce: that could also explain why you sometimes get negative R squared values.

@fawce there are two obvious improvements to the OLS you're using: (i) account for the intercept in the spread, (ii) use a symmetric version of the OLS known as TLS. It's detailed here in R - http://quanttrader.info/public/testForCoint.html
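For (ii), a minimal TLS (orthogonal regression) sketch using the SVD; `tls_slope` is an assumed helper name, but the technique is standard: the normal to the best-fit line is the right singular vector for the smallest singular value of the centered data. Unlike OLS, the fit is symmetric in the two series:

```python
import numpy as np

def tls_slope(x, y):
    """Total least squares slope: minimizes perpendicular distances.

    Assumes the best-fit line is not vertical.
    """
    M = np.column_stack([x - np.mean(x), y - np.mean(y)])
    # Right singular vector for the smallest singular value is the
    # normal (n0, n1) to the best-fit line through the centered cloud,
    # so the line is n0*x' + n1*y' = 0, i.e. slope = -n0 / n1.
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    n0, n1 = vt[-1]
    return -n0 / n1
```

For points on y = 2x + 1 this returns 2, and swapping the arguments returns 1/2 — the reciprocal symmetry that plain OLS lacks.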

@J.J. Thanks! The issue of the intercept has been raised again in this thread. I can see mechanically how to account for the intercept in the spread, but I'm not grasping intuitively the meaning. Any chance you could take a moment and try to provide an explanation? Thanks for the link.

Also, just wanted to provide a link to the ever-reliable Wes McKinney's statsmodels demo. Wes built an IPython notebook filled with examples, starting with none other than OLS with its intercept.

@J.J. looks like @Thomas took your advice and added an intercept over here.

@fawce
Would say that the intuition is that if you omit the intercept you're systematically over- or underestimating the spread. To illustrate, let's say the actual model is:
spread = p1 - beta*p2 - alpha
By assuming alpha = 0, you overestimate the spread when alpha > 0 and underestimate it when alpha < 0.
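A tiny numeric illustration of that bias (made-up prices):

```python
import numpy as np

# If the true relation is p1 = alpha + beta * p2 with alpha > 0,
# dropping alpha shifts every spread observation up by exactly alpha.
alpha, beta = 5.0, 1.5
p2 = np.array([40.0, 41.0, 39.0, 42.0])
p1 = alpha + beta * p2

spread_correct = p1 - beta * p2 - alpha   # zeros: model fully explains p1
spread_no_alpha = p1 - beta * p2          # constant bias of +alpha
```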

There are other things you can do to optimise the spread / zscore:
- using logarithms
- using different window sizes
- using volatility decay - weigh recent history more heavily

I tried logs by cloning your code but somehow it didn't work very well :) It works in my own implementation though...

Had another question for you - what's the meaning / difference here:
42. spread = price1 - price2
60. self.spread = data[self.stock1].price - self.beta * data[self.stock2].price

@JJ the spread on line 42 is actually not used. Originally, I was calculating the moving average and stddev inside handle_add, and I reorganized the code to do it in the OLSWindow's handle_data method instead.

I believe the discrepancy in calculation of the spread is a bug - line 60 is the correct calculation, where the spread is being adjusted by the beta from the OLS calculated on the time series from the prices.

For the log, are you taking the log of the prices?

yes, spread = ln(p1) - ln(p2) = ln(p1/p2) is one way of doing it
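A tiny illustration with made-up prices, showing the two log-spread formulations are identical:

```python
import numpy as np

p1 = np.array([100.0, 102.0, 101.0, 105.0])
p2 = np.array([50.0, 51.0, 50.5, 53.0])

# ln(p1) - ln(p2) == ln(p1 / p2)
spread = np.log(p1) - np.log(p2)

# z-score of the log spread over the window
z = (spread - spread.mean()) / spread.std()
```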