Multicollinearity And Multiple Linear Regression Cheat Sheet

Hey everybody, I'm pretty new to Quantopian as well as Python and have spent the past two months immersing myself in both worlds. I've been coding for > 25 years, but my math skillz are rusty, so I urgently had to refresh my knowledge of linear regression and related topics. I decided to extract some code shown in various LR-related lectures and turn it into my own little linear regression cheat sheet, presented here. It also covers multiple independent series, and it demonstrates multicollinearity. Perhaps it'll help clear up all those mystic variables you encountered in linear regression but never dared to ask about ;-)

P.S.: If you see any errors or have any suggestions don't be shy and point them out in the comment section. Thanks in advance.
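Since the attached notebook is not reproduced in the thread text, here is a minimal sketch of what multicollinearity looks like numerically. It is not taken from the cheat sheet; the function names and the data are made up for illustration. With exactly two regressors, the variance inflation factor (VIF) of each one reduces to 1 / (1 - r^2), where r is their Pearson correlation, so a near-duplicate regressor shows up as a very large VIF.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def vif_two_regressors(x1, x2):
    """VIF for a model with exactly two regressors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical data: x2_collinear is roughly 2 * x1, x2_unrelated is not.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2_collinear = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x2_unrelated = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0]

print(vif_two_regressors(x1, x2_collinear))   # very large
print(vif_two_regressors(x1, x2_unrelated))   # close to 1
```

A VIF near 1 means the regressor's coefficient variance is barely inflated by the other regressor; a large value means the two series carry nearly the same information and their individual coefficients become unstable.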

10 responses

Thanks M8

So alpha and beta in the stock market (and backtester) are linear regression (or OLS, ordinary least squares) alpha and beta, fed particular inputs?

Blue, I don't understand your question - can you elaborate?

The background is a little complex ...

  1. The terms alpha and beta along with other greek letters are used all over the place in engineering and science. For example, most have heard of alpha and beta radiation, and gamma.
  2. Linear regression (with its alpha and beta) is essentially a trend line, and yet more than that. Regression? Good question. It's no secret that statisticians like to come up with new terms; check out this glossary of statistics terms if you don't believe it. "Regression" was a new term back in the Victorian era, coined by Darwin's cousin. They made an exception with the generic, oft-used terms alpha and beta. I had grown accustomed to understanding beta as the slope of the trend line, since I was seeing it used that way everywhere (with one exception). But now I'm starting to get that it isn't always that kind of slope; it can be other things depending on the inputs. It is the slope of a trend line only if the set being compared to is a fixed index such as time, I think. Not with two varying sets, as in collinearity perhaps.
  3. One often hears about stock market alpha, there's AlphaLens developed by Quantopian, Seeking Alpha, alpha factors etc. Everywhere alpha is always considered a good thing, if not the main thing. And yet the explanations may put you to sleep if you weren't born on Wall Street or have enough reference points of knowledge already. For example:

"Alpha measures the difference between a fund's actual returns and its expected performance, given its level of risk (as measured by beta). A positive alpha figure indicates the fund has performed better than its beta would predict. In contrast, a negative alpha indicates a fund has underperformed, given the expectations established by the " ...

.... blah blah yadda etc. Tough without experience already. Don't get me wrong folks, I'm not looking for anyone to give me their understanding of alpha. I'm asking very specifically whether the two alphas are the same, as it would now be astounding to me if they are not.

It finally dawned on me that the terms alpha and beta in the stock market were not coined out of nerdy lack of vocabulary and instead are precisely the same as alpha and beta in linear regression for particular inputs. In some cases beta is not slope at all (or is it still?) like in returns vs SPY for beta. For alpha, returns vs tbills or something like that, possibly the alpha straight from linear regression. Those alpha and beta might be the same, which is nice if so, as that would make sense.

Meanwhile, ran across this which I suppose sort of tends to confirm it (or is at least in the same ball park) though any extra clarification welcomed:

"the beta you get from Sharpe's derivation of equilibrium prices is essentially the same beta you get from doing a least-squares regression against the data. (Also note that alpha and beta are standard symbols that statisticians use all the time for this type of regression; Sharpe and his followers weren't trying to be obscure, as some people like to believe.) "

If anyone is interested in understanding linear regression, the notebook above is helpful and the following is not to be missed, both visual and interactive as well: http://setosa.io/ev/ordinary-least-squares-regression/
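To make the question above concrete, here is a small sketch in plain Python (no statsmodels needed) using the closed-form OLS solution for one regressor plus an intercept. The function name `ols_line` and all numbers are made up for illustration. The same fit gives a trend-line slope when x is a time index and a market beta when x is the market's returns, so beta is a slope in both cases; only the x-axis changes.

```python
def ols_line(x, y):
    """Single-regressor OLS with intercept: (alpha, beta) for y = alpha + beta * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    beta = sxy / sxx
    alpha = y_mean - beta * x_mean
    return alpha, beta

# Case 1: x is a time index ending at 0, so beta is the slope of the price
# trend line and alpha is the fitted price at the most recent bar.
prices = [10.0, 11.0, 12.0, 13.0]              # hypothetical closes
time_index = list(range(-len(prices) + 1, 1))  # [-3, -2, -1, 0]
trend_alpha, trend_beta = ols_line(time_index, prices)

# Case 2: x is the market's returns, so beta is market beta and alpha is
# the part of the asset's return the market does not explain.
market = [0.010, -0.020, 0.015, 0.005, -0.010]  # hypothetical returns
asset = [0.001 + 1.5 * m for m in market]       # built with alpha=0.001, beta=1.5
mkt_alpha, mkt_beta = ols_line(market, asset)

print(trend_alpha, trend_beta)
print(mkt_alpha, mkt_beta)
```

Because the asset series in case 2 is constructed exactly from the market series, the fit recovers the alpha and beta it was built with, up to floating-point rounding.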

@Blue - the terms alpha and beta may arguably be overused in engineering (disclaimer - I used to be an engineer) but in finance they are very commonly used terms. I have beefed up the multicollinearity section in my sheet a bit and also added a hedging example I have been playing with.

The basic idea is that as a trader you are looking for an 'edge' and it is preferably one that is independent of the market (in most cases the S&P). So let's say you decide that FSLR somehow has meaning to your system, but it can also be some sort of indicator or fundamental measure. The first thing you probably want to do is to figure out if it's really FSLR (or your indicator's values) that's giving you that edge (a.k.a. alpha) or if it is in part or even perhaps mostly the market, which as you may have guessed is referred to as beta. In other words:

net alpha = raw alpha - beta * market
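One way to read the formula above, period by period, is hedged return = asset return - beta * market return. The sketch below is an illustration under that assumption, with made-up return series and a hypothetical helper `beta_to_market`; it is not the code from the cheat sheet. It estimates beta in-sample and checks that the hedged series no longer has any beta to the market.

```python
def beta_to_market(asset, market):
    """OLS slope of asset returns on market returns: cov(a, m) / var(m)."""
    n = len(market)
    m_mean = sum(market) / n
    a_mean = sum(asset) / n
    cov = sum((m - m_mean) * (a - a_mean) for m, a in zip(market, asset))
    var = sum((m - m_mean) ** 2 for m in market)
    return cov / var

# Hypothetical per-period returns: a small "edge" plus 1.2x market exposure.
market = [0.010, -0.020, 0.015, 0.005, -0.010, 0.020]
edge = [0.002, 0.001, 0.003, 0.000, 0.002, 0.001]
asset = [e + 1.2 * m for e, m in zip(edge, market)]

beta = beta_to_market(asset, market)

# Subtract the market-driven part of each period's return.
hedged = [a - beta * m for a, m in zip(asset, market)]

# The hedged series has (numerically) zero in-sample beta to the market.
residual_beta = beta_to_market(hedged, market)
print(beta, residual_beta)
```

The estimated beta will not be exactly 1.2, because the small edge series happens to co-move slightly with the market in this sample. In live trading the beta would also have to be estimated on past data only, which this sketch ignores.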

You can, for example, obtain the slope (beta) of a list of prices using linear regression like this, where the data points are regressed against a steadily increasing time index plus a constant term. params has two elements: params[0] is the intercept (alpha) and params[-1] is the slope (beta). With a time index as the regressor, alpha is probably not very interesting.

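import numpy as np
import statsmodels.api as sm

def slope(in_):
    # Regress the values on the time index -(n-1) .. 0 plus a constant column.
    x = sm.add_constant(np.arange(-len(in_) + 1, 1))
    # np.asarray() makes params a plain array, so params[-1] is positional.
    return sm.OLS(np.asarray(in_), x).fit().params[-1]  # slope, regression beta

# Inside handle_data or a scheduled function, where `data` and `stock` exist:
slp = slope(data.history(stock, 'close', 60, '1m').dropna())    # Minutes, note dropna(), important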

You can even obtain the slope as a pipeline factor for all the stocks you are screening at once, because statsmodels accepts the 2-D ndarray that pipeline provides. You would still need to screen the result for NaNs at some point, for example in before_trading_start:

def make_pipeline(context):  
    pipe = Pipeline()

    slw  = Slope(window_length=300, mask=Q1500US())  
    fst  = Slope(window_length=20 , mask=Q1500US())  
    [...]

class Slope(CustomFactor):  
    inputs = [USEquityPricing.close]  
    def compute(self, today, assets, out, closes):  
        out[:] = slope(closes)

def slope(in_):     # Return slope of regression line.  
    return sm.OLS(in_, sm.add_constant(range(-len(in_) + 1, 1))).fit().params[-1]  # slope  

^^ I fixed a few errors just now and have re-attached the latest version here.

I'm mainly interested in trading futures actually. Does the Pipeline work with the futures calendar now?

Also, has anyone implemented a simple indicator as a filter for the Pipeline? I'm thinking of filtering out all the instances when, e.g., the symbol's price was not trading below the lower Bollinger Band (BB) or above the upper BB, then feeding what remains into the Pipeline to see what I get. The Pipeline examples I have seen thus far are all geared toward managing a stock portfolio separated into buys and sells. I want to use the Pipeline for single symbols or perhaps pair trades.
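Not an answer for Pipeline itself, but the condition described here can be sketched in plain Python, outside of any Quantopian API. The function name `outside_bands`, the 20-bar window, the 2-standard-deviation width, and the use of the population standard deviation are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def outside_bands(closes, window=20, num_std=2.0):
    """Return indices of bars whose close is below the lower or above the upper band.

    Bands for bar i are computed from the `window` closes ending at bar i:
    simple moving average +/- num_std population standard deviations.
    """
    hits = []
    for i in range(window - 1, len(closes)):
        recent = closes[i - window + 1 : i + 1]
        mid = mean(recent)
        width = num_std * pstdev(recent)
        if closes[i] < mid - width or closes[i] > mid + width:
            hits.append(i)
    return hits

# Hypothetical data: a flat series with one sharp drop and one sharp jump.
closes = [100.0] * 25 + [90.0] + [100.0] * 25 + [110.0]
print(outside_bands(closes))  # [25, 51]
```

Only the two bars that break out of the bands survive the filter; every bar trading inside the bands is dropped, which matches the "keep only the extremes" idea in the question.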

Yes, alpha and beta of the market are straight from linear regression:

Quantopian Summer Lecture: The Art of Not Following the Market
https://www.youtube.com/watch?v=Af0l3TQJ3h8&feature=youtu.be&t=22m14s

^^^ Yes, I incorporated some of this into my sheet. This was actually a key lecture for me and surprisingly I don't see hedging against market beta applied very often in strategies.