Building a Better Beta

Back to Community

edited

Today we're announcing the release of SimpleBeta, a new Pipeline API factor that makes it easy to efficiently calculate market beta for all stocks in a trading universe.

There are many reasons why an algorithm might be interested in having faster access to rolling beta calculations, but one important reason right now is that the upcoming new contest rules will require algorithms to maintain low market beta.

In the attached notebook, we provide an introduction to "market beta" and explain how you can use the new SimpleBeta factor to calculate market beta much more efficiently than was previously possible using Quantopian's built-in APIs.

A few highlights to whet your appetite:

SimpleBeta is roughly 135x faster than RollingLinearRegressionOfReturns for 1-year beta calculations. This speedup gets more significant for shorter periods (e.g. 90-day beta), and less significant for longer periods, but it's generally on the order of 100x or more.
SimpleBeta robustly handles missing inputs, allowing it to reduce the number of NaN beta values it produces by 50% compared to RollingLinearRegressionOfReturns.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

30 responses

Leo M

Many thanks Scott. This new SimpleBeta factor has made my backtests insanely fast. I am seeing like an order of magnitude speedup in my 10+ year backtests.

Guy Fleury

@Scott, thanks, very helpful. Great notebook. Impressive work.

Scott Sanderson

It looks like a few of the LaTeX equations got mangled when I posted this notebook to the forums (we're looking into a bug in our code that exports forum-posted notebooks) . Anywhere in the equations that you see *sample, that should be "sample" subscripted instead.

Disclaimer

Scott Sanderson

@Guy @Leo glad you found this useful!

Disclaimer

Burrito Dan

Scott, this is about 10% faster, but not as readable:

assets -= assets.mean(axis=0)  
spy    -= spy.mean(axis=0)  
return np.dot(assets.T,spy) / np.dot(spy.T,spy)

You can also remove the new axis:

SPY_column = rets[SPY].values #[:, np.newaxis]

Scott Sanderson

@Dan. Good call noticing that (a * b).sum() is just a.dot(b). I ended up simplifying the vectorized_beta implementation in the post a bit in the interest of clarity, though I actually didn't think about just using .dot! In the real implementation in zipline, we're not using dotbecause there's no nan-aware version of it in numpy (at least not one that I'm aware of).

For what it's worth, I think this is the fastest pure vectorized beta implementation I can muster without resorting to Cython or C:

def fastest_vectorized_beta_I_can_muster(spy, assets):  
    # Allocate one len(assets) array and fill it initially with the  
    # column-means of assets.  We'll re-use this buffer several times in the  
    # course of this function.  
    buf = assets.mean(axis=0)

    # Subtract the means from each asset in-place.  
    # Note: This mutates the input in place, so don't do this if the caller  
    #       expects to use `assets` again!  
    np.subtract(assets, buf, out=assets)

    # Overwrite the output of the "covariance" dot product into `buf`.  
    spy.dot(assets, out=buf)

    # Overwrite the output of the division into `buf` again.  
    np.divide(buf, spy.dot(spy), out=buf)

    # buf now holds our expected output.  
    return buf

On my laptop for a 504-day lookback with 1000 assets, this is about 3 times faster than the vectorized_beta function in the notebook:

In [82]: spy.shape  
Out[82]: (504,)

In [83]: assets.shape  
Out[83]: (504, 1000)

In [84]: spy2d = spy[:, np.newaxis]; spy2d.shape  
Out[84]: (504, 1)

In [85]: def vectorized_beta(spy, assets):  # This is the version from the notebook.  
    ...:     asset_residuals = assets - assets.mean(axis=0)  
    ...:     spy_residuals = spy - spy.mean()  
    ...:  
    ...:     covariances = (asset_residuals * spy_residuals).sum(axis=0)  
    ...:     spy_variance = (spy_residuals ** 2).sum()  
    ...:     return covariances / spy_variance  
    ...:

In [86]: %timeit -n500 vectorized_beta(spy2d, assets)  
500 loops, best of 3: 2.81 ms per loop

In [87]: def fastest_vectorized_beta_I_can_muster(spy, assets):  
    ...:     buf = assets.mean(axis=0)  
    ...:     np.subtract(assets, buf, out=assets)  
    ...:     spy.dot(assets, out=buf)  
    ...:     np.divide(buf, spy.dot(spy), out=buf)  
    ...:     return buf  
    ...:

In [88]: %timeit -n500 fastest_vectorized_beta_I_can_muster(spy, assets)  
500 loops, best of 3: 936 µs per loop

I'm not sure whether fastest_vectorized_beta_I_can_muster gets most of its speedup just from the cost of extra allocations, or if it's because we're getting better cache locality. I suspect it's a mix of both.

One thing I'd like to do in a future update would be to add a fast path to SimpleBeta that uses something like the above if you pass allowed_missing_percentage=0. The current implementation pays a steeper performance cost than I'd like in exchange for handling missing data robustly. We can probably claw some of that back my moving the implementation to Cython (which would let us remove at least one large allocation), but ultimately handling nans correctly requires a branch per array element, which is a real cost at this level of abstraction.

Disclaimer

Burrito Dan

THAT is some code-fu :)

Delaney won't like this, but you could break the maths for a significant speed up. You could replace the entire function with simply:

def vectorized_beta(spy, assets):  
    return np.dot(assets.T,spy) / np.dot(spy.T,spy)

This gives a huge speed up, for a very minor reduction in accuracy:

AAPL beta is [ 1.07453746]  
It took 6.81426151689e-05 seconds to calculate betas for 1755 stocks.

Demeaning doesn't have much impact on the calculation, as the variance is so much bigger than the mean. Only the first order terms matter.

def vectorized_beta(spy, assets):  
    asset_residuals = assets - assets.mean(axis=0)  
    spy_residuals = spy - spy.mean()

    covariances = (asset_residuals * spy_residuals).sum(axis=0)  
    spy_variance = (spy_residuals ** 2).sum()  
    return (covariances / spy_variance, np.dot(assets.T,spy) / np.dot(spy.T,spy))

SPY_column = rets[SPY].values[:, np.newaxis]  
betas, duration = timeit(vectorized_beta)(SPY_column, all_returns.values)  
pd.DataFrame(betas[0][:,np.newaxis] - betas[1]).hist(bins=100)

You can see the the range of absolute deviations from the exact beta is about 0.02 for the vast majority of stocks in this instance, i.e. about 2% of the mean beta of 1.

Scott Sanderson

Oh, actually, I think you can get almost all of the benefits of your speedup and still recover the original results! It turns out that you only need to de-mean one of the arrays to get the correct result (we actually take advantage of that fact in the zipline implementation).

In my implementation above, I'm de-meaning assets, but that's silly, since it's a 500 x 1000 array. We can get the same result by just de-meaning SPY (a 500 x 1 array) instead!

def fastest_vectorized_beta_I_can_muster_round_two_electric_boogaloo(spy, assets):  
    buf = np.empty(assets.shape[1])  
    # We only need to de-mean one of these arrays, and SPY is a lot less work...  
    spy -= spy.mean()  
    spy.dot(assets, out=buf)  
    np.divide(buf, spy.dot(spy), out=buf)  
    return buf

That gives another ~10x speedup for me, with no loss of precision.

Disclaimer

Burrito Dan

Awesome! I guess this is keeping only one of the two second order terms, and the other one is well behaved.

We're splitting atoms now, but you could:

def fastest_vectorized_beta_I_can_muster_round_two_electric_boogaloo(spy, assets):  
    spy -= spy.mean()  
    np.divide(spy, spy.dot(spy), out=spy)  
    return spy.dot(assets)

Luca

We're splitting atoms now
fastest_vectorized_beta_I_can_muster_round_two_electric_boogaloo

LOL

Burrito Dan

🎁 All I want for Christmas is an optimised inner loop.

Jean Bredeche

All I want for Christmas is an optimised inner loop.

I need that on a t-shirt.

Disclaimer

Ann

Whoah! Thank you!

I use beta heavily, and my hand coded beta (using np.cov) takes about 10x the time and 5x the memory, to get results that aren't different enough to matter.

Peter Harrington

Thanks for this post! What a wonderful Christmas present.

I was not a huge user of Beta as it was too slow, however I do use correlation and covariance often and they were slower as well. I have rewritten cov() and corr() using the ideas in this thread and the Zipline implementation of SimpleBeta. I have tested them for correctness and speed, much faster.

def fast_cov(m0, m1):  
    """Improving the speed of cov()"""  
    nan = np.nan  
    isnan = np.isnan  
    N, M = m0.shape  
    allowed_missing_count = int(0.25 * N)

    independent = np.where(  # shape: (N, M)  
        isnan(m0),  
        nan,  
        m1,  
    )  
    ind_residual = independent - nanmean(independent, axis=0)  
    covariances = nanmean(ind_residual * m0, axis=0)         

    nanlocs = isnan(independent).sum(axis=0) > allowed_missing_count  
    covariances[nanlocs] = nan  
    return covariances

Here is covariance:


def fast_corr(m0, m1):  
    """Improving the speed of correlation"""  
    nan = np.nan  
    isnan = np.isnan  
    N, M = m0.shape  
    out = np.full(M, nan)  
    allowed_missing_count = int(0.25 * N)

    independent = np.where(  # shape: (N, M)  
        isnan(m0),  
        nan,  
        m1,  
    )  
    ind_residual = independent - nanmean(independent, axis=0)  # shape: (N, M)  
    covariances = nanmean(ind_residual * m0, axis=0)  # shape: (M,)

    # corr(x,y) = cov(x,y)/std(x)/std(y)  
    std_v = nanstd(m0, axis=0)  # std(X)  could reuse ind_residual for possible speedup  
    np.divide(covariances, std_v, out=out)  
    std_v = nanstd(m1, axis=0)  # std(Y)  
    np.divide(out, std_v, out=out)

    # handle NaNs  
    nanlocs = isnan(independent).sum(axis=0) > allowed_missing_count  
    out[nanlocs] = nan  
    return out

Apart from some code reuse, let me know if there are any obvious ways I can speed this up. I hope someone else finds this code useful.

Grant Kiehne

Here's an example of usage of the new SimpleBeta in conjunction with the Optimize API. I have the beta constrained to +/-0.05, yet it is higher. Any idea why?

I'll post the tear sheets next.

Grant Kiehne

Tear sheets for backtest posted immediately above.

Blue Seahawk

Expected some negative values from SimpleBeta, seeing none.
Edit: None early, then later only a couple here and there occasionally out of 300 stocks. Weird.

    context.pipeline_data = pipeline_output('my_pipe')  
    if len(context.pipeline_data[context.pipeline_data.beta < 0]):  
        print context.pipeline_data[context.pipeline_data.beta < 0].tail()

James Villa

@Grant, I cloned your algo and commented out the beta constraint. Beta increased to 0.21 (still within limits) from your 0.18 with the beta constraint that you tried to control to +- 0.05. The effect however is that your returns increased and your drawdowns improved. This confirms what I have long suspected, that the beta constraint is unnecessary under the long/short dollar neutral framework. The reason being the stock universe generally moves with the market as represented by SPY and under the long/short framework, the longs cancels the shorts so it naturally mitigates getting a high beta. By putting in the beta constraint (which does not work as you have shown) Q is actually constraining the prospect of higher returns and better drawdowns (as shown by my revision). Head scratch!

Grant Kiehne

Thanks James -

I'd still like to understand better why the beta is so high when the constraint is applied. My understanding is that it is a hard constraint, so the optimization should fail if it is not met. Perhaps there is disconnect between how beta is calculated as a performance metric and what it actually is at the optimizer output point-in-time? I'm also wondering if orders not being filled or something else to do with the ordering could be at play here?

James Villa

@Grant, I just think it just doesn't work operationally otherwise it should have given you the "Infeasible constraint" error having not been able to control the beta constraint within the +- 0.05 as you have set.

Grant Kiehne

Hi Scott -

Shouldn't mask be allowed for SimpleBeta?

 There was a runtime error.  
TypeError: __new__() got an unexpected keyword argument 'mask'  
USER ALGORITHM:46, in make_pipeline  
beta = SimpleBeta(target=sid(8554),regression_length=260,mask=QTradableStocksUS())

James Villa

@Blue Seahawk, there is nothing weird about the results of your pipeline analysis of Simple Beta. Empirical studies show that the stock universe generally moves with the market as represented by SP500 or SPY, therefore the correlation is almost always positive. This is exactly why the beta constraint in the risk model is not necessary under the long/short market neutral strategy that Q employs because the Dollar Neutral constraint already takes care of this. Put simply, if the stock universe moves with SPY, the Dollar Neutral constraint which ensures that you have a fairly equal amount of long and shorts, the net effect is that it cancels each other out, thus the market neutral part of the strategy. I'm quite suprised that Q staff /researchers/strategists/Phds did not catch something very basic about this strategy early on. The title of the strategy says it all!

Luca

@Grant -

Perhaps there is disconnect between how beta is calculated as a
performance metric and what it actually is at the optimizer output
point-in-time?

I think that is the reason why the actual algorithm beta is outside the thresholds set in the constraints. The optimizer gives in output the weights the assets must have to make the total beta stays within the constraints. The algorithm builds a portfolio with those assets weights but after entering the positions the assets change their values and by the end of the trading day (assuming the algorithm rebalances at market open) the portfolio weights diverged from the optimizer output. The returns that happen after calling the optimizer are the ones used to compute portfolio beta and this is one of the reasons that explains the issue you are seeing. The other reason is that the optimizer should know the future assets Betas to compute the correct weights but this is impossible so we are introducing another small (?) error there. Maybe computing SimpleBeta on a shorter window length might help, but that wouldn't solve the big problem that the optimizer cannot know the future.

Grant Kiehne

Thanks Luca -

Yes, I suspect that's what is going on (absent a bug or mistake in implementation). The problem I see, though, is that as I understand, Q is aiming to fund beta ~ 0 algos preferentially; in the long-run, an algo with beta ~ +/- 0.3 will be a non-starter (I think?). If one takes the QTradableStocksUS as a base universe, SimpleBeta as the beta factor, and the Optimize API FactorExposure as the only tools to keep beta in check, then they are inadequate. If zero-beta is such a desirable characteristic in hedge funds, I wonder how it is done in the industry? It seems one would need to sort out how to do a better job than the Q implementation.

Grant Kiehne

@ Scott -

Any response to my question about masking SimpleBeta? Wouldn't this provide even more speed up, if the mask is smaller than the QTradableStocksUS, for example? Generally, if it is a Pipeline factor, shouldn't it support masking?

Luca

@Grant, I understand your point and I don't have a solution to offer. I was just trying to analyze the issue, even though we might eventually discover it's a bug in the implementation. Hopefully Q will come up with an interesting solution otherwise we have to take Beta into consideration during Alpha creation and don't rely only on the optimizer to remove the market exposure from our Alpha.

Grant Kiehne

Hi Scott,

I'm wondering if you should be concerned about regression dilution? You could be underestimating the beta by using OLS.

Grant Kiehne

Hi Scott -

When I use:

SimpleBeta(target=sid(8554),regression_length=5*260)

I get the error:


ValueError: NaN or Inf values provided to FactorExposure for argument 'loadings'.  
Rows/Columns with NaNs:  
  row=Equity(22651 [HRTX]) col='beta'  
  row=Equity(28966 [TXMD]) col='beta'  
  row=Equity(42689 [PBYI]) col='beta'  
  row=Equity(42735 [MACK]) col='beta'  
  row=Equity(42746 [GLOG]) col='beta'  
  ... (37 more)  
Rows/Columns with Infs:  
  None  
There was a runtime error on line 157.

There is no error with:

SimpleBeta(target=sid(8554),regression_length=260)

Grant Kiehne

The trick is to use:

beta = SimpleBeta(target=sid(8554),  
                      regression_length=5*260,  
                      allowed_missing_percentage=1.0  
                     )

However, the help page doesn't describe what it does when it encounters all NaNs or what it does with a security that has some NaNs for returns. But at least it doesn't barf.

Grant Kiehne

In case anyone is interested, you'll find the zipline code for SimpleBeta here:

https://github.com/quantopian/zipline/blob/master/zipline/pipeline/factors/statistical.py

Also, I contacted support, and masking is not supported since SimpleBeta already runs fast enough (I didn't realize that masking was just for pipeline speed up). I recommended that it be added for consistency with other built-in factors, but I gathered from the response that the answer is "no" it won't be added.

You've successfully submitted a support ticket.

Our support team will be in touch soon.