
Building a Better Beta

Today we're announcing the release of SimpleBeta, a new Pipeline API factor that makes it easy to efficiently calculate market beta for all stocks in a trading universe.

In this notebook, we introduce the idea of market beta and show how SimpleBeta improves upon previously-available tools for calculating beta in Pipeline.

Outline

  • Introduce market beta and explain why we might care about it in the context of a trading algorithm.
  • Show how we could previously use RollingLinearRegressionOfReturns to compute market beta for a large universe of assets.
  • Show how we can build a faster beta implementation from first principles as a CustomFactor.
  • Introduce the new SimpleBeta built-in factor, which is over 130x faster than RollingLinearRegressionOfReturns and can handle missing input data.

What is Beta and Why Do We Care?

When stock prices move, they tend to move in sync with the movements of the overall market. Not all stocks exhibit this tendency equally, however: some stocks have historically been more or less sensitive to the movements of the market.

One simple way to measure how much a stock tends to move with the rest of the market is to run a linear regression between the returns of the market and the returns of the stock in question. The slope of such a regression gives us a measure of how strongly the returns of an asset tend to be affected by the returns of the market as a whole.

Since linear regression equations are often written as:

$$ Y \sim \alpha + \beta X$$

(read: Y is distributed as alpha plus beta X) we often refer to the slope of the regression line between an asset's returns and the market's returns as the asset's "market beta".

Let's see how we would calculate beta for AAPL. One simple method is to use scipy.stats.linregress to run a linear regression on daily returns. In practice, we often use the returns of a broad-market ETF like SPY as a proxy for the returns of "the market".

In [1]:
from __future__ import print_function

import numpy as np
import pandas as pd
In [2]:
from quantopian.research import returns

START = pd.Timestamp('2010-01-04')
END = pd.Timestamp('2010-12-31')

# Load returns for AAPL (Apple Computer) and SPY (an ETF often used as a proxy for the market).
rets = returns(['AAPL', 'SPY'], START, END)
AAPL, SPY = rets.columns

# Show the first 5 rows.
rets.head()
Out[2]:
Equity(24 [AAPL]) Equity(8554 [SPY])
2010-01-04 00:00:00+00:00 0.015841 0.016872
2010-01-05 00:00:00+00:00 0.000934 0.002735
2010-01-06 00:00:00+00:00 -0.016046 0.000960
2010-01-07 00:00:00+00:00 -0.001707 0.004052
2010-01-08 00:00:00+00:00 0.006221 0.003232

linregress returns a namedtuple containing the slope, intercept, and several other useful statistics about the regression.

In [3]:
from scipy.stats import linregress
linregress(x=rets[SPY].values, y=rets[AAPL].values)
Out[3]:
LinregressResult(slope=1.0688251632444647, intercept=0.0011645457531190501, rvalue=0.71261796119874299, pvalue=2.3069534088276338e-40, stderr=0.066548758680318718)

This regression tells us that AAPL's beta to SPY was a little over 1.0 for the year of 2010. This shouldn't be terribly surprising: Apple is such a large company that its returns, all by themselves, account for a substantial portion of the overall market's return.

We can visualize the results of a regression using matplotlib and seaborn.

In [4]:
import matplotlib.pyplot as plt
import seaborn as sns

def regression_plot(data, size=11, legend_position='lower right'):
    """Calculate and plot a linear regression between two columns of data.
    
    Parameters
    ----------
    data : pd.DataFrame
        DataFrame with two columns to be regressed and plotted.
        The first column of the frame will be plotted on the y-axis.
        The second column of the frame will be plotted on the x-axis.
    size : int, optional
        Height and width of the resulting (square) figure, in inches.
    legend_position : str, optional
        Location on the plot for the legend.
        
    Returns
    -------
    ax : matplotlib.axes.Axes
        Axes containing the scatterplot and regression line.
    """
    ylabel, xlabel = data.columns

    # Draw scatterplot with marginals using seaborn.
    grid = sns.jointplot(x=data[xlabel], y=data[ylabel], 
                         size=size, joint_kws={'alpha': 0.6, 'linewidth': 0})
    
    # The plot generated by seaborn is composed of three matplotlib.Axes objects.
    # The scatter plot is drawn on ax_joint. Draw a regression line on top of the scatter plot.
    axes = grid.ax_joint
    draw_regression_line(Y=data[ylabel], X=data[xlabel], ax=axes)

    # Set X and Y Axis Bounds
    min_ = data.values.min() * 1.05  # Leave a margin at the edge of the figure.
    max_ = data.values.max() * 1.05
    axes.set_ylim(min_, max_)
    axes.set_xlim(min_, max_)

    # Draw legend.
    axes.legend(fontsize=13, loc=legend_position)
    
    # Set title on the overall figure.
    grid.fig.suptitle(
        '{} Returns vs {} Returns'.format(ylabel.symbol, xlabel.symbol),
        size=18, 
        y=1.02
    ) 
    
    # Set axis label sizes.
    axes.set_ylabel(ylabel.symbol, size=15)
    axes.set_xlabel(xlabel.symbol, size=15)

    return axes

def draw_regression_line(Y, X, ax=None):
    """Calculate and plot a linear regression line.

    Parameters
    ----------
    Y : np.array or pd.Series
        Y values for the regression.
    X : np.array or pd.Series
        X values for the regression.
    ax : matplotlib.Axes, optional
        Axes on which to draw the line.
        If not passed, default is ``matplotlib.pyplot.gca()``.
    """
    if ax is None:
        ax = plt.gca()

    result = linregress(x=X, y=Y)
    slope = result.slope
    intercept = result.intercept

    x_endpoints = np.array([X.min(), X.max()]) * 2
    y_endpoints = intercept + slope * x_endpoints

    # Format a regression equation in LaTeX.
    plot_label = "${} = {:.4g}{} {} {:.4g}$".format(
        r'\mathrm{' + Y.name.symbol + '}',
        slope,
        r'\mathrm{' + X.name.symbol + '}',
        '+' if intercept > 0 else '-',
        abs(intercept),
    )

    ax.plot(x_endpoints, y_endpoints, label=plot_label, linestyle='--', color='red', linewidth=2)
    return ax
In [5]:
regression_plot(rets);

Most stocks display at least some historical tendency to move along with the overall market, but different stocks can display different sensitivities at different times.

United States Steel (X), for example, has historically been more sensitive to the market than most stocks.

In [6]:
X = returns('X', START, END)
linregress(x=rets[SPY].values, y=X.values)
Out[6]:
LinregressResult(slope=1.9457293114742904, intercept=-0.00046346369353660131, rvalue=0.69485844565998056, pvalue=1.156540205721078e-37, stderr=0.12736012290975701)
In [7]:
regression_plot(returns(['X', 'SPY'], START, END));

McDonald's (MCD), on the other hand, has historically been less sensitive to changes in the market as a whole.

In [8]:
MCD = returns('MCD', START, END)
linregress(x=rets[SPY].values, y=MCD.values)
Out[8]:
LinregressResult(slope=0.56593257597506752, intercept=0.00064333935781934375, rvalue=0.66624361151729328, pvalue=1.0538531257566855e-33, stderr=0.040063203555931105)
In [9]:
regression_plot(returns(['MCD', 'SPY'], START, END));

Calculating Beta For All Stocks With Pipeline

In an algorithm that trades a large number of assets, it's often useful to know each stock's historical beta to the market. One way to try to control an algorithm's market beta, for example, is to keep the beta-weighted exposure of the algorithm's positions close to zero.
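
To make that concrete, here's a tiny sketch (with made-up weights and betas, not taken from this notebook) of what "beta-weighted exposure" means: it's the sum of each position's weight times its beta, and staying market neutral means keeping that sum close to zero.

import numpy as np

# Hypothetical portfolio weights (fractions of capital; positive = long, negative = short)
# and the betas of the corresponding holdings. These numbers are made up for illustration.
weights = np.array([0.30, 0.25, -0.25, -0.30])
betas = np.array([1.10, 0.60, 1.90, 0.55])

# The portfolio's market beta is the weighted sum of its holdings' betas.
portfolio_beta = np.dot(weights, betas)
print(portfolio_beta)  # about -0.16: this portfolio is slightly net short the market

# One simple hedge: add a SPY position (beta of roughly 1.0) sized to offset that exposure.
spy_hedge_weight = -portfolio_beta
print(np.dot(weights, betas) + spy_hedge_weight * 1.0)  # approximately 0.0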

For a long time, the Pipeline API has provided a built-in factor named RollingLinearRegressionOfReturns that could be used to calculate simple linear regressions for all assets in the Quantopian database.

The implementation of RollingLinearRegressionOfReturns essentially boils down to something like this:

In [10]:
from quantopian.pipeline.factors import CustomFactor, DailyReturns

class MyRollingLinearRegressionOfReturns(CustomFactor):
    outputs = ['alpha', 'beta', 'r_value', 'p_value', 'stderr']
    inputs = [DailyReturns(), DailyReturns()[symbols('SPY')]]
    
    def compute(self, today, assets, out, dependent, independent):
        alpha = out.alpha
        beta = out.beta
        r_value = out.r_value
        p_value = out.p_value
        stderr = out.stderr

        def regress(y, x):
            regr_results = linregress(y=y, x=x)
            alpha[i] = regr_results.intercept
            beta[i] = regr_results.slope
            r_value[i] = regr_results.rvalue
            p_value[i] = regr_results.pvalue
            stderr[i] = regr_results.stderr

        for i in range(len(out)):
            regress(y=dependent[:, i].ravel(), x=independent.ravel())

(The full source for RollingLinearRegressionOfReturns as of the time of this writing can be found in Zipline here.)

This class uses the multiple outputs feature of CustomFactor to simultaneously output all the values computed by scipy.stats.linregress every day.
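
Each name listed in outputs becomes an attribute of the resulting factor instance. As a small illustration (not a cell from the original notebook; the cell below uses only the beta output), a single regression computation can feed several pipeline columns at once:

from quantopian.pipeline import Pipeline

# Each named output of the multi-output factor is itself a Factor, so one
# regression can supply several columns of the same pipeline.
regression = MyRollingLinearRegressionOfReturns(window_length=252)
multi_output_pipeline = Pipeline({
    'alpha': regression.alpha,
    'beta': regression.beta,
    'r_value': regression.r_value,
})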

We can use this to calculate a rolling beta for every asset in Quantopian's database like this:

In [11]:
from quantopian.pipeline import Pipeline

regression = MyRollingLinearRegressionOfReturns(window_length=252)  # one-year lookback (252 trading days)
pipeline = Pipeline({'beta': regression.beta})

The computation graph for this pipeline looks like this:

In [12]:
pipeline.show_graph('png')
Out[12]:
In [13]:
from quantopian.research import run_pipeline

betas = run_pipeline(pipeline, END, END)
betas.head()
Out[13]:
beta
2010-12-31 00:00:00+00:00 Equity(2 [ARNC]) 1.556766
Equity(21 [AAME]) NaN
Equity(24 [AAPL]) 1.067143
Equity(25 [ARNC_PR]) NaN
Equity(31 [ABAX]) 1.188185

Unfortunately for RollingLinearRegressionOfReturns, looping over every column of input every day and calling scipy.stats.linregress each time isn't very efficient, especially if we only care about the slope of the regression.

Looping over numpy arrays in pure python is slow, and the implementation of linregress isn't designed with the assumption that it will be called a large number of times, so it does some expensive things like defining a new namedtuple subclass on every invocation.

linregress also calculates an intercept, a p-value, a correlation coefficient, and a standard error of the estimate (all relatively expensive computations), but those calculations are wasted if all we care about is the slope of the regression.

We can measure roughly how long it takes to calculate betas with RollingLinearRegressionOfReturns using time.time and a bit of cleverness:

In [14]:
import time
def timed(f):
    """Take a function f and return a **new function** that calls f and returns a pair
    containing f's result and how long it took to compute.
    """
    def timed_f(*args, **kwargs):
        start = time.time()
        result = f(*args, **kwargs)
        end = time.time()
        difference = end - start
        return result, difference
    return timed_f
In [15]:
# This cell should take about 45 seconds to execute.

result, duration = timed(run_pipeline)(pipeline, '2009-01-02', '2009-01-02')
print("It took {} seconds to calculate 252-day betas for 1 day:".format(duration))

result, duration = timed(run_pipeline)(pipeline, '2009-01-02', '2009-01-10')
print("It took {} seconds to calculate 252-day betas for 5 trading days:".format(duration))
It took 6.37987399101 seconds to calculate 252-day betas for 1 day:
It took 39.0593450069 seconds to calculate 252-day betas for 5 trading days:

The 6 seconds per day translates to about 25 minutes per backtest-year, which means that a two year backtest using RollingLinearRegressionOfReturns to calculate beta for all assets might spend over an hour just calculating betas!
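
Spelling out that arithmetic as a quick sanity check:

# Back-of-the-envelope check of the claim above: about 6 seconds per trading day,
# and 252 trading days in a year.
seconds_per_day = 6.0
trading_days_per_year = 252

minutes_per_year = seconds_per_day * trading_days_per_year / 60
print(minutes_per_year)  # 25.2 -- about 25 minutes of beta calculations per backtest-year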

Fortunately for us, we can speed up our beta calculations by writing them ourselves as vectorized numpy operations.

Speeding Up Beta Calculations

Let's review the mathematics of linear regression and see if we can find a faster way to implement our beta calculation.

The simplest and most common way to calculate the slope of a simple linear regression for an independent variable $X$ and dependent variable $Y$ is to use the formula:

$$\beta = \frac{Cov_{sample}(X, Y)}{Cov_{sample}(X, X)}$$

where $Cov_{sample}(X, Y)$ is the sample covariance of $X$ and $Y$.

(The $_{sample}$ subscript I'm using here is a bit of non-standard notation; it's there to emphasize that we're estimating covariance from a finite sample of observations.) Covariance is a statistical measure of how much a pair of random variables "move together". Sample covariance is a particular way of estimating the covariance of a distribution from a set of observations.

There are many subtleties to robustly estimating covariance (you can read about some of those subtleties in the Quantopian Lecture Series article on covariance estimation). For now, however, we're just going to focus on re-creating what scipy.stats.linregress does, but faster.

Given observations $x_1, x_2, \ldots, x_N$ drawn from $X$, and observations $y_1, y_2, \ldots, y_N$ drawn from $Y$, the formula for sample covariance is:

$$Cov_{sample}(X, Y) = \frac{1}{N - 1}\sum_{i=1}^{N}{(x_i - \bar{X})(y_i - \bar{Y})}$$

where $\bar{X}$ is the sample mean of $x_1, x_2, \ldots x_N$, and $\bar{Y}$ is the sample mean of $y_1, y_2, \ldots y_N$.

Here's one way of thinking about what this formula "says":

For each pair of $(x_i, y_i)$ observations, look at how much $x_i$ is above or below the average value of all the $x$'s and multiply that value by how much $y_i$ is above or below the average value of all the $y$'s.

  • If $x_i$ and $y_i$ are both above or both below their average, we'll get a positive value for $(x_i - \bar{X})(y_i - \bar{Y})$.
  • If $x_i$ and $y_i$ are on opposite sides of their averages, we'll get a negative value for $(x_i - \bar{X})(y_i - \bar{Y})$.

To calculate sample covariance, we take the average of all of these multiplied mean-deviation samples, except we divide by $N - 1$ instead of $N$ when calculating the mean to correct for a small bias introduced by the fact that we're estimating the population mean and covariance at the same time.

  • If above-average $x$ samples tend to correspond to above-average $y$ samples in our data, then we'll end up averaging mostly positive values, so we'll estimate a positive covariance.
  • If below-average $x$ samples tend to correspond to above-average $y$ samples, we'll end up averaging mostly negative values, so we'll estimate a negative covariance.
  • If there isn't any relationship between $x$'s being above their mean and $y$'s being above their mean, then we'll end up averaging a mix of positive and negative values, so we'll end up with a covariance near zero.

Since we're interested in measuring how the returns of each asset depend on the returns of the market, our independent variable $X$ will be the returns of SPY, and our dependent variable $Y$ will be the returns of the asset whose beta we're calculating.
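
As a quick cross-check (this snippet isn't part of the original notebook), np.cov computes sample covariances directly, and dividing the off-diagonal entry by the variance of SPY reproduces the linregress slope from earlier:

# np.cov returns the 2x2 sample covariance matrix of its inputs:
#   [[Cov(X, X), Cov(X, Y)],
#    [Cov(Y, X), Cov(Y, Y)]]
cov_matrix = np.cov(rets[SPY].values, rets[AAPL].values)

beta_from_covariance = cov_matrix[0, 1] / cov_matrix[0, 0]
print(beta_from_covariance)  # should match the linregress slope (~1.0688) computed above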

To get started, let's see if we can speed up our single-stock beta calculation by copying the formula above as closely as possible.

In [16]:
def timeit(f):
    """Take a function f and return a new function that calls 5000 times,
    returning the last result and the average runtime, ignoring the ten 
    slowest/fastest calls. This gives a more reliable estimate for fast functions.
    """
    _time=time.time
    def timed_f(*args, **kwargs):
        times = []
        for _ in range(5000):
            start = _time()
            result = f(*args, **kwargs)
            end = _time()
            times.append(end - start)
        # Average the middle calls (dropping the 10 fastest and 10 slowest)
        # to smooth out variance from cache effects and other noise.
        average_time = sum(sorted(times)[10:-10]) / (len(times) - 20)
        return result, average_time
    return timed_f
In [17]:
def beta_v0(spy, asset):
    """Calculate the slope of a linear regression for one asset with SPY.
    """
    asset_residuals = asset - asset.mean()
    spy_residuals = spy - spy.mean()
    
    covariance = (asset_residuals * spy_residuals).sum() / (len(asset) - 1)
    spy_variance = (spy_residuals ** 2).sum() / (len(asset) - 1)
    
    return covariance / spy_variance 
    
v0_result, v0_duration = timeit(beta_v0)(rets[SPY].values, rets[AAPL].values)
print("Our AAPL beta is {}. Calculation took {} seconds.".format(v0_result, v0_duration))

scipy_result, scipy_duration = timeit(linregress)(rets[SPY].values, rets[AAPL].values)
print("linregress AAPL beta is {.slope}. Calculation took {} seconds.".format(
    scipy_result, scipy_duration
))
print()
print("Our calculation was {} times faster than linregress.".format(scipy_duration / v0_duration))
Our AAPL beta is 1.06882516324. Calculation took 2.93323313855e-05 seconds.
linregress AAPL beta is 1.06882516324. Calculation took 0.000671482660684 seconds.

Our calculation was 22.8922362788 times faster than linregress.

So far so good. We can reproduce scipy's result exactly, and we've gotten a significant speedup by avoiding extra calculations we didn't need.

We can speed up our implementation a bit more by noticing that we're dividing both covariance and spy_variance by $N - 1$ even though the next thing we do is divide covariance by spy_variance. Since the divisions by $N - 1$ are going to end up canceling out, we don't need to bother with them in the first place.

In [18]:
def beta_v1(spy, asset):
    """Calculate the slope of a linear regression for one asset with SPY.
    """
    asset_residuals = asset - asset.mean()
    spy_residuals = spy - spy.mean()
    
    # These variable names aren't really correct anymore since we're
    # taking a sum instead of a mean, but I couldn't think of better names.
    covariance = (asset_residuals * spy_residuals).sum()  # Not dividing here anymore.
    spy_variance = (spy_residuals ** 2).sum()             # Not dividing here anymore.
    
    return covariance / spy_variance 

v1_result, v1_duration = timeit(beta_v1)(rets[SPY].values, rets[AAPL].values)
print("Our AAPL beta is still {}. Calculation took {} seconds.".format(v1_result, v1_duration))
print("Our calculation was {} times faster than linregress.".format(scipy_duration / v1_duration))
Our AAPL beta is still 1.06882516324. Calculation took 2.81569469406e-05 seconds.
Our calculation was 23.8478504825 times faster than linregress.

Once we have an efficient single-asset implementation of beta, we can adapt it to efficiently perform a vectorized regression against multiple assets using multi-dimensional arrays and numpy broadcasting.

In [19]:
def vectorized_beta(spy, assets):
    """Calculate beta between every column of ``assets`` and ``spy``.
    
    Parameters
    ----------
    spy : np.array
        An (n x 1) array of returns for SPY.
    assets : np.array
        An (n x m) array of returns for m assets.
    """
    assert len(spy.shape) == 2 and spy.shape[1] == 1, "Expected a column vector for spy."

    asset_residuals = assets - assets.mean(axis=0)
    spy_residuals = spy - spy.mean()

    covariances = (asset_residuals * spy_residuals).sum(axis=0)
    spy_variance = (spy_residuals ** 2).sum()
    return covariances / spy_variance
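
As a sanity check (again, not part of the original notebook), we can confirm on a small random dataset that the broadcasted computation agrees with calling linregress once per column:

# On random data, the vectorized computation should agree with running
# linregress separately for each column of "asset" returns.
rng = np.random.RandomState(42)
fake_spy = rng.randn(252, 1)        # (n x 1) column of "market" returns
fake_assets = rng.randn(252, 10)    # (n x m) returns for 10 made-up assets

vectorized = vectorized_beta(fake_spy, fake_assets)
looped = np.array([
    linregress(x=fake_spy.ravel(), y=fake_assets[:, j]).slope
    for j in range(fake_assets.shape[1])
])
print(np.allclose(vectorized, looped))  # should print True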

Let's see how our vectorized beta handles calculating beta for every asset in the QTradableStocksUS.

In [20]:
from quantopian.pipeline.experimental import QTradableStocksUS

def get_qtradable_for(day):
    """Get the QTradableStocksUS for a single day.
    """
    pipe = Pipeline({}, screen=QTradableStocksUS())
    result = run_pipeline(pipe, day, day)
    return result.index.get_level_values(1)

universe = get_qtradable_for(END)
print(universe[:5])
Index([ Equity(2 [ARNC]), Equity(24 [AAPL]), Equity(41 [ARCB]),
        Equity(52 [ABM]),  Equity(62 [ABT])],
      dtype='object')
In [21]:
all_returns = returns(universe, START, END).dropna(how='any', axis=1)
all_returns.head()
Out[21]:
Equity(2 [ARNC]) Equity(24 [AAPL]) Equity(41 [ARCB]) Equity(52 [ABM]) Equity(62 [ABT]) Equity(64 [ABX]) Equity(67 [ADSK]) Equity(76 [TAP]) Equity(88 [ACI]) Equity(107 [ACV]) ... Equity(38936 [DG]) Equity(38944 [RUE]) Equity(38965 [FTNT]) Equity(38971 [CLD]) Equity(38989 [AOL]) Equity(39047 [PEB]) Equity(39053 [CIT]) Equity(39073 [CIE]) Equity(39079 [KRA]) Equity(39111 [PPC])
2010-01-04 00:00:00+00:00 0.032252 0.015841 -0.030593 0.029578 0.010002 0.024628 0.010232 0.018168 0.055314 0.020497 ... 0.026304 -0.017475 0.024474 0.019231 0.023555 -0.030863 0.026792 0.037037 0.015453 0.002247
2010-01-05 00:00:00+00:00 -0.031244 0.000934 -0.003839 -0.006422 -0.009163 0.012631 -0.015193 -0.013494 0.045354 -0.009028 ... 0.009557 0.004356 0.019444 0.017520 0.030962 0.018231 0.049718 0.053922 0.044928 -0.035874
2010-01-06 00:00:00+00:00 0.050914 -0.016046 -0.007036 0.001944 0.006472 0.021761 0.002373 -0.001334 0.062531 -0.004743 ... 0.016351 0.068666 0.059946 0.070861 0.006494 0.000463 0.065166 -0.011296 0.012483 0.029070
2010-01-07 00:00:00+00:00 -0.020678 -0.001707 -0.014528 0.010428 0.008104 -0.014118 0.005525 -0.015010 0.026052 0.006445 ... 0.003810 0.035847 -0.002057 0.012369 -0.006855 -0.017943 0.000315 0.016801 0.000000 -0.018079
2010-01-08 00:00:00+00:00 0.024705 0.006221 0.016513 -0.000432 0.004567 0.005811 0.030534 -0.002459 0.002243 -0.018872 ... 0.009701 0.061051 0.034003 0.013439 0.042631 -0.011207 0.016709 0.019828 0.020548 -0.010357

5 rows × 1755 columns

In [22]:
# Convert 1D array of spy returns into an N x 1 2D array.
SPY_column = rets[SPY].values[:, np.newaxis]
betas, duration = timed(vectorized_beta)(SPY_column, all_returns.values)

print("AAPL beta is still", betas[all_returns.columns.get_loc(AAPL)])
print("It took {} seconds to calculate betas for {} stocks.".format(duration, len(betas)))
AAPL beta is still 1.06882516324
It took 0.00327587127686 seconds to calculate betas for 1755 stocks.

Now we're starting to get somewhere! If we spend 0.002 seconds calculating betas every trading day, then that means we're only spending about half a second per backtest-year on calculating beta.

In [23]:
0.002 * 252
Out[23]:
0.504

Let's see how we would plug our vectorized_beta function into a CustomFactor:

In [24]:
SPY_asset = symbols('SPY')

daily_returns = DailyReturns()

class MyBeta(CustomFactor):
    # Get daily returns for every asset in existence, plus the daily returns for just SPY
    # as a column vector.
    inputs = [daily_returns, daily_returns[SPY_asset]]
    # Set a default window length of 1 year (252 trading days).
    window_length = 252
    
    def compute(self, today, assets, out, all_returns, spy_returns):
        out[:] = vectorized_beta(spy_returns, all_returns)
In [25]:
my_beta_pipeline = Pipeline({'beta': MyBeta()})
my_beta_pipeline.show_graph('png')
Out[25]:
In [26]:
my_betas, duration = timed(run_pipeline)(my_beta_pipeline, END, END)
print("AAPL's beta is still", my_betas.xs(AAPL, level=1).values[0, 0])
print("It took {} seconds to calculate 252-day betas for 1 day:".format(duration))
AAPL's beta is still 1.06714326974
It took 0.623263835907 seconds to calculate 252-day betas for 1 day:

We've gone from spending 6 seconds to calculate betas for a day to spending just over half a second!

We're actually doing even better than it might seem from those numbers: most of that half second was spent loading pricing data, not calculating betas, so our scaling will be much better if we run for a longer period:

In [27]:
from zipline.utils.calendars import get_calendar
NYSE = get_calendar('NYSE')
In [28]:
my_beta_result, duration = timed(run_pipeline)(my_beta_pipeline, END, END + (5 * NYSE.day))
print("It took {} seconds to calculate 252-day betas for 5 trading days:".format(duration))

my_beta_result, duration = timed(run_pipeline)(my_beta_pipeline, END, END + (40 * NYSE.day))
print("It took {} seconds to calculate 252-day betas for two months:".format(duration))

my_beta_result, duration = timed(run_pipeline)(my_beta_pipeline, END, END + (252 * NYSE.day))
print("It took {} seconds to calculate 252-day betas for a year:".format(duration))
It took 0.695220947266 seconds to calculate 252-day betas for 5 trading days:
It took 1.36936211586 seconds to calculate 252-day betas for two months:
It took 5.96089887619 seconds to calculate 252-day betas for a year:

Our updated MyBeta factor can calculate betas for a year in less than the time it used to take to calculate one day!

That's more than a 250x speedup!

The SimpleBeta Built-In Factor

There are still a few ways that we can improve our MyBeta factor:

  • We can add support for regressing against assets other than SPY (for example, against sector ETFs).
  • We can add better support for dealing with missing data (for example, if we're calculating a 1 year beta, we might still want to compute a "best-effort" beta for stocks that IPO'ed six months ago using the data we have.)
  • We can speed up performance a bit more by playing a few tricks to avoid making unnecessary copies when possible.

Fortunately, we don't need to implement all of those improvements here, because we can just use quantopian.pipeline.factors.SimpleBeta, which implements them for us!

In [29]:
from quantopian.pipeline.factors import SimpleBeta
print(SimpleBeta.__doc__)
    Factor producing the slope of a regression line between each asset's daily
    returns to the daily returns of a single "target" asset.

    Parameters
    ----------
    target : zipline.Asset
        Asset against which other assets should be regressed.
    regression_length : int
        Number of days of daily returns to use for the regression.
    allowed_missing_percentage : float, optional
        Percentage of returns observations that are allowed to be missing when
        calculating betas. Assets with more than this percentage of returns
        observations missing will produce values of NaN. Default behavior is
        that 25% of inputs can be missing.
    

SimpleBeta has two required parameters:

  • target, the asset whose returns should be used as the independent regression variable.
  • regression_length, an integer indicating the number of days of data that should be used to calculate regressions.

SimpleBeta also accepts an optional parameter, allowed_missing_percentage, which controls how missing data is handled. By default, SimpleBeta will produce a beta for an asset as long as we have valid returns observations on at least 75% of days. You can tune this behavior by passing a value between 0 and 1 as allowed_missing_percentage. For example, passing 0.1 would mean that only 10% of values are allowed to be missing, while passing 0.9 would mean that 90% of values can be missing.

Here's an example of using SimpleBeta to compute the same 252-day betas we've been calculating, but with robust handling of missing values:

In [30]:
SPY = symbols('SPY')
simple_beta = SimpleBeta(SPY, 252, allowed_missing_percentage=0.3)
simple_beta_pipeline = Pipeline({'beta': simple_beta})
In [31]:
simple_beta_result, duration = timed(run_pipeline)(simple_beta_pipeline, END, END + (252 * NYSE.day))
print("It took {} seconds to calculate 252-day betas for a year:".format(duration))
It took 11.8971569538 seconds to calculate 252-day betas for a year:

Because SimpleBeta robustly handles missing data, it produces many fewer NaN outputs than our MyBeta factor:

In [32]:
print("SimpleBeta NaN count for 2010:", simple_beta_result.beta.isnull().sum())
print("MyBeta NaN count for 2010:", my_beta_result.beta.isnull().sum())
SimpleBeta NaN count for 2010: 299057
MyBeta NaN count for 2010: 630067

Review and Conclusions

In this notebook, we:

  • Introduced market beta and explained why we might care about it in the context of a trading algorithm.
  • Looked at the RollingLinearRegressionOfReturns Pipeline factor and showed how we could build a much faster vectorized implementation of beta ourselves using numpy.
  • Learned about the new SimpleBeta built-in factor.

If you want to learn more about how these improvements were implemented, you can see the source code as of the time of this writing here in Zipline. You can also see the Zipline pull request here.

Future Work

There are many different ways to estimate beta. In this notebook, we focused on the simple case of calculating beta by running an OLS regression on daily returns observations, but there are many more sophisticated estimation methods we might consider in the future.

It's common, for example, to use a Shrinkage Estimator for beta, which allows us to combine a "raw" beta estimate with a prior belief about how we expect stocks to behave. It's also common to use longer time intervals for returns (e.g., weekly rather than daily returns), to try to smooth out daily fluctuations.
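
To give a flavor of what that looks like, here's a minimal sketch of a shrinkage adjustment (purely illustrative; not the estimator used by any Quantopian factor): the raw estimate is blended with a prior value, commonly 1.0, the beta of the market itself.

import numpy as np

def shrunk_beta(raw_betas, prior=1.0, shrinkage=0.33):
    """Blend raw beta estimates with a prior belief.

    ``shrinkage`` is the weight placed on the prior: 0.0 returns the raw
    estimates unchanged, 1.0 returns the prior for every asset. The 0.33
    used here is purely illustrative (it echoes the classic Blume adjustment).
    """
    raw_betas = np.asarray(raw_betas, dtype=float)
    return shrinkage * prior + (1.0 - shrinkage) * raw_betas

print(shrunk_beta([1.95, 1.07, 0.57]))  # raw estimates get pulled toward 1.0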