Today we're announcing the release of SimpleBeta
, a new Pipeline API factor that makes it easy to efficiently calculate market beta for all stocks in a trading universe.
In this notebook, we introduce the idea of market beta and show how SimpleBeta
improves upon previously-available tools for calculating beta in Pipeline.
RollingLinearRegressionOfReturns
to compute market beta for a large universe of assets.beta
implementation from first principles as a CustomFactor
.SimpleBeta
built-in factor, which is over 130x faster than RollingLinearRegressionOfReturns
and can handle missing input data.When stock prices move, they tend to move in sync with the movements of the overall market. Not all stocks exhibit this tendency equally, however: some stocks have historically been more or less sensitive to the movements of the market.
One simple way to measure how much a stock tends to move with the rest of the market is to run a linear regression between the returns of the market and the returns of the stock in question. The slope of such a regression gives us a measure of how strongly the returns of an asset tend to be affected by the returns of the market as a whole.
Since linear regression equations are often written as:
$$ Y \sim \alpha + \beta X$$
(read: Y is distributed as alpha plus beta X) we often refer to the slope of the regression line between an asset's returns and the market's returns as the asset's "market beta".
Let's see how we would calculate beta for AAPL. One simple method is to use scipy.stats.linregress
to run a linear regression on daily returns. In practice we often use the price of a broad-market ETF like SPY
as a proxy for the return of "the market".
from __future__ import print_function
import numpy as np
import pandas as pd
from quantopian.research import returns
START = pd.Timestamp('2010-01-04')
END = pd.Timestamp('2010-12-31')
# Load returns for AAPL (Apple Computer) and SPY (an ETF often used as a proxy for the market).
rets = returns(['AAPL', 'SPY'], START, END)
AAPL, SPY = rets.columns
# Show the first 5 rows.
rets.head()
linregress
returns a namedtuple containing the slope, intercept, and several other useful statistics about the regression.
from scipy.stats import linregress
linregress(x=rets[SPY].values, y=rets[AAPL].values)
This regression tells us that AAPL
's beta to SPY
was a little over 1.0 for the year of 2010. This shouldn't be terribly surprising: Apple is such a large company that its returns account for a substantial portion of the overall market's return all by itself.
We can visualize the results of a regression using matplotlib and seaborn.
import matplotlib.pyplot as plt
import seaborn as sns
def regression_plot(data, size=11, legend_position='lower right'):
"""Calculate and plot a linear regression between two columns of data.
Parameters
----------
data : pd.DataFrame
DataFrame with two columns to be regressed and plotted.
The first column of the frame will be plotted on the y-axis.
The secont column of the frame will be plotted on the x-axis.
figsize : tuple, optional
Size of the resulting figure in (x_inches, y_inches)
legend_position : string
Location on the plot for the legend.
Returns
-------
ax : matplotlib.axes.Axes
Axes containing the scatterplot and regression line.
"""
ylabel, xlabel = data.columns
# Draw scatterplot with marginals using seaborn.
grid = sns.jointplot(x=data[xlabel], y=data[ylabel],
size=size, joint_kws={'alpha': 0.6, 'linewidth': 0})
# The plot generated by seaborn is composed of three matplotlib.Axes objects.
# The scatter plot is drawn on ax_joint. Draw a regression line on top of the scatter plot.
axes = grid.ax_joint
draw_regression_line(Y=data[ylabel], X=data[xlabel], ax=axes)
# Set X and Y Axis Bounds
min_ = data.values.min() * 1.05 # Leave a margin at the edge of the figure.
max_ = data.values.max() * 1.05
axes.set_ylim(min_, max_)
axes.set_xlim(min_, max_)
# Draw legend.
axes.legend(fontsize=13, loc=legend_position)
# Set title on the overall figure.
grid.fig.suptitle(
'{} Returns vs {} Returns'.format(ylabel.symbol, xlabel.symbol),
size=18,
y=1.02
)
# Set axis label sizes.
axes.set_ylabel(ylabel.symbol, size=15)
axes.set_xlabel(xlabel.symbol, size=15)
return axes
def draw_regression_line(Y, X, ax=None):
"""Calculate and plot a linear regression line.
Parameters
----------
Y : np.array or pd.Series
Y values for the regression.
X : np.array or pd.Series
X values for the regression.
ax : matplotlib.Axes, optional
Axes on which to draw the line.
If not passed, default is ``matplotlib.pyplot.gca()``.
"""
if ax is None:
ax = plt.gca()
result = linregress(x=X, y=Y)
slope = result.slope
intercept = result.intercept
x_endpoints = np.array([X.min(), X.max()]) * 2
y_endpoints = intercept + slope * x_endpoints
# Format a regression equation in LaTeX.
plot_label = "${} = {:.4g}{} {} {:.4g}$".format(
r'\mathrm{' + Y.name.symbol + '}',
slope,
r'\mathrm{' + X.name.symbol + '}',
'+' if intercept > 0 else '-',
abs(intercept),
)
ax.plot(x_endpoints, y_endpoints, label=plot_label, linestyle='--', color='red', linewidth=2)
return ax
regression_plot(rets);
Most stocks display at least some historical tendency to move along with the overall market, but different stocks can display different sensitivities at different times.
United Steel (X
), for example, has historically been more sensitive to the market than most stocks.
X = returns('X', START, END)
linregress(x=rets[SPY].values, y=X.values)
regression_plot(returns(['X', 'SPY'], START, END));
McDonald's (MCD
), on the other hand, has historically been less sensitive to changes in the market as a whole.
MCD = returns('MCD', START, END)
linregress(x=rets[SPY].values, y=MCD.values)
regression_plot(returns(['MCD', 'SPY'], START, END));
In an algorithm that trades a large number of assets, it's often useful to know each stock's historical beta to the market. One way to try to control an algorithm's market beta, for example, is to keep the beta-weighted exposure of the algorithm's positions close to zero.
For a long time, the Pipeline API has provided a built-in factor named RollingLinearRegressionOfReturns
that could be used to calculate simple linear regressions for all assets in the Quantopian database.
The implementation of RollingLinearRegressionOfReturns
essentially boils down to something like this:
from quantopian.pipeline.factors import CustomFactor, DailyReturns
class MyRollingLinearRegressionOfReturns(CustomFactor):
outputs = ['alpha', 'beta', 'r_value', 'p_value', 'stderr']
inputs = [DailyReturns(), DailyReturns()[symbols('SPY')]]
def compute(self, today, assets, out, dependent, independent):
alpha = out.alpha
beta = out.beta
r_value = out.r_value
p_value = out.p_value
stderr = out.stderr
def regress(y, x):
regr_results = linregress(y=y, x=x)
alpha[i] = regr_results.intercept
beta[i] = regr_results.slope
r_value[i] = regr_results.rvalue
p_value[i] = regr_results.pvalue
stderr[i] = regr_results.stderr
for i in range(len(out)):
regress(y=dependent[:, i].ravel(), x=independent.ravel())
(The full source for RollingLinearRegressionOfReturns
as of the time of this writing can be found in Zipline here.)
This class uses the multiple outputs feature of CustomFactor
to simultaneously output all the values computed by scipy.stats.linregress
every day.
We can use this to calculate a rolling beta for every asset in Quantopian's database like this:
from quantopian.pipeline import Pipeline
regression = MyRollingLinearRegressionOfReturns(window_length=252) # two year lookback
pipeline = Pipeline({'beta': regression.beta})
The computation graph for this pipeline looks like this:
pipeline.show_graph('png')
from quantopian.research import run_pipeline
betas = run_pipeline(pipeline, END, END)
betas.head()
Unfortunately, for RollingLinearRegressionOfReturns
, looping over every column of input every day and calling scipy.stats.linregress
each time isn't very efficient, especially if we only care about the slope of the regression.
Looping over numpy arrays in pure python is slow, and the implementation of linregress
isn't designed with the assumption that it will be called a large number of times, so it does some expensive things like defining a new namedtuple subclass on every invocation.
linregress
also calculates an intercept, a p-value, an R-squared value, and a standard error of the estimate (all relatively expensive computations), but those calculations are wasted if all we care about is the slope of the regression.
We can measure roughly how long it takes to calculate betas with RollingLinearRegressionOfReturns
using time.time
and a bit of cleverness:
import time
def timed(f):
"""Take a function f and return a **new function** that calls f and returns a pair
containing f's result and how long it took to compute.
"""
def timed_f(*args, **kwargs):
start = time.time()
result = f(*args, **kwargs)
end = time.time()
difference = end - start
return result, difference
return timed_f
# This cell should take about 45 seconds to execute.
result, duration = timed(run_pipeline)(pipeline, '2009-01-02', '2009-01-02')
print("It took {} seconds to calculate 252-day betas for 1 day:".format(duration))
result, duration = timed(run_pipeline)(pipeline, '2009-01-02', '2009-01-10')
print("It took {} seconds to calculate 252-day betas for 5 trading days:".format(duration))
The 6 seconds per day translates to about 25 minutes per backtest-year, which means that a two year backtest using RollingLinearRegressionOfReturns
to calculate beta for all assets might spend over an hour just calculating betas!
Fortunately for us, we can speed up our beta calculations by writing them ourselves as vectorized numpy operations.
Let's review the mathematics of linear regression and see if we can find a faster way to implement our beta calculation.
The simplest and most common way to calculate the slope of a simple linear regression for an independent variable $X$ and dependent variable $Y$ is to use the formula:
$$\beta = \frac{Cov*{sample}(X, Y)}{Cov*{sample}(X, X)}$$
where $Cov_{sample}(X, Y)$ is the sample covariance of $X$ and $Y$.
The $_{sample}$ subscript I'm using here is a bit of non-standard notation Covariance is a statistical measure of how much a pair of random variables "move together". Sample covariance is a particular way of estimating the covariance of a distribution from a set of observations.
There are many subtleties to robustly estimating covariance (you can read about some of those subtleties in the Quantopian Lecture Series article on covariance estimation). For now, however, we're just going to focus on re-creating what scipy.stats.linregress
does, but faster.
Given observations $x_1, x_2, \ldots, x_N$ drawn from $X$, and observations $y_1, y_2, \ldots, y_N$ drawn from $Y$, the formula for sample covariance is:
$$Cov*{sample}(X, Y) = \frac{1}{N - 1}\sum*{i=1}^{N}{(x_i - \bar{X}))(y_i - \bar{Y}))}$$
where $\bar{X}$ is the sample mean of $x_1, x_2, \ldots x_N$, and $\bar{Y}$ is the sample mean of $y_1, y_2, \ldots y_N$.
Here's one way of thinking about what this formula "says":
For each pair of $(x_i, y_i)$ observations, look at how much $x_i$ is above or below the average value of all the $x$'s and multiply that value by how much $y_i$ is above or below the average value of all the $y$'s.
- If $x_i$ and $y_i$ are both above or both below their average, we'll get a positive value for $(x_i - \bar{X})(y_i - \bar{Y})$.
- If $x_i$ and $y_i$ are on opposited sides of their averages, we'll get a negative value for $(x_i - \bar{X})(y_i - \bar{Y})$.
To calculate sample covariance, we take the average of all of these multiplied mean-deviation samples, except we divide by $N - 1$ instead of $N$ when calculating the mean to correct for a small bias introduced by the fact that we're estimating the population mean and covariance at the same time.
- If above-average $x$ samples tend to correspond to above-average $y$ samples in our data, then we'll end up averaging mostly positive values, so we'll estimate a positive covariance.
- If below-average $x$ samples tend to correspond to above-average $y$ samples, we'll end up averaging mostly negative values, so we'll estimate a negative covariance.
- If there isn't any relationship between $x$'s being above their mean and $y$'s being above their mean, then we'll end up averaging a mix of positive and negative values, so we'll end up with a covariance near zero.
Since we're interested in measuring how the returns of each asset depend on the returns of the market, our independent variable $X$ will be the returns of SPY
, and our dependent variable $Y$ will be the returns of the asset whose beta we're calculating.
To get started, lets see if we can speed up our single-stock beta calculation by copying the formula above as closely as possible.
def timeit(f):
"""Take a function f and return a new function that calls 5000 times,
returning the last result and the average runtime, ignoring the ten
slowest/fastest calls. This gives a more reliable estimate for fast functions.
"""
_time=time.time
def timed_f(*args, **kwargs):
times = []
for _ in range(5000):
start = _time()
result = f(*args, **kwargs)
end = _time()
times.append(end - start)
# Take the average of the middle three to smooth out variance from cache effects.
average_time = sum(sorted(times)[10:-10]) / (len(times) - 20)
return result, average_time
return timed_f
def beta_v0(spy, asset):
"""Calculate the slope of a linear regression for one asset with SPY.
"""
asset_residuals = asset - asset.mean()
spy_residuals = spy - spy.mean()
covariance = (asset_residuals * spy_residuals).sum() / (len(asset) - 1)
spy_variance = (spy_residuals ** 2).sum() / (len(asset) - 1)
return covariance / spy_variance
v0_result, v0_duration = timeit(beta_v0)(rets[SPY].values, rets[AAPL].values)
print("Our AAPL beta is {}. Calculation took {} seconds.".format(v0_result, v0_duration))
scipy_result, scipy_duration = timeit(linregress)(rets[SPY].values, rets[AAPL].values)
print("linregress AAPL beta is {.slope}. Calculation took {} seconds.".format(
scipy_result, scipy_duration
))
print()
print("Our calculation was {} times faster than linregress.".format(scipy_duration / v0_duration))
So far so good. We can reproduce scipy's result exactly, and we've gotten a significant speedup by avoiding extra calculations we didn't need.
We can speed up our implementation a bit more by noticing that we're dividing both covariance
and spy_variance
by $N - 1$ even though the next thing we do is divide covariance
by spy_variance
. Since the divisions by $N - 1$ are going to end up canceling out, we don't need to bother with them in the first place.
def beta_v1(spy, asset):
"""Calculate the slope of a linear regression for one asset with SPY.
"""
asset_residuals = asset - asset.mean()
spy_residuals = spy - spy.mean()
# These variable names aren't really correct anymore since we're
# taking a sum instead of a mean, but I couldn't think of better names.
covariance = (asset_residuals * spy_residuals).sum() # Not dividing here anymore.
spy_variance = (spy_residuals ** 2).sum() # Not dividing here anymore.
return covariance / spy_variance
v1_result, v1_duration = timeit(beta_v1)(rets[SPY].values, rets[AAPL].values)
print("Our AAPL beta is still {}. Calculation took {} seconds.".format(v1_result, v1_duration))
print("Our calculation was {} times faster than linregress.".format(scipy_duration / v1_duration))
Once we have an efficient single-asset implementation of beta, we can adapt it to efficiently perform a vectorized regression against multiple assets using multi-dimensional arrays and numpy broadcasting.
def vectorized_beta(spy, assets):
"""Calculate beta between every column of ``assets`` and ``spy``.
Parameters
----------
spy : np.array
An (n x 1) array of returns for SPY.
assets : np.array
An (n x m) array of returns for m assets.
"""
assert len(spy.shape) == 2 and spy.shape[1] == 1, "Expected a column vector for spy."
asset_residuals = assets - assets.mean(axis=0)
spy_residuals = spy - spy.mean()
covariances = (asset_residuals * spy_residuals).sum(axis=0)
spy_variance = (spy_residuals ** 2).sum()
return covariances / spy_variance
Let's see how our vectorized beta handles calculating beta for every asset in the QTradableStocksUS
.
from quantopian.pipeline.experimental import QTradableStocksUS
def get_qtradable_for(day):
"""Get the QTradableStocksUS for a single day.
"""
pipe = Pipeline({}, screen=QTradableStocksUS())
result = run_pipeline(pipe, day, day)
return result.index.get_level_values(1)
universe = get_qtradable_for(END)
print(universe[:5])
all_returns = returns(universe, START, END).dropna(how='any', axis=1)
all_returns.head()
# Convert 1D array of spy returns into an N x 1 2D array.
SPY_column = rets[SPY].values[:, np.newaxis]
betas, duration = timed(vectorized_beta)(SPY_column, all_returns.values)
print("AAPL beta is still", betas[all_returns.columns.get_loc(AAPL)])
print("It took {} seconds to calculate betas for {} stocks.".format(duration, len(betas)))
Now we're starting to get somewhere! If we spend 0.002 seconds calculating betas every trading day, then that means we're only spending about half a second per backtest-year on calculating beta.
0.002 * 252
Let's see how we would plug our vectorized_beta
function into a CustomFactor
:
SPY_asset = symbols('SPY')
daily_returns = DailyReturns()
class MyBeta(CustomFactor):
# Get daily returns for every asset in existence, plus the daily returns for just SPY
# as a column vector.
inputs = [daily_returns, daily_returns[SPY_asset]]
# Set a default window length of 2 years.
window_length = 252
def compute(self, today, assets, out, all_returns, spy_returns):
out[:] = vectorized_beta(spy_returns, all_returns)
my_beta_pipeline = Pipeline({'beta': MyBeta()})
my_beta_pipeline.show_graph('png')
my_betas, duration = timed(run_pipeline)(my_beta_pipeline, END, END)
print("AAPL's beta is still", my_betas.xs(AAPL, level=1).values[0, 0])
print("It took {} seconds to calculate 252-day betas for 1 day:".format(duration))
We've gone from spending 6 seconds to calculate betas for a day to spending just over half a second!
We're actually doing even better than it might seem from those numbers: most of that half second was spent loading pricing data, not calculating betas, so our scaling will be much better if we run for a longer period:
from zipline.utils.calendars import get_calendar
NYSE = get_calendar('NYSE')
my_beta_result, duration = timed(run_pipeline)(my_beta_pipeline, END, END + (5 * NYSE.day))
print("It took {} seconds to calculate 252-day betas for 5 trading days:".format(duration))
my_beta_result, duration = timed(run_pipeline)(my_beta_pipeline, END, END + (40 * NYSE.day))
print("It took {} seconds to calculate 252-day betas for two months:".format(duration))
my_beta_result, duration = timed(run_pipeline)(my_beta_pipeline, END, END + (252 * NYSE.day))
print("It took {} seconds to calculate 252-day betas for a year:".format(duration))
Our updated MyBeta
factor can calculate betas for a year in less than the time it used to take to calculate one day!
That's more than a 250x speedup!
SimpleBeta
Built-In Factor
¶There are still a few ways that we can improve our MyBeta
factor:
SPY
(for example, against sector ETFs).Fortunately, we don't need to implement all of those improvements here, because we can just use quantopian.pipeline.factors.SimpleBeta
, which implements them for us!
from quantopian.pipeline.factors import SimpleBeta
print(SimpleBeta.__doc__)
SimpleBeta
has two required parameters:
target
, the asset whose returns should be used as the independent regression variable.regression_length
, an integer indicating the number of days of data that should be used to calculate regressions.SimpleBeta
also accepts an parameter, allowed_missing_percentage
, which controls how missing data is handled. By default, SimpleBeta
will produce a beta for an asset as long as we have valid returns observations on at least 75% of days. You can tune this behavior by passing a value between 0 and 1 as allowed_missing_percentage
. For example, passing 0.1
would mean that only 10% of values are allowed to be missing, while passing 0.9
would mean that 90% of values can be missing.
Here's an example of using SimpleBeta
to compute the same 252-day betas we've been calculating, but with robust handling of missing values:
SPY = symbols('SPY')
simple_beta = SimpleBeta(SPY, 252, allowed_missing_percentage=0.3)
simple_beta_pipeline = Pipeline({'beta': simple_beta})
simple_beta_result, duration = timed(run_pipeline)(simple_beta_pipeline, END, END + (252 * NYSE.day))
print("It took {} seconds to calculate 252-day betas for a year:".format(duration))
Because SimpleBeta
robustly handles missing data, it produces many fewer NaN
outputs than our MyBeta
factor:
print("SimpleBeta NaN count for 2010:", simple_beta_result.beta.isnull().sum())
print("MyBeta NaN count for 2010:", my_beta_result.beta.isnull().sum())
In this notebook, we:
RollingLinearRegressionOfReturns
Pipeline factor and showed how we could build a much faster vectorized implementation of beta
ourselves using numpy.SimpleBeta
built-in factor.If you want to learn more about how these improvements were implemented, you can see the source code as of the time of this writing here in Zipline. You can also see the Zipline pull request here.
There are many different ways to estimate beta. In this notebook, we focused on a the simple case of calculating beta by running an OLS regression on daily returns observations, but there are many more sophisticated estimation methods we might consider in the future.
It's common, for example, to use a Shrinkage Estimator for beta, which allows us to combine a "raw" beta estimate with a prior belief about how we expect stocks to behave. It's also common to use longer time intervals for returns (e.g., weekly rather than daily returns), to try to smooth out daily fluctuations.