An updated method to analyze alpha factors

by Thomas Wiecki, Quantopian, 2019.

We recently released a great alphalens tutorial. While that is the perfect introduction to analyzing factors, we are also constantly evolving our thinking and analyses. In this post, I want to give people an updated but less polished way of analyzing factors. In addition, this notebook contains some updated thoughts on what constitutes a good factor, and tips on how to build one, that we have not shared before. Thus, if you want to increase your chances of scoring well in the contest or getting an allocation, I think this is a good resource to study.

While these new analyses use alphalens functionality, they go beyond it. At the same time, they are much more succinct, so if you previously refrained from using alphalens because it seemed daunting, check out these few plots, which will hopefully give you a sense of what we look for in a factor. At some point in the future, we will probably add this functionality to alphalens as well.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import empyrical as ep
import alphalens as al
import pyfolio as pf

from quantopian.research import run_pipeline, returns
from quantopian.pipeline import Pipeline, CustomFactor
from quantopian.pipeline.factors import Returns, SimpleMovingAverage
from quantopian.pipeline.filters import QTradableStocksUS
from quantopian.pipeline.data import EquityPricing
from quantopian.research.experimental import get_factor_returns, get_factor_loadings

Let's first define a simple factor -- here we will use Moneyflow.

In [2]:
class MoneyflowVolume5d(CustomFactor):
    inputs = (EquityPricing.close, EquityPricing.volume)

    # we need one more day to get the direction of the price on the first
    # day of our desired window of 5 days
    window_length = 6

    def compute(self, today, assets, out, close_extra, volume_extra):
        # slice off the extra row used to get the direction of the close
        # on the first day
        close = close_extra[1:]
        volume = volume_extra[1:]

        dollar_volume = close * volume
        denominator = dollar_volume.sum(axis=0)

        difference = np.diff(close_extra, axis=0)
        direction = np.where(difference > 0, 1, -1)
        numerator = (direction * dollar_volume).sum(axis=0)

        np.divide(numerator, denominator, out=out)
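
To see what the factor computes, here is a minimal NumPy sketch of the same arithmetic on a toy price window. The helper name `moneyflow_ratio` and the prices are made up for illustration; only the slicing and the direction logic mirror the `compute` method above.

```python
import numpy as np

def moneyflow_ratio(close_extra, volume_extra):
    # Hypothetical standalone helper mirroring MoneyflowVolume5d.compute.
    # Drop the extra first row: it is only needed to sign the first day.
    close = close_extra[1:]
    volume = volume_extra[1:]
    dollar_volume = close * volume
    # +1 when the close rose versus the previous day, -1 otherwise
    direction = np.where(np.diff(close_extra, axis=0) > 0, 1, -1)
    return (direction * dollar_volume).sum(axis=0) / dollar_volume.sum(axis=0)

# 6 days x 3 assets of made-up closes: always up, always down, alternating
close_extra = np.array([
    [10.0, 20.0, 5.0],
    [11.0, 19.0, 6.0],
    [12.0, 18.0, 5.0],
    [13.0, 17.0, 6.0],
    [14.0, 16.0, 5.0],
    [15.0, 15.0, 6.0],
])
volume_extra = np.full_like(close_extra, 100.0)

ratios = moneyflow_ratio(close_extra, volume_extra)
print(ratios)
```

The factor is bounded between -1 (all dollar volume traded on down days) and +1 (all on up days); the alternating asset lands in between.
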
In [10]:
universe = QTradableStocksUS()
pipeline_factor = MoneyflowVolume5d()
pipe = Pipeline(screen=universe, columns={'alpha': pipeline_factor})
start = pd.Timestamp("2010-01-05")
end = pd.Timestamp("2017-01-01")
results = run_pipeline(pipe, start_date=start, end_date=end).dropna()
# Normalize factor values so that the absolute weights sum to 1 on each day
port = results['alpha'].divide(results['alpha'].abs().groupby(level=0).sum(), level=0)

Pipeline Execution Time: 13.55 Seconds
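
The normalization step above can be checked on a small example. This is a sketch on made-up values with a hypothetical two-level (date, asset) index like the one a pipeline result has; it uses the same `divide(..., level=0)` idiom as the cell above.

```python
import pandas as pd

# Made-up raw factor values indexed by (date, asset)
index = pd.MultiIndex.from_tuples(
    [("2010-01-05", "A"), ("2010-01-05", "B"), ("2010-01-05", "C"),
     ("2010-01-06", "A"), ("2010-01-06", "B")],
    names=["date", "asset"],
)
raw = pd.Series([2.0, -1.0, 1.0, -3.0, 1.0], index=index, name="alpha")

# Sum of absolute factor values per day, then broadcast over the date level
daily_gross = raw.abs().groupby(level=0).sum()
weights = raw.divide(daily_gross, level=0)

print(weights)
print(weights.abs().groupby(level=0).sum())
```

After this step the factor values can be read as portfolio weights with a gross exposure of 1 on every day. Note that the net exposure is not forced to zero by this step alone.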

Load pricing and risk data:

In [9]:
assets = results.index.levels[1]
pricing = get_pricing(assets, start, end + pd.Timedelta(days=30), fields="close_price")
stock_rets = pricing.pct_change()

# Load risk factor loadings and returns
factor_loadings = get_factor_loadings(assets, start, end + pd.Timedelta(days=30))
factor_returns = get_factor_returns(start, end + pd.Timedelta(days=30))
# Fix a bug in the risk returns
factor_returns.loc[factor_returns.value.idxmax(), 'value'] = 0

Next we define all the plotting functionality we will use here. This is all in a single cell so that you can easily copy and paste it into your own notebook. We will go through each plot in more detail below.

In [6]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import empyrical as ep
import alphalens as al
import pyfolio as pf
   
def calc_perf_attrib(portfolio_returns, portfolio_pos, factor_returns, factor_loadings):
    import empyrical as ep
    start = portfolio_returns.index[0]
    end = portfolio_returns.index[-1]
    factor_loadings.index = factor_loadings.index.set_names(['dt', 'ticker'])
    portfolio_pos.index = portfolio_pos.index.set_names(['dt'])
    
    portfolio_pos = portfolio_pos.drop('cash', axis=1)
    portfolio_pos.columns.name = 'ticker'
    portfolio_pos.columns = portfolio_pos.columns.astype('int')
    
    return ep.perf_attrib(
        portfolio_returns, 
        portfolio_pos.stack().dropna(),
        factor_returns.loc[start:end], 
        factor_loadings.loc[start:end])

def plot_exposures(risk_exposures, ax=None):
    rep = risk_exposures.stack().reset_index()
    rep.columns = ['dt', 'factor', 'exposure']
    sns.boxplot(x='exposure', y='factor', data=rep, orient='h', ax=ax, order=risk_exposures.columns[::-1])

def compute_factor_stats(factor, pricing, factor_returns, factor_loadings, periods=range(1, 15), view=None):
    factor_data_total = al.utils.get_clean_factor_and_forward_returns(
        factor, 
        pricing,
        quantiles=None,
        bins=(-np.inf, 0, np.inf),
        periods=periods,
        cumulative_returns=False,
    )

    portfolio_returns_total = al.performance.factor_returns(factor_data_total)
    portfolio_returns_total.columns = portfolio_returns_total.columns.map(lambda x: int(x[:-1]))
    for i in portfolio_returns_total.columns:
        portfolio_returns_total[i] = portfolio_returns_total[i].shift(i)
    #portfolio_returns_specific = al.performance.factor_returns(factor_data_specific)
    #portfolio_returns_specific.columns = portfolio_returns_specific.columns.map(lambda x: int(x[:-1]))
    #for i in portfolio_returns_specific.columns:
    #    portfolio_returns_specific[i] = portfolio_returns_specific[i].shift(i)

    portfolio_returns_specific = pd.DataFrame(columns=portfolio_returns_total.columns, index=portfolio_returns_total.index)
    
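    # bind the inputs as default arguments so that the function carries
    # everything it needs when it is mapped over a parallel view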
    def calc_perf_attrib_c(i, portfolio_returns_total=portfolio_returns_total, 
                           factor_data_total=factor_data_total, factor_returns=factor_returns, 
                           factor_loadings=factor_loadings):
        return calc_perf_attrib(portfolio_returns_total[i], 
                                factor_data_total['factor'].unstack().assign(cash=0).shift(i), 
                                factor_returns, factor_loadings)
    
    if view is None:
        perf_attrib = map(calc_perf_attrib_c, portfolio_returns_total.columns)
    else:
        perf_attrib = view.map_sync(calc_perf_attrib_c, portfolio_returns_total.columns)
        
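    # pair each attribution result with its period so that a custom
    # `periods` argument (not starting at 1) still lines up
    for n, (period, pa) in enumerate(zip(portfolio_returns_total.columns, perf_attrib)):
        if n == 0:
            # keep exposures and attribution for the shortest delay only
            risk_exposures_portfolio = pa[0]
            perf_attribution = pa[1]
        portfolio_returns_specific[period] = pa[1]['specific_returns']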
    

    delay_sharpes_total = portfolio_returns_total.apply(ep.sharpe_ratio)
    delay_sharpes_specific = portfolio_returns_specific.apply(ep.sharpe_ratio)
    
    return {'factor_data_total': factor_data_total, 
            'portfolio_returns_total': portfolio_returns_total,
            'portfolio_returns_specific': portfolio_returns_specific,
            'risk_exposures_portfolio': risk_exposures_portfolio,
            'perf_attribution': perf_attribution,
            'delay_sharpes_total': delay_sharpes_total,
            'delay_sharpes_specific': delay_sharpes_specific,
    }

def plot_overview_tear_sheet(factor, pricing, factor_returns, factor_loadings, periods=range(1, 15), view=None):
    fig = plt.figure(figsize=(16, 16))
    gs = plt.GridSpec(4, 4)
    ax1 = plt.subplot(gs[0:2, 0:2])
    
    factor_stats = compute_factor_stats(factor, pricing, factor_returns, factor_loadings, periods=periods, view=view)
                         
    pd.DataFrame({'specific': factor_stats['delay_sharpes_specific'], 
                  'total': factor_stats['delay_sharpes_total']}).plot.bar(ax=ax1)
    ax1.set(xlabel='delay', ylabel='IR')

    ax2a = plt.subplot(gs[0, 2:4])
    delay_cum_rets_total = factor_stats['portfolio_returns_total'][list(range(1, 5))].apply(ep.cum_returns)
    delay_cum_rets_total.plot(ax=ax2a)
    ax2a.set(title='Total returns', ylabel='Cumulative returns')
    
    ax2b = plt.subplot(gs[1, 2:4])
    delay_cum_rets_specific = factor_stats['portfolio_returns_specific'][list(range(1, 5))].apply(ep.cum_returns)
    delay_cum_rets_specific.plot(ax=ax2b)
    ax2b.set(title='Specific returns', ylabel='Cumulative returns')
    
    ax3 = plt.subplot(gs[2:4, 0:2])
    plot_exposures(factor_stats['risk_exposures_portfolio'].reindex(columns=factor_stats['perf_attribution'].columns), 
                   ax=ax3)

    ax4 = plt.subplot(gs[2:4, 2])
    ep.cum_returns_final(factor_stats['perf_attribution']).plot.barh(ax=ax4)
    ax4.set(xlabel='Cumulative returns')

    ax5 = plt.subplot(gs[2:4, 3], sharey=ax4)
    factor_stats['perf_attribution'].apply(ep.annual_volatility).plot.barh(ax=ax5)
    ax5.set(xlabel='Ann. volatility')

    gs.tight_layout(fig)
    
    return fig, factor_stats
In [11]:
_, factor_stats = plot_overview_tear_sheet(port, 
                         pricing, 
                         factor_returns, 
                         factor_loadings);
Dropped 0.1% entries from factor data: 0.1% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!

What time horizon is our alpha predictive for?

A very useful first plot to look at is the one in the upper left corner. It shows the information ratio (IR), computed here as the Sharpe ratio of the factor-weighted portfolio's returns, if we were to delay our factor by 1 to 14 days, once for total returns and once for specific returns. In other words, it shows how fast the alpha decays.

We look at this plot a lot when evaluating factors. It's useful because when actually trading on these factors, we tightly control turnover to reduce slippage. As such, we might not get the full exposure to this factor in our portfolio on the day a signal becomes available. Instead, we want to make sure that as we build exposure towards this factor (which can span a few days), it still has some predictive power left and is not actively hurting us. Thus, it is critical to know how long the alpha is good for and when it starts turning bad, which this plot conveys. Unfortunately this factor is not all that strong (please post an update if you have one that looks better). As you can see, cumulative alpha is negative most of the time.
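
The idea of delaying a factor can be sketched on synthetic data. This is an illustration under stated assumptions, not the alphalens-based computation used in the notebook: the signal, the noise level, and the helper `annualized_sharpe` are all made up, and the signal is constructed so that it only predicts the next day's return.

```python
import numpy as np
import pandas as pd

def annualized_sharpe(returns, periods_per_year=252):
    # Hypothetical helper: mean over standard deviation, annualized
    returns = returns.dropna()
    return np.sqrt(periods_per_year) * returns.mean() / returns.std(ddof=1)

rng = np.random.RandomState(0)
n_days, n_assets = 500, 50
dates = pd.bdate_range("2015-01-01", periods=n_days)

# Synthetic signal that (by construction) predicts only the next day's return
signal = pd.DataFrame(rng.normal(size=(n_days, n_assets)), index=dates)
noise = pd.DataFrame(rng.normal(scale=0.02, size=(n_days, n_assets)), index=dates)
asset_returns = 0.01 * signal.shift(1) + noise

# Weights with gross exposure 1 per day
weights = signal.div(signal.abs().sum(axis=1), axis=0)

delay_sharpes = {}
for delay in range(1, 6):
    # Trade on weights that are `delay` days old
    portfolio_returns = (weights.shift(delay) * asset_returns).sum(axis=1, min_count=1)
    delay_sharpes[delay] = annualized_sharpe(portfolio_returns)
delay_sharpes = pd.Series(delay_sharpes)

print(delay_sharpes)
```

In this toy setup all of the predictive power sits at a delay of one day and the longer delays are indistinguishable from noise. A real factor that can be traded under turnover constraints would ideally decay much more slowly.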

Forecasting specific returns

Although looking at the total returns is useful, it has a shortcoming: it only shows how well the forecast predicts total stock returns. In theory, we could be looking at a pure mean-reversion factor which might look good on this measure, but it would not be something new and interesting (and it would be less likely to continue to be predictive). Thus, we do not just want to know how predictive the factor is for total returns, but also how predictive it is for specific returns (i.e. stock returns where we subtract out the common risk contributions).

Here the factor actually seems to have some small alpha and we can see the common pattern that most factors are most predictive on the first day and then decay over time. So this means if we failed to get the full exposure on the first day, we would miss out on a lot of predictive power. Moreover, if this factor value lingers around in our portfolio for a longer time (again, this could be due to turnover constraints), it actually starts to hurt us as the return forecast turns negative. Ideally the curve would look positive for total and specific returns over a long time horizon.
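
As a rough sketch of what "subtracting out the common risk contributions" means: on each day, the common return is the sum over risk factors of the portfolio's exposure times that factor's return, and the specific return is what is left. The numbers and the two factor names below are made up for illustration.

```python
import pandas as pd

dates = pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"])

# Made-up daily portfolio returns, portfolio exposures, and risk factor returns
total_returns = pd.Series([0.010, -0.002, 0.003], index=dates)
exposures = pd.DataFrame(
    {"momentum": [0.5, 0.4, 0.0], "size": [-0.2, 0.1, 0.3]}, index=dates
)
factor_returns = pd.DataFrame(
    {"momentum": [0.01, -0.01, 0.02], "size": [0.005, 0.02, -0.01]}, index=dates
)

# Return explained by the risk factors: exposure times factor return, summed
common_returns = (exposures * factor_returns).sum(axis=1)

# Whatever the risk factors do not explain
specific_returns = total_returns - common_returns

print(pd.DataFrame({"total": total_returns,
                    "common": common_returns,
                    "specific": specific_returns}))
```

On the second day the entire loss is explained by the risk factors, so the specific return is zero even though the total return is negative.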

Then in the upper right corner you can see the cumulative returns.

Risk exposures

The lower plots quantify exactly what the exposures are. This functionality exists in pyfolio, which was designed to analyze portfolios, not factors. While alphalens has a function to convert a factor to the portfolio returns and positions that pyfolio expects, I wrote a custom one because I found the alphalens version to be quite slow (I still need to investigate why). Note that we are also equal-weighting the long and short sides.

This plot is also critical because it shows us two things:

  1. Are there persistent exposures to risk factors? We would see this by the distribution (represented by the box-and-whiskers plot) being shifted to the left or right of zero for a single risk factor. This does not seem to be the case here.
  2. How much does exposure vary over time? This is represented by the width of the distribution and indicates factor timing: you are not just keeping a constant exposure but varying it over time (while still averaging out close to 0). There seems to be some of that going on here, specifically for size.
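
The two cases in the list above can be told apart with simple summary statistics. The sketch below uses made-up weights and loadings for two assets and two hypothetical risk factors: the portfolio's exposure to a factor on a given day is the weighted sum of the stock-level loadings, and the mean and standard deviation over time correspond to the position and width of each box in the plot.

```python
import pandas as pd

dates = pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06", "2016-01-07"])
assets = ["A", "B"]

# Made-up long/short weights that flip sides halfway through
weights = pd.DataFrame(
    [[0.5, -0.5], [0.5, -0.5], [-0.5, 0.5], [-0.5, 0.5]],
    index=dates, columns=assets,
)

# Made-up stock-level loadings on two hypothetical risk factors
loadings = {
    "size": pd.DataFrame([[1.0, -1.0]] * 4, index=dates, columns=assets),
    "value": pd.DataFrame(
        [[0.6, 0.2], [0.6, 0.2], [0.2, 0.6], [0.2, 0.6]],
        index=dates, columns=assets,
    ),
}

# Portfolio exposure per day and factor: weighted sum of stock loadings
portfolio_exposures = pd.DataFrame(
    {name: (weights * factor_loadings).sum(axis=1)
     for name, factor_loadings in loadings.items()}
)

summary = portfolio_exposures.agg(["mean", "std"])
print(summary)
```

Here the value exposure is persistent (constant at 0.2, so the mean is off zero with no spread), while the size exposure averages out to zero but swings between +1 and -1, which is the factor-timing pattern.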

Another way of thinking about risk factors is as a method of telling you how close what you have is to what others have already discovered a long time ago.

Two other useful things to look at in terms of exposures are the cumulative returns and the annualized volatility you get from each individual exposure (the two bar charts in the lower right). These tell you where your returns and your risk are coming from.
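
For reference, here is a small sketch of how such per-source summaries can be computed by hand with pandas. It uses the usual definitions (compounded return, and daily standard deviation scaled by the square root of 252) on made-up numbers; the column names are only examples, and this is not claimed to match the library implementation detail for detail.

```python
import numpy as np
import pandas as pd

# Made-up daily return contributions, one column per source
attribution = pd.DataFrame({
    "momentum": [0.10, -0.10, 0.0, 0.0],
    "specific_returns": [0.01, -0.01, 0.01, -0.01],
})

# Compounded return over the whole period, per column
cumulative = (1 + attribution).prod() - 1

# Daily standard deviation scaled to a year (assuming 252 trading days)
annual_vol = attribution.std(ddof=1) * np.sqrt(252)

print(pd.DataFrame({"cumulative": cumulative, "annual_vol": annual_vol}))
```

A source with near-zero cumulative return but high volatility adds risk without paying for it, which is why the two bar charts are worth reading side by side.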

Acknowledgements

Thanks to David Sargent, Max Margenot, Josh Payne, and Luca Scarabello for useful feedback on an earlier draft.