Notebook

Factor Analysis

by Maxwell Margenot, Gil Wassermann, James Christopher Hall, and Delaney Granizo-Mackenzie.

Part of the Quantopian Lecture Series:

Notebook released under the Creative Commons Attribution 4.0 License. Please do not remove this attribution.

How can we tell whether an alpha factor is good or not? Unfortunately, there is no cut-off that tells you whether a factor is objectively useful. Instead, we need to compare a particular factor to other options before deciding whether to use it. Our end goal in defining and selecting the best factors is to use them to rank stocks in a long-short equity strategy, covered elsewhere in the lecture series. The more independently predictive the factors we use, the better our ranking scheme and our overall strategy will be.

What we want when comparing factors is to make sure the chosen signal is actually predictive of relative price movements. We do not want to predict the absolute amount the assets in our universe will move up or down. We only care that we can choose assets to long that will do better than the assets we short. In a long-short equity strategy, we hold a long basket and a short basket of assets, determined by the factor values associated with each asset in our universe. If our ranking scheme is predictive, assets in the top basket will tend to outperform assets in the bottom basket. As long as this spread is consistent over time, our strategy will have a positive return.
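The mechanics of that spread can be illustrated with a toy calculation; all the return numbers below are made up purely for illustration:

```python
import numpy as np

# Hypothetical one-day returns for the assets in our two baskets
top_basket_returns = np.array([0.010, 0.004, -0.002, 0.007])       # longs
bottom_basket_returns = np.array([-0.003, 0.001, -0.008, -0.005])  # shorts

# Equal-weighted long-short spread: long the top basket, short the bottom
spread = top_basket_returns.mean() - bottom_basket_returns.mean()
```

Even though some individual longs lost money here, the strategy profits as long as the top basket beats the bottom basket on average.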

An individual factor can have a lot of moving parts to assess, but ideally it should be independent of other factors that you are already trading on in order to keep your portfolio diverse. We discuss the reasoning for this in this lecture.

In this lecture, we detail and explain relevant statistics to evaluate your alpha factor before attempting to implement it in an algorithm. What's important to keep in mind is that all the metrics provided here are relative to other factors you may be trading or evaluating.

Let's have a look at a factor and try to assess its viability. We will calculate the factor values using Pipeline, so make sure you check out the tutorial if you are unfamiliar with how Pipeline works.

In [1]:
import numpy as np
import pandas as pd
from quantopian.research import run_pipeline
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import CustomFactor, Returns
from quantopian.pipeline.classifiers.morningstar import Sector
from quantopian.pipeline.filters import Q1500US
from time import time

Momentum

Here we will be using a momentum factor as our example. Momentum factors are a very common form of alpha factor and they come in many shapes and sizes. They all try to get at the same idea, however, that securities in motion will stay in motion. Momentum factors try to quantify trends in financial markets and to "ride the wave", so to speak.

Let's say that we suspect that a momentum factor could potentially be predictive of stock returns. We define it as a CustomFactor so that we can easily pull its values when we run our Pipeline. We should get a factor value for every security in our universe.

In [2]:
class MyFactor(CustomFactor):
    """ Momentum factor """
    inputs = [USEquityPricing.close,
              Returns(window_length=126)]
    window_length = 252

    def compute(self, today, assets, out, prices, returns):
        out[:] = ((prices[-21] - prices[-252]) / prices[-252] -
                  (prices[-1] - prices[-21]) / prices[-21]) / np.nanstd(returns, axis=0)

This momentum factor takes the change in price over the past year, up until a month ago, subtracts the change in price over the last month, and standardizes the result by the volatility of returns. This allows it to account for any reversal over the past month and use it to temper its expectations of future movement.
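As a sanity check, the same arithmetic can be reproduced outside of Pipeline for a single asset. The prices below are fabricated, and the volatility term is simplified to the standard deviation of daily returns (the Pipeline version standardizes by 126-day `Returns` across all assets):

```python
import numpy as np

np.random.seed(42)

# Fabricated daily closing prices for one asset over one trading year
prices = 100.0 * np.cumprod(1.0 + np.random.normal(0.0005, 0.01, 252))
daily_returns = np.diff(prices) / prices[:-1]

# Twelve-month return excluding the most recent month...
long_term = (prices[-21] - prices[-252]) / prices[-252]
# ...minus the most recent month's return...
short_term = (prices[-1] - prices[-21]) / prices[-21]
# ...scaled by return volatility, mirroring compute() above
factor_value = (long_term - short_term) / np.nanstd(daily_returns)
```

A higher factor value indicates sustained upward movement over the year that has not already been given back in the last month.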

Judging a Factor with Alphalens

In order to judge whether a factor is viable, we have created a package called Alphalens. Its source code is available on GitHub if you want to get into the nitty-gritty of how it works. We use Alphalens to create a "tear sheet" of a factor, similar to how we use pyfolio to create a tear sheet for analyzing backtests.

In [3]:
import alphalens as al

Alphalens takes your factor and examines how useful it is for predicting relative value through a collection of different metrics. It breaks all the stocks in your chosen universe into different quantiles based on their ranking according to your factor and analyzes the returns, information coefficient (IC), the turnover of each quantile, and provides a breakdown of returns and IC by sector.
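The information coefficient that Alphalens reports is the Spearman rank correlation between factor values and subsequent returns. A bare-bones, tie-free version can be sketched with NumPy alone (the factor values and returns below are made up):

```python
import numpy as np

def spearman_ic(factor_values, forward_returns):
    """Spearman rank correlation: Pearson correlation of the ranks.
    This simple ranking assumes no tied values."""
    rank = lambda x: np.argsort(np.argsort(x))
    return np.corrcoef(rank(factor_values), rank(forward_returns))[0, 1]

# Made-up data: the factor orders these five assets exactly the same way
# as their forward returns, so the IC is a perfect 1.0
ic = spearman_ic(np.array([2.1, -0.5, 1.3, 0.8, -1.7]),
                 np.array([0.02, -0.01, 0.015, 0.005, -0.02]))
```

Because the IC is computed on ranks, it rewards a factor for ordering assets correctly rather than for predicting the magnitude of their moves, which is exactly what a ranking scheme needs.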

Throughout the course of this lecture we will detail how to interpret the various individual plots generated by an Alphalens tear sheet and include the proper call to generate the whole tear sheet at once at the end.

Sector Codes

These are the possible sector codes for each security, as given by Morningstar. We will use this dictionary to help categorize our results as we walk through a factor analysis so that we can break out our information by sector.

In [4]:
MORNINGSTAR_SECTOR_CODES = {
     -1: 'Misc',
    101: 'Basic Materials',
    102: 'Consumer Cyclical',
    103: 'Financial Services',
    104: 'Real Estate',
    205: 'Consumer Defensive',
    206: 'Healthcare',
    207: 'Utilities',
    308: 'Communication Services',
    309: 'Energy',
    310: 'Industrials',
    311: 'Technology',
}

Defining a universe

As always, we need to define our universe. In this case we use the Q1500US, as seen in the forums here.

In [5]:
universe = Q1500US()

Getting Data

Now we will pull values for our factor for all stocks in our universe by using Pipeline. We also want to make sure that we have the sector code for each individual equity, so we add Sector as another factor for our Pipeline. Note that running the Pipeline may take a while.

In [6]:
pipe = Pipeline(
    columns = {
            'MyFactor' : MyFactor(mask=universe),
            'Sector' : Sector()
    },
    screen=universe
)

start_timer = time()
results = run_pipeline(pipe, '2015-01-01', '2016-01-01')
end_timer = time()
results.fillna(value=0);

/venvs/py27/local/lib/python2.7/site-packages/numpy/lib/nanfunctions.py:1202: RuntimeWarning: Degrees of freedom <= 0 for slice.
  warnings.warn("Degrees of freedom <= 0 for slice.", RuntimeWarning)
Pipeline Execution Time: 8.23 Seconds
In [7]:
print "Time to run pipeline %.2f secs" % (end_timer - start_timer)
Time to run pipeline 12.21 secs

Let's take a look at the data to get a quick sense of what we have.

In [8]:
my_factor = results['MyFactor']
print my_factor.head()
2015-01-02 00:00:00+00:00  Equity(2 [ARNC])     5.239913
                           Equity(24 [AAPL])    5.451305
                           Equity(41 [ARCB])    1.155913
                           Equity(62 [ABT])     4.818421
                           Equity(67 [ADSK])    1.724293
Name: MyFactor, dtype: float64

Our my_factor variable contains a pandas Series with a factor value for each equity in our universe for each point in time.

Here we create another Series that contains sector codes for each equity instead of factor values. This is categorical data that we will use as a parameter for Alphalens later.

In [9]:
sectors = results['Sector']

While our universe is defined to consist of 1500 stocks, the actual number of unique stocks that we end up ranking will likely be greater than this due to stocks passing in and out of our filters. For this reason, we grab pricing data for any stock that may have been in our Pipeline at some point to make sure that we have all of the data that we might need.

In [10]:
asset_list = results.index.levels[1].unique()
In [11]:
prices = get_pricing(asset_list, start_date='2015-01-01', end_date='2016-02-01', fields='price')
In [12]:
prices.head()
Out[12]:
Equity(2 [ARNC]) Equity(21 [AAME]) Equity(24 [AAPL]) Equity(25 [ARNC_PR]) Equity(31 [ABAX]) Equity(39 [DDC]) Equity(41 [ARCB]) Equity(52 [ABM]) Equity(53 [ABMD]) Equity(62 [ABT]) ... Equity(49682 [DYLS]) Equity(49683 [IMOM]) Equity(49684 [MCX]) Equity(49685 [NOK_WI]) Equity(49686 [RIV]) Equity(49687 [RNVA_W]) Equity(49688 [UDBI]) Equity(49689 [LVHD]) Equity(49690 [EDBI]) Equity(49691 [DDBI])
2015-01-02 00:00:00+00:00 15.717 4.01 107.469 83.562246 57.253 17.061 45.513 27.804 37.30 43.701 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-01-05 00:00:00+00:00 14.817 3.99 104.470 83.562246 57.124 17.070 44.839 27.872 37.09 43.721 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-01-06 00:00:00+00:00 14.906 3.90 104.451 84.172000 56.648 17.061 42.805 27.999 36.13 43.205 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-01-07 00:00:00+00:00 15.302 3.90 105.945 82.343000 56.638 17.195 41.734 28.479 37.28 43.565 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-01-08 00:00:00+00:00 15.757 3.93 109.996 82.285000 58.047 17.080 42.716 28.792 38.96 44.451 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 10076 columns

Alphalens Components

Now that we have the basic components of what we need to analyze our factor, we can start to deal with Alphalens. Note that we will be breaking out individual components of the package, so this is not the typical workflow for using an Alphalens tear sheet.

First we get our factor categorized by sector code and calculate our forward returns. The forward returns are the returns that we would have received for holding each security over the day periods beginning on the given date, as specified by the periods parameter. In our case, and by default, we look $1$, $5$, and $10$ days ahead. We can consider this a budget backtest: the tear sheet does not factor in any commission or slippage cost; rather, it considers the returns as if we had magically held the specified equities for the specified number of days forward from the current day.
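Under the hood, an N-day forward return is just the percentage change from the current price to the price N days ahead. A minimal pandas version (with fabricated prices for a single asset) looks like:

```python
import pandas as pd

# Fabricated daily prices for one asset
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 104.0, 108.0])

def forward_returns(prices, period):
    # Return earned over the `period` days following each date
    return prices.shift(-period) / prices - 1.0

fwd_1d = forward_returns(prices, 1)  # first entry: 102/100 - 1 = 0.02
fwd_5d = forward_returns(prices, 5)  # first entry: 108/100 - 1 = 0.08
```

The last `period` entries are necessarily NaN, since the future prices they need fall outside the series; this is one reason we pulled pricing data past the end of our factor window above.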

In [13]:
factor_data = al.utils.get_clean_factor_and_forward_returns(factor=my_factor,
                                                            prices=prices,
                                                            groupby=sectors,
                                                            groupby_labels=MORNINGSTAR_SECTOR_CODES,
                                                            periods=(1,5,10))
Dropped 2.7% entries from factor data: 2.7% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!

The factor column here is similar to the my_factor variable above. It holds a factor value for every equity in our universe at each point in time. Our Alphalens function has also provided a sector grouping to go along with each factor value.

In [14]:
factor_data.head()
Out[14]:
1D 5D 10D factor group factor_quantile
date asset
2015-01-02 00:00:00+00:00 Equity(2 [ARNC]) -0.057263 0.014507 -0.038366 5.239913 Basic Materials 5
Equity(24 [AAPL]) -0.027906 0.024705 -0.030372 5.451305 Technology 5
Equity(41 [ARCB]) -0.014809 -0.095028 -0.084350 1.155913 Industrials 3
Equity(62 [ABT]) 0.000458 0.006682 -0.004325 4.818421 Healthcare 5
Equity(67 [ADSK]) -0.014614 -0.020998 -0.045187 1.724293 Technology 4

As explained above, the forward returns are the returns that we would have received for holding each security for the specified number of days, beginning on the given date. These, too, are broken out by sector.

This function also separates our factor into quantiles for each date, replacing the factor value with its appropriate quantile on a given day. Since we will be holding baskets of the top and bottom quantiles, we only care about the factor insofar as it relates to movement into and out of these baskets.
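The per-date quantile assignment can be sketched with pandas, using hypothetical factor values for five assets on a single date (Alphalens uses 5 quantiles by default):

```python
import pandas as pd

# Hypothetical factor values for five assets on one date
factor = pd.DataFrame({
    'date':  ['2015-01-02'] * 5,
    'value': [5.24, 5.45, 1.16, 4.82, 1.72],
})

# Bucket each date's values into quintiles, labeled 1 (lowest) through 5 (highest)
factor['quantile'] = (factor.groupby('date')['value']
                            .transform(lambda x: pd.qcut(x, 5, labels=False) + 1))
```

Grouping by date before binning matters: it means an asset's quantile reflects its rank relative to the rest of the universe on that day, not relative to its own history.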

We take this quantized factor and calculate the returns for each basket (quantile) over the given day periods.

In [15]:
mean_return_by_q_daily, std_err_by_q_daily = al.performance.mean_return_by_quantile(factor_data,
                                                                                    by_date=True)
In [16]:
mean_return_by_q_daily.head()
Out[16]:
1D 5D 10D
factor_quantile date
1 2015-01-02 00:00:00+00:00 -0.010208 -0.021429 -0.034598
2015-01-05 00:00:00+00:00 -0.004830 -0.015678 -0.030892
2015-01-06 00:00:00+00:00 -0.002547 -0.014234 -0.015757
2015-01-07 00:00:00+00:00 0.001610 -0.010564 -0.016571
2015-01-08 00:00:00+00:00 -0.001009 -0.022151 -0.019200

The by_date boolean flag determines whether we look at our quantile returns as a time series or as point estimates. Here we calculate only the point estimates.
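In essence, the point estimate for each quantile is just the mean forward return pooled across all dates. A toy standalone version (with made-up quantile labels and returns):

```python
import pandas as pd

# Toy quantile labels and 1-day forward returns pooled across dates
data = pd.DataFrame({
    'factor_quantile': [1, 1, 5, 5],
    '1D':              [-0.010, -0.004, 0.008, 0.012],
})

# One mean forward return per quantile
mean_by_q = data.groupby('factor_quantile')['1D'].mean()
```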

In [17]:
mean_return_by_q, std_err_by_q = al.performance.mean_return_by_quantile(factor_data,
                                                                        by_group=False)
In [18]:
mean_return_by_q.head()
Out[18]:
1D 5D 10D
factor_quantile
1 -0.000584 -0.002833 -0.005991
2 -0.000149 -0.000494 -0.000519
3 0.000146 0.000728 0.001421
4 0.000242 0.001003 0.001781
5 0.000344 0.001601 0.003319

These point estimates were also calculated agnostic of the sector groupings so they give us an overall view of what our spread would look like if we traded this factor with a long-short equity algorithm and didn't examine which sectors those returns were coming from.

In some of these plots we apply a utility function to our quantile returns before plotting. The purpose here is to look at the growth rates for each time period to get everything on the same scale for easy comparison. This happens automatically when creating tear sheets, but since we are computing everything individually here we have to make sure we include the helper function.
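The idea behind the conversion is geometric: an N-day cumulative return is mapped to the one-day rate that would compound to it. A simplified stand-in for what al.utils.rate_of_return does:

```python
def one_day_rate(total_return, n_days):
    """Geometric per-day rate implied by an n-day cumulative return."""
    return (1.0 + total_return) ** (1.0 / n_days) - 1.0

# A 5% return earned over 5 days compounds from roughly 0.98% per day
r5 = one_day_rate(0.05, 5)
# A 1-day return is already a 1-day rate, so it passes through unchanged
r1 = one_day_rate(0.02, 1)
```

Without this normalization, the 10-day bars would dwarf the 1-day bars simply because they cover a longer holding period, not because the factor is any more predictive at that horizon.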

In [37]:
# rate_of_return requires the base period as a second argument,
# so we pass it through apply's `args` parameter
al.plotting.plot_quantile_returns_bar(
    mean_return_by_q.apply(al.utils.rate_of_return, axis=0, args=('1D',))
);