Notebook

Factor Analysis

by Maxwell Margenot, Gil Wassermann, James Christopher Hall, and Delaney Granizo-Mackenzie.

Part of the Quantopian Lecture Series:

Notebook released under the Creative Commons Attribution 4.0 License. Please do not remove this attribution.

How can we tell whether an alpha factor is good or not? Unfortunately, there is no cut-off that tells you whether a factor is objectively useful. Instead, we need to compare a particular factor to other options before deciding whether to use it. Our end goal in defining and selecting the best factors is to use them to rank stocks in a long-short equity strategy, covered elsewhere in the lecture series. The more independent predictive the factors we use, the better our ranking scheme and our overall strategy will be.

What we want when comparing factors is to make sure the chosen signal is actually predictive of relative price movements. We do not want to predict the absolute amount the assets in our universe will move up or down. We only care that we can choose assets to long that will do better than the assets we short. In a long-short equity strategy, we hold a long basket and a short basket of assets, determined by the factor values associated with each asset in our universe. If our ranking scheme is predictive, this means that assets in the top basket will tend to outperform assets in the bottom basket. As long this spread is consistent over time our strategy will have a positive return.

An individual factor can have a lot of moving parts to assess, but ideally it should be independent of other factors that you are already trading on in order to keep your portfolio diverse. We discuss the reasoning for this in this lecture.

In this lecture, we detail and explain relevant statistics to evaluate your alpha factor before attempting to implement it in an algorithm. What's important to keep in mind is that all the metrics provided here are relative to other factors you may be trading or evaluating.

Let's have a look at a factor and try to assess its viability. We will calculate the factor values using Pipeline, so make sure you check out the tutorial if you are unfamiliar with how Pipeline works.

In [1]:
import numpy as np
import pandas as pd
from quantopian.research import run_pipeline
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import CustomFactor, Returns
from quantopian.pipeline.classifiers.morningstar import Sector
from quantopian.pipeline.filters import Q1500US
from time import time

Momentum

Here we will be using a momentum factor as our example. Momentum factors are a very common form of alpha factor and they come in many shapes and sizes. They all try to get at the same idea, however, that securities in motion will stay in motion. Momentum factors try to quantify trends in financial markets and to "ride the wave", so to speak.

Let's say that we suspect that a momentum factor could potentially be predictive of stock returns. We define it as a CustomFactor so that we can easily pull its values when we run our Pipeline. We should get a factor value for every security in our universe.

In [2]:
class MyFactor(CustomFactor):
        """ Momentum factor """
        inputs = [USEquityPricing.close,
                  Returns(window_length=126)]
        window_length = 252

        def compute(self, today, assets, out, prices, returns):
            out[:] = ((prices[-21] - prices[-252])/prices[-252] - \
                      (prices[-1] - prices[-21])/prices[-21]) / np.nanstd(returns, axis=0)

This momentum factor takes the change in price over the past year, up until a month ago, and standardizes it over the the change in price over the last month. This allows it to account for any changes over the past month and use them to temper its expectations of future movement.

Judging a Factor with Alphalens

In order to judge whether a factor is viable, we have created a package called Alphalens. Its source code is available on github if you want to get into the nitty-gritty of how it works. We use Alphalens to create a "tear sheet" of a factor, similar to how we use pyfolio to create a tear sheet for analyzing backtests.

In [3]:
import alphalens as al

Alphalens takes your factor and examines how useful it is for predicting relative value through a collection of different metrics. It breaks all the stocks in your chosen universe into different quantiles based on their ranking according to your factor and analyzes the returns, information coefficient (IC), the turnover of each quantile, and provides a breakdown of returns and IC by sector.

Throughout the course of this lecture we will detail how to interpret the various individual plots generated by an Alphalens tear sheet and include the proper call to generate the whole tear sheet at once at the end.

Sector Codes

These are the possible sector codes for each security, as given by Morningstar. We will use this dictionary to help categorize our results as we walk through a factor analysis so that we can break out our information by sector.

In [4]:
MORNINGSTAR_SECTOR_CODES = {
     -1: 'Misc',
    101: 'Basic Materials',
    102: 'Consumer Cyclical',
    103: 'Financial Services',
    104: 'Real Estate',
    205: 'Consumer Defensive',
    206: 'Healthcare',
    207: 'Utilities',
    308: 'Communication Services',
    309: 'Energy',
    310: 'Industrials',
    311: 'Technology' ,    
}

Defining a universe

As always, we need to define our universe. In this case we use the Q1500US, as seen in the forums here.

In [5]:
universe = Q1500US()

Getting Data

Now we will pull values for our factor for all stocks in our universe by using Pipeline. We also want to make sure that we have the sector code for each individual equity, so we add Sector as another factor for our Pipeline. Note that running the Pipeline may take a while.

In [6]:
pipe = Pipeline(
    columns = {
            'MyFactor' : MyFactor(mask=universe),
            'Sector' : Sector()
    },
    screen=universe
)

start_timer = time()
results = run_pipeline(pipe, '2015-01-01', '2016-01-01')
end_timer = time()
results.fillna(value=0);
/usr/local/lib/python2.7/dist-packages/numpy/lib/nanfunctions.py:1147: RuntimeWarning: Degrees of freedom <= 0 for slice.
  warnings.warn("Degrees of freedom <= 0 for slice.", RuntimeWarning)
In [7]:
print "Time to run pipeline %.2f secs" % (end_timer - start_timer)
Time to run pipeline 76.67 secs

Let's take a look at the data to get a quick sense of what we have.

In [8]:
my_factor = results['MyFactor']
print my_factor.head()
2015-01-02 00:00:00+00:00  Equity(2 [ARNC])     5.239913
                           Equity(24 [AAPL])    5.451305
                           Equity(41 [ARCB])    1.155913
                           Equity(62 [ABT])     4.818421
                           Equity(67 [ADSK])    1.724293
Name: MyFactor, dtype: float64

Our my_factor variable contains a pandas Series with a factor value for each equity in our universe for each point in time.

Here we create another Series that contains sector codes for each equity instead of factor values. This is categorical data that we will use as a parameter for Alphalens later.

In [9]:
sectors = results['Sector']

While our universe is defined to consist of 1500 stocks, the actual number of unique stocks that we end up ranking will likely be greater than this due to stocks passing in and out of our filters. For this reason, we grab pricing data for any stock that may have been in our Pipeline at some point to make sure that we have all of the data that we might need.

In [10]:
asset_list = results.index.levels[1].unique()
In [11]:
prices = get_pricing(asset_list, start_date='2015-01-01', end_date='2016-02-01', fields='price')
In [12]:
prices.head()
Out[12]:
Equity(2 [ARNC]) Equity(24 [AAPL]) Equity(41 [ARCB]) Equity(53 [ABMD]) Equity(62 [ABT]) Equity(67 [ADSK]) Equity(76 [TAP]) Equity(110 [ACXM]) Equity(114 [ADBE]) Equity(122 [ADI]) ... Equity(48925 [ADRO]) Equity(48934 [ETSY]) Equity(48962 [UNIT]) Equity(49003 [TLN]) Equity(49051 [APLE]) Equity(49060 [SHOP]) Equity(49141 [CPGX]) Equity(49176 [TRU]) Equity(49229 [KHC]) Equity(49496 [FDC])
2015-01-02 00:00:00+00:00 15.717 107.469 45.513 37.30 43.701 59.53 72.300 19.605 72.33 54.041 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-01-05 00:00:00+00:00 14.817 104.470 44.839 37.09 43.721 58.66 71.889 19.425 71.99 53.058 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-01-06 00:00:00+00:00 14.906 104.451 42.805 36.13 43.205 57.50 71.527 19.080 70.52 51.812 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-01-07 00:00:00+00:00 15.302 105.945 41.734 37.28 43.565 57.37 73.807 19.330 71.12 52.357 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2015-01-08 00:00:00+00:00 15.757 109.996 42.716 38.96 44.451 58.80 76.088 19.790 72.91 53.281 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 1735 columns

Alphalens Components

Now that we have the basic components of what we need to analyze our factor, we can start to deal with Alphalens. Note that we will be breaking out individual components of the package, so this is not the typical workflow for using an Alphalens tear sheet.

First we get our factor categorized by sector code and calculate our forward returns. The forward returns are the returns that we would have received for holding each security over the day periods ending on the given date, passed in through the periods parameter. In our case, and by default, we look $1$, $5$, and $10$ days in advance. We can consider this a budget backtest. The tear sheet does not factor in any commission or slippage cost, rather, it only considers values as if we had magically already held the specified equities for the specified number of days up to the current day.

In [13]:
factor_data = al.utils.get_clean_factor_and_forward_returns(factor=my_factor,
                                                            prices=prices,
                                                            groupby=sectors,
                                                            groupby_labels=MORNINGSTAR_SECTOR_CODES,
                                                            periods=(1,5,10))

The factor variable here is similar to the my_factor variable above. It has a factor value for every equity in our universe at each point in time. Our Alphalens function here has also provided a sector grouping to go along with the factor value.

In [14]:
factor_data.head()
Out[14]:
1 5 10 factor group factor_quantile
date asset
2015-01-02 00:00:00+00:00 Equity(2 [ARNC]) -0.057263 0.014507 -0.038366 5.239913 Basic Materials 5
Equity(24 [AAPL]) -0.027906 0.024705 -0.030372 5.451305 Technology 5
Equity(41 [ARCB]) -0.014809 -0.095028 -0.084350 1.155913 Industrials 3
Equity(62 [ABT]) 0.000458 0.006682 -0.004325 4.818421 Healthcare 5
Equity(67 [ADSK]) -0.014614 -0.020998 -0.045187 1.724293 Technology 4

As explained above, the forward returns are the returns that we would have received for holding each security for the specified number of days, ending on the given date. These, too, are broken out by sector.

This function also separates our factor into quantiles for each date, replacing the factor value with its appropriate quantile on a given day. Since we will be holding baskets of the top and bottom quantiles, we only care about the factor insofar as it relates to movement into and out of these baskets.

We take this quantized factor and calculate the returns for each basket (quantile) over the given day periods.

In [15]:
mean_return_by_q_daily, std_err_by_q_daily = al.performance.mean_return_by_quantile(factor_data,
                                                                                    by_date=True)
In [16]:
mean_return_by_q_daily.head()
Out[16]:
1 5 10
factor_quantile date
1 2015-01-02 00:00:00+00:00 -0.010045 -0.021101 -0.033956
2015-01-05 00:00:00+00:00 -0.005019 -0.015829 -0.030863
2015-01-06 00:00:00+00:00 -0.002690 -0.014040 -0.015068
2015-01-07 00:00:00+00:00 0.001770 -0.009864 -0.015686
2015-01-08 00:00:00+00:00 -0.000807 -0.021366 -0.018793

With the by_date boolean flag, we decide whether to look at our quantized returns either as a time series or as a point estimate. Here we calculate only the point estimates.

In [17]:
mean_return_by_q, std_err_by_q = al.performance.mean_return_by_quantile(factor_data,
                                                                        by_group=False)
In [18]:
mean_return_by_q.head()
Out[18]:
1 5 10
factor_quantile
1 -0.000565 -0.002743 -0.005745
2 -0.000163 -0.000489 -0.000495
3 0.000138 0.000677 0.001231
4 0.000249 0.000972 0.001731
5 0.000342 0.001586 0.003286

These point estimates were also calculated agnostic of the sector groupings so they give us an overall view of what our spread would look like if we traded this factor with a long-short equity algorithm and didn't examine which sectors those returns were coming from.

In some of these plots we apply a utility function to our quantile returns before plotting. The purpose here is to look at the growth rates for each time period to get everything on the same scale for easy comparison. This happens automatically when creating tear sheets, but since we are computing everything individually here we have to make sure we include the helper function.

In [19]:
al.plotting.plot_quantile_returns_bar(mean_return_by_q.apply(al.utils.rate_of_return, axis=0));

Here we construct the violin plots of our time series of quantile returns. A violin plot shows the density of our data. The fatter the violin, the higher the density of returns in that region. Here we plot the violins for the $1$, $5$, and $10$ day forward returns for each quantile.

In [20]:
al.plotting.plot_quantile_returns_violin(mean_return_by_q_daily.apply(al.utils.rate_of_return, axis=0));

Here we calculate the basis points of the spread, based on subtracting the mean return of the lowest quantile from the mean return of the highest quantile (simulating going short on the lowest and long on the highest). We also get the error and plot it all together, giving us a time series of the basis points with confidence intervals for each time period.

In [21]:
quant_return_spread, std_err_spread = al.performance.compute_mean_returns_spread(mean_return_by_q_daily.apply(al.utils.rate_of_return, axis=0),
                                                                                 upper_quant=5,
                                                                                 lower_quant=1,
                                                                                 std_err=std_err_by_q_daily)
In [22]:
al.plotting.plot_mean_quantile_returns_spread_time_series(quant_return_spread, std_err_spread);

This next plot aggregates the returns of each individual quantile into a plot of cumulative returns separated by basket for the 1-period forward returns. What we want here is to see five discrete "fingers" with few to no crossovers. This will give us an idea of which quantiles tend to drive the returns (ideally the first and fifth).

In [23]:
al.plotting.plot_cumulative_returns_by_quantile(mean_return_by_q_daily);

This next function gives us the returns for each time period weighted by factor.

In [24]:
ls_factor_returns = al.performance.factor_returns(factor_data)
In [25]:
ls_factor_returns.head()
Out[25]:
1 5 10
date
2015-01-02 00:00:00+00:00 0.007510 0.016452 0.029466
2015-01-05 00:00:00+00:00 0.004336 0.011858 0.024203
2015-01-06 00:00:00+00:00 0.002296 0.011019 0.015208
2015-01-07 00:00:00+00:00 0.000155 0.007644 0.013604
2015-01-08 00:00:00+00:00 0.001378 0.016168 0.014851

Here we plot the cumulative factor-weighted returns in a long-short portfolio. This shows the performance of the factor as a whole, which is always important to consider. A long-short portfolio will only involve the first and fifth quantiles.

In [26]:
al.plotting.plot_cumulative_returns(ls_factor_returns[1]);

Now we calculate the $\alpha$ and $\beta$ of our factor with respect to the market. These are calculated by creating a regression of the market returns for each period against a long-short factor portfolio and extracting the parameters. These signify the excess return associated with our factor and the market beta, respectively.

In [27]:
alpha_beta = al.performance.factor_alpha_beta(factor_data)
In [28]:
alpha_beta
Out[28]:
1 5 10
Ann. alpha 0.099815 0.082969 0.077813
beta -0.070708 -0.189594 -0.199872

Returns Tear Sheet

If we are solely interested in the above returns plots, we can create a tear sheet that only contains this returns analysis. The following code block generates all of the above graphs once we have stored the forward returns data!

In [ ]:
al.tears.create_returns_tear_sheet(factor_data)

Information Coefficient

We use the information coefficient (IC) to assess the predictive power of a factor. The IC of a factor is its Spearman Rank Correlation. For more background on the mathematics associated with the IC, check out the Spearman Rank Correlation Lecture. To break it down, we calculate the IC between the factor values and the forward returns for each period. The IC assesses the monotonic relationship between factors and returns. What this means, intuitively, is that it provides a measure for whether higher factor values can be associated with higher returns. A higher IC indicates that higher factor values are more closely associated with higher return values (and lower factor values with lower return values). A very negative IC indicates that higher factor values are closely associated with lower return values. An IC of $0$ indicates no relationship.

Using Alphalens, we extract the IC for each time period below.

In [29]:
ic = al.performance.factor_information_coefficient(factor_data)
In [30]:
ic.head()
Out[30]:
1 5 10
date
2015-01-02 00:00:00+00:00 0.305055 0.344902 0.375436
2015-01-05 00:00:00+00:00 0.253184 0.265428 0.328380
2015-01-06 00:00:00+00:00 0.175906 0.205615 0.230497
2015-01-07 00:00:00+00:00 -0.018402 0.144102 0.196473
2015-01-08 00:00:00+00:00 0.062367 0.286953 0.214206

Here we plot the IC as a time series for each period along with a 1-month moving average to smooth it out. What we want here is consistency over time and a consistently informative signal.

In [31]:
al.plotting.plot_ic_ts(ic);

Histograms are good to show us the distribution of the IC. These will clearly show any strange outliers and how they affect the overall curve.

In [32]:
al.plotting.plot_ic_hist(ic);

A QQ-plot compares the distribution of the IC to the normal distribution. It plots the quantiles of one distribution against the quantiles of the other, typically with a reference line at $y = x$. If the points in the QQ-plot fall entirely along this line, this indicates that the two distributions are the same. In practice, a QQ-plot serves as a measure of similarity between distributions. Generally, what we want to see here is an S-shaped curve. This indicates that the tails of the IC distribution are fatter and contain more information.

In [33]:
al.plotting.plot_ic_qq(ic);

The following heatplots show the monthly mean IC, providing us with another visual of consistency.

In [34]:
mean_monthly_ic = al.performance.mean_information_coefficient(factor_data, by_time='M')
In [35]:
al.plotting.plot_monthly_ic_heatmap(mean_monthly_ic);

Information Tear Sheet

As with the returns tear sheet, we can also create an information tear sheet that just gives us data on the information coefficient.

In [ ]:
al.tears.create_information_tear_sheet(factor_data)

Turnover

When considering the impact of actually implementing a signal in a strategy, turnover is a critical thing to consider. This plot shows the turnover of the top and bottom quantiles of your factor, the baskets that you would actually be trading on with a long-short approach. Excessive turnover will eat away at the profits of your strategy through commission costs. Sometimes a signal just isn't good enough to fight against the tide on the scale that you have to deal with through your broker.

In [36]:
al.plotting.plot_top_bottom_quantile_turnover(factor_data);

This plot shows a fairly low turnover for the factor, implying that we will not be hit too hard by the consist changing of portfolio positions. We cannot see this directly, however, because Alphalens does not model commission costs. It simply provides metrics that we can use to judge a factor by itself. To properly model undermining influences such as slippage and commissions you will need to implement a strategy that uses your factor in the backtester.

Autocorrelation

Factor autocorrelation is the measure of correlation between the current value of the factor and its previous value. The idea behind its calculation is to provide another measure of the turnover of the factor quantiles. If the autocorrelation is low, it implies that the current value of the factor has little to do with the previous value and that portfolio positions are changing frequently from time period to time period. If the next value of the factor is significantly influenced by its last value, this means that your ranking scheme is more consistent (though this has no influence on its ability to forecast relative price movements).

In [37]:
factor_autocorrelation = al.performance.factor_rank_autocorrelation(factor_data)
In [38]:
factor_autocorrelation.head()
Out[38]:
date
2015-01-02 00:00:00+00:00         NaN
2015-01-05 00:00:00+00:00    0.986763
2015-01-06 00:00:00+00:00    0.987452
2015-01-07 00:00:00+00:00    0.987197
2015-01-08 00:00:00+00:00    0.987724
Name: 1, dtype: float64
In [39]:
al.plotting.plot_factor_rank_auto_correlation(factor_autocorrelation);

In this case, we have fairly high autocorrelation, corroborating the turnover plots from above that suggested more consistent portfolio positions.

Turnover Tear Sheet

In [ ]:
al.tears.create_turnover_tear_sheet(factor_data)

Group Breakdown

In addition to all these other metrics, Alphalens provides a breakdown of IC and returns by group. Here we use industry sector as the delineating group. While it is good to consider breakdowns by quantile, it is also important to see how your factor is exposed to the different facets of the market. This is a good way to assess if your factor is behaving as it should in your universe. For example, if you intend your universe to only touch a certain sector, it is worthwhile to confirm that your factor and universe indeed only touch that sector.

Here we get the mean IC by sector for each time period.

In [40]:
ic_by_sector = al.performance.mean_information_coefficient(factor_data, by_group=True)
In [41]:
ic_by_sector.head()
Out[41]:
1 5 10
group
Basic Materials 0.031098 0.047993 0.061131
Communication Services 0.005744 0.035223 0.039208
Consumer Cyclical 0.023176 0.037961 0.043167
Consumer Defensive 0.023258 0.044501 0.053993
Energy 0.031516 0.068109 0.090795
In [42]:
al.plotting.plot_ic_by_group(ic_by_sector);

Looking at the returns by quantile for each individual sector helps to show which sectors are driving the bulk of our returns as well as whether quantiles in each sector are broken out how they should be (with the lowest quantiles giving the lowest returns up to the highest quantiles giving the highest returns). If an individual sector has little to no signal (IC), it makes sense for the quantile returns to be all over the place. We want to make sure that everything is behaving nicely.

In [43]:
mean_return_quantile_sector, mean_return_quantile_sector_err = al.performance.mean_return_by_quantile(factor_data,
                                                                                                      by_group=True)
In [44]:
mean_return_quantile_sector.head()
Out[44]:
1 5 10
factor_quantile group
1 Basic Materials -0.001327 -0.007774 -0.015843
Communication Services -0.000024 -0.001357 -0.002284
Consumer Cyclical -0.000273 -0.001662 -0.003119
Consumer Defensive 0.000529 0.002763 0.005296
Energy -0.001601 -0.007038 -0.015101
In [45]:
al.plotting.plot_quantile_returns_bar(mean_return_quantile_sector.apply(al.utils.rate_of_return, axis=0), by_group=True);

The Short Version

You don't need to run all those different blocks of code every time, fortunately. The individual tear sheets can be incredibly helpful separated from each other, but if you simply want to generate all of the sections at once, you can create a full tear sheet. We only need to pass in the initial factor values, the prices, the periods, and the groupby characteristics for sector breakouts. The syntax for generating the full tear sheet all at once is as follows:

In [ ]:
al.tears.create_full_tear_sheet(factor_data, by_group=True)

More on Factors

Coming up with new factors is all well and good, but often you will need many independently predictive factors to produce a signal that is stable enough to trade. Methods of aggregating factors together will be discussed in future lectures, but the most simple initial approach would just be to normalize the values of each factor you would like to include in the aggregate, add the new normalized factor together, and rank by this new combined factor value.

Next Steps

Once you have a factor that looks good, the next step is to implement it in an algorithm. Unfortunately, it isn't enough to simply have a good signal. Trading algorithms have to take into account many other considerations that are not included in Alphalens. We need to include how the market at large will react to the trades we're making (market impact/slippage) as well as the transaction costs associated with making those trades. These influences can erode our profits if we do not properly assess their impact through extensive testing.

To this end, we have the full backtesting environment. It allows for slippage and transaction cost modeling and lets you set limitations for the amount of leverage (debt) that your algorithm can take on to make its trades. Learn more about leverage in this lecture.

We have an example long-short algorithm that you can clone and use to test your own factors. Try adding in the momentum factor that we played within this lecture to see how the addition of slippage and transaction costs affect the trades made and the resultant profits.

This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. ("Quantopian"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, Quantopian, Inc. has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to Quantopian, Inc. at the time of publication. Quantopian makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.