
Applying Alpha Vertex Machine Learning to a Mean-Reversion Strategy

by Jeremy Muhia

The objective of this study was to use the Alpha Vertex data set to enhance a short-term mean-reversion strategy. The data set contains 5-day future returns as predicted by machine learning models. These predictions were used to screen stocks in or out of the portfolio based on future-return quantiles. The rest of this notebook focuses on describing the Alpha Vertex data.
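As a toy illustration of the quantile screen described above (hypothetical tickers and predictions; `pd.qcut` stands in for whatever binning the full study used):

```python
import pandas as pd

# Hypothetical predicted 5-day returns for a handful of tickers
predicted = pd.Series(
    [0.031, -0.024, 0.008, -0.001, 0.017, -0.040, 0.002, 0.026],
    index=["A", "B", "C", "D", "E", "F", "G", "H"],
)

# Bin the predictions into quartiles; higher label = higher predicted return
quantile = pd.qcut(predicted, 4, labels=False)

# Keep the top quartile as long candidates and the bottom quartile as shorts
longs = predicted[quantile == 3].index.tolist()
shorts = predicted[quantile == 0].index.tolist()
```

In the real study the screen runs each day over the pipeline universe rather than a fixed list of names.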

Run the cell below before beginning.

In [13]:
from quantopian.pipeline import Pipeline
from quantopian.pipeline.filters import Q500US
from quantopian.pipeline.factors import CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.classifiers.morningstar import Sector
from quantopian.research import run_pipeline

import alphalens

# These imports can be found in the store panel for each dataset
# (https://www.quantopian.com/data). Note that not all store datasets
# can be used in pipeline yet.
from quantopian.pipeline.data.alpha_vertex import (
    # Top 100 Securities
    precog_top_100 as dataset_100,
    # Top 500 Securities
    precog_top_500 as dataset_500
)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt



class PredictionQuality(CustomFactor):
    """
    create a customized factor to calculate the prediction quality
    for each stock in the universe.
    
    compares the percentage of predictions with the correct sign 
    over a rolling window (3 weeks) for each stock
   
    """
    # data used to create custom factor
    inputs = [dataset_500.predicted_five_day_log_return, USEquityPricing.close]
    
    # lookback window in trading days (~3 weeks); adjust as needed
    window_length = 15

    def compute(self, today, assets, out, pred_ret, px_close):
        # actual returns
        px_close_df = pd.DataFrame(data=px_close)
        pred_ret_df = pd.DataFrame(data=pred_ret)
        log_ret5_df = np.log(px_close_df) - np.log(px_close_df.shift(5))

        log_ret5_df = log_ret5_df.iloc[5:].reset_index(drop=True)
        n = len(log_ret5_df)
        
        # predicted returns
        pred_ret_df = pred_ret_df.iloc[:n]

        # number of predictions with incorrect sign
        err_df = (np.sign(log_ret5_df) - np.sign(pred_ret_df)).abs() / 2.0

        # custom quality measure: 1 minus the exponentially weighted error rate
        pred_quality = (1 - err_df.ewm(min_periods=n, com=n).mean()).iloc[-1].values
        
        out[:] = pred_quality

        
        
class NormalizedReturn(CustomFactor):
    """
    Custom Factor to calculate the normalized forward return 
       
    scales the forward return expecation by the historical volatility
    of returns
    
    """

    # data used to create custom factor
    inputs = [dataset_500.predicted_five_day_log_return, USEquityPricing.close]
    
    # window length; only the most recent day's predictions are used
    window_length = 10

    def compute(self, today, assets, out, pred_ret, px_close):
        # cross-sectional mean of the latest predicted returns
        avg_ret = np.nanmean(pred_ret[-1], axis=0)
        
        # cross-sectional standard deviation of the latest predicted returns
        std_ret = np.nanstd(pred_ret[-1], axis=0)

        # normalized returns
        norm_ret = (pred_ret[-1] - avg_ret) / std_ret

        out[:] = norm_ret



START = '2016-01-01'
END = '2017-01-01'

MORNINGSTAR_SECTOR_CODES = {
     -1: 'Misc',
    101: 'Basic Materials',
    102: 'Consumer Cyclical',
    103: 'Financial Services',
    104: 'Real Estate',
    205: 'Consumer Defensive',
    206: 'Healthcare',
    207: 'Utilities',
    308: 'Communication Services',
    309: 'Energy',
    310: 'Industrials',
    311: 'Technology' ,    
}

We start by creating a universe of stocks that are included in the Q500 and also have a recent prediction in the Alpha Vertex data set.

Then, the PredictionQuality factor is used to create a filter so that normalized returns are calculated only for stocks whose prediction quality is above a given threshold.

Finally, this normalized return value is used to create a pipeline from the beginning of 2016 to the beginning of 2017.
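As a rough standalone sketch of the sign-agreement measure behind PredictionQuality (toy series for a single stock; the real factor aligns 5-day forward windows inside the pipeline), the quality score is one minus an exponentially weighted error rate:

```python
import numpy as np
import pandas as pd

# Toy realized and predicted 5-day log returns for one stock
realized = pd.Series([0.01, -0.02, 0.03, -0.01, 0.02, 0.015])
predicted = pd.Series([0.02, 0.01, 0.02, -0.03, 0.01, -0.005])

# 1.0 where the predicted sign is wrong, 0.0 where it is right
err = (np.sign(realized) - np.sign(predicted)).abs() / 2.0

# Exponentially weighted share of wrong signs; quality = 1 - error rate
n = len(err)
quality = 1.0 - err.ewm(com=n, min_periods=1).mean().iloc[-1]
```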

In [14]:
# get stocks covered in the Q500 that have recent prediction data in Alpha Vertex
covered_stocks = Q500US() & dataset_500.predicted_five_day_log_return.latest.notnull()

prediction_quality = PredictionQuality(mask=covered_stocks)
quality = prediction_quality > 0.65
normalized_return = NormalizedReturn(mask=quality)

# create a pipeline of only stocks that are covered above
pipe = Pipeline(
    columns={
        'predicted 5 day returns' : dataset_500.predicted_five_day_log_return.latest,
        'normalized returns': normalized_return,
        'sector' : Sector(mask=covered_stocks)
    },
    screen=covered_stocks
)

# run the pipeline
pipe_output = run_pipeline(pipe, start_date=START, end_date=END)

Below is a sample of the pipeline output. Also, note that there are 457 unique stocks in the entire pipeline.

In [15]:
pipe_output.head()
Out[15]:
                                              normalized returns  predicted 5 day returns  sector
2016-01-04 00:00:00+00:00  Equity(2 [ARNC])           -0.250973                    -0.016     101
                           Equity(24 [AAPL])                NaN                     0.038     311
                           Equity(62 [ABT])            0.252332                    -0.005     206
                           Equity(67 [ADSK])           0.618371                     0.003     311
                           Equity(76 [TAP])            0.206577                    -0.006     205
In [16]:
# this is the number of unique securities in the dataframe over the entire year that the pipeline is run
len(pipe_output.index.get_level_values(1).unique())
Out[16]:
457

The figure below shows how many stocks have a normalized return prediction (blue line), a non-normalized return prediction (green line), and a valid sector code (red line).

Notice that there are far fewer stocks with a normalized return prediction. My assumption is that, because normalized returns are calculated only for stocks whose predictions exceed the quality threshold, only a fraction of the stocks end up with normalized-returns predictions. Also, the upper and lower bounds for stocks with normalized-returns predictions tighten over time; I'm not sure why this happens.
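For reference, the normalization that NormalizedReturn applies reduces to a cross-sectional z-score of each day's predictions; a minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical predicted returns across the filtered universe on one day
pred = np.array([0.010, -0.004, 0.022, 0.001, -0.015])

# Cross-sectional z-score: subtract the universe mean, divide by its std
norm = (pred - np.nanmean(pred)) / np.nanstd(pred)
```

By construction the normalized values have mean 0 and standard deviation 1 across the universe each day, which matches the describe() output further below.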

In [17]:
# the green line shows the number of stocks at each time period with valid predicted returns
# the blue line shows the number of stocks at each time period with valid normalized returns

# the significant difference here is due to the prediction quality filter imposed on the normalized returns
pipe_output.groupby(pipe_output.index.get_level_values(0)).count().plot()
Out[17]:
[plot: number of stocks per day with valid normalized returns, predicted returns, and sector codes]

The figure below plots the range of non-normalized predicted returns over the pipeline's time span.

In [18]:
# this shows the range of predicted returns at each time period
pipe_output['predicted 5 day returns'].plot()
Out[18]:
[plot: predicted 5-day returns for all stocks over the pipeline period]

Because stocks whose predictions do not meet the quality threshold have no normalized-returns prediction, we need to drop the NaN rows from the dataframe.
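A minimal illustration of the effect on a hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "normalized returns": [0.5, np.nan, -0.3],
    "predicted 5 day returns": [0.01, 0.02, -0.005],
})

# Rows containing any NaN (here, the unfiltered stock) are dropped
clean = df.dropna()
```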

In [19]:
pipe_output.dropna().describe()
Out[19]:
       normalized returns  predicted 5 day returns        sector
count        3.002100e+04             30021.000000  30021.000000
mean        -4.519145e-18                 0.000633    207.040771
std          1.000017e+00                 0.034088     89.601828
min         -8.495784e+00                -1.010000     -1.000000
25%         -5.133776e-01                -0.013000    103.000000
50%         -2.140750e-02                 0.002000    206.000000
75%          4.977289e-01                 0.014000    310.000000
max          9.974787e+00                 1.249000    311.000000

Finally, the tear sheet is below.

In [20]:
assets = pipe_output.index.levels[1].unique()
pricing = get_pricing(assets, START, '2017-02-28', fields='open_price')
In [21]:
factor_data = alphalens.utils.get_clean_factor_and_forward_returns(
    pipe_output['predicted 5 day returns'],
    pricing,
    quantiles=5,
    groupby=pipe_output['sector'],
    periods=(1, 5, 10)
)
In [22]:
alphalens.tears.create_full_tear_sheet(factor_data)
Quantiles Statistics
                    min    max      mean       std  count    count %
factor_quantile
1                -1.135  0.012 -0.030385  0.036150  22519  20.961361
2                -0.080  0.022 -0.009182  0.014120  21969  20.449405
3                -0.048  0.035 -0.000307  0.011998  21492  20.005399
4                -0.027  0.150  0.008884  0.012644  20839  19.397567
5                -0.011  1.249  0.031947  0.041082  20612  19.186268

Returns Analysis
                                                    1       5       10
Ann. alpha                                      0.198   0.086   0.022
beta                                           -0.039  -0.043  -0.000
Mean Period Wise Return Top Quantile (bps)      8.166  25.111  26.954
Mean Period Wise Return Bottom Quantile (bps)  -6.513  -6.907   4.228
Mean Period Wise Spread (bps)                  14.926   6.439   2.239

Information Analysis
                   1       5       10
IC Mean        0.032   0.030   0.009
IC Std.        0.117   0.126   0.122
t-stat(IC)     4.299   3.817   1.203
p-value(IC)    0.000   0.000   0.230
IC Skew       -0.146  -0.103  -0.230
IC Kurtosis    0.823   0.379  -0.358
Ann. IR        4.290   3.809   1.201

Turnover Analysis
                              1      5      10
Quantile 1 Mean Turnover  0.361  0.713  0.785
Quantile 2 Mean Turnover  0.594  0.770  0.791
Quantile 3 Mean Turnover  0.633  0.763  0.773
Quantile 4 Mean Turnover  0.600  0.786  0.806
Quantile 5 Mean Turnover  0.363  0.714  0.781

                                      1      5       10
Mean Factor Rank Autocorrelation  0.702  0.097  -0.061
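The IC figures above are rank correlations between the factor and subsequent returns; a rough sketch of how such an information coefficient can be computed for a single date (toy data, pure NumPy):

```python
import numpy as np

def rank(x):
    # Rank transform (0 = smallest); ties ignored for this toy example
    r = np.empty_like(x, dtype=float)
    r[np.argsort(x)] = np.arange(len(x))
    return r

# Toy factor values and subsequent 5-day returns on one date
factor = np.array([0.03, -0.02, 0.01, 0.00, 0.02, -0.01])
fwd = np.array([0.012, -0.008, 0.002, -0.001, 0.009, 0.001])

# Information coefficient: Spearman rank correlation between factor and forward returns
ic = np.corrcoef(rank(factor), rank(fwd))[0, 1]
```

Alphalens computes this per date and then reports the mean, standard deviation, and t-statistic of the resulting IC series.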
[tear sheet plots omitted]