
Applying Alpha Vertex Machine Learning to a Mean-Reversion Strategy

by Jeremy Muhia

The objective of this study was to use the Alpha Vertex data set to enhance a short-term mean-reversion strategy. The data set contains 5-day future returns as predicted by machine learning models. These predictions were used to screen stocks in or out of the portfolio based on future-return quantiles. The rest of this notebook focuses on describing the Alpha Vertex data.
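As a toy illustration of the quantile screen described above (hypothetical tickers and predictions; `pd.qcut` stands in for whatever binning the full study used):

```python
import pandas as pd

# Hypothetical predicted 5-day returns for a handful of tickers
predicted = pd.Series(
    [0.031, -0.024, 0.008, -0.001, 0.017, -0.040, 0.002, 0.026],
    index=["A", "B", "C", "D", "E", "F", "G", "H"],
)

# Bin the predictions into quartiles; higher label = higher predicted return
quantile = pd.qcut(predicted, 4, labels=False)

# Keep the top quartile as long candidates and the bottom quartile as shorts
longs = predicted[quantile == 3].index.tolist()
shorts = predicted[quantile == 0].index.tolist()
```

In the real study the screen runs each day over the pipeline universe rather than a fixed list of names.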

Run the cell below before beginning.

In [13]:
from quantopian.pipeline import Pipeline
from quantopian.pipeline.filters import Q500US
from quantopian.pipeline.factors import CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.classifiers.morningstar import Sector
from quantopian.research import run_pipeline

import alphalens

# These imports can be found in the store panel for each dataset
# (https://www.quantopian.com/data). Note that not all store datasets
# can be used in pipeline yet.
from quantopian.pipeline.data.alpha_vertex import (
    # Top 100 Securities
    precog_top_100 as dataset_100,
    # Top 500 Securities
    precog_top_500 as dataset_500
)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt



class PredictionQuality(CustomFactor):
    """
    create a customized factor to calculate the prediction quality
    for each stock in the universe.
    
    compares the percentage of predictions with the correct sign 
    over a rolling window (3 weeks) for each stock
   
    """
    # data used to create custom factor
    inputs = [dataset_500.predicted_five_day_log_return, USEquityPricing.close]
    
    # lookback window in trading days (~3 weeks); adjust as needed
    window_length = 15

    def compute(self, today, assets, out, pred_ret, px_close):
        # actual returns
        px_close_df = pd.DataFrame(data=px_close)
        pred_ret_df = pd.DataFrame(data=pred_ret)
        log_ret5_df = np.log(px_close_df) - np.log(px_close_df.shift(5))

        log_ret5_df = log_ret5_df.iloc[5:].reset_index(drop=True)
        n = len(log_ret5_df)
        
        # predicted returns
        pred_ret_df = pred_ret_df.iloc[:n]

        # number of predictions with incorrect sign
        err_df = (np.sign(log_ret5_df) - np.sign(pred_ret_df)).abs() / 2.0

        # custom quality measure: 1 minus the exponentially weighted error rate
        pred_quality = (1 - err_df.ewm(min_periods=n, com=n).mean()).iloc[-1].values
        
        out[:] = pred_quality

        
        
class NormalizedReturn(CustomFactor):
    """
    Custom Factor to calculate the normalized forward return 
       
    scales the forward return expecation by the historical volatility
    of returns
    
    """

    # data used to create custom factor
    inputs = [dataset_500.predicted_five_day_log_return, USEquityPricing.close]
    
    # window length; only the most recent day's predictions are used
    window_length = 10

    def compute(self, today, assets, out, pred_ret, px_close):
        # cross-sectional mean of the latest predicted returns
        avg_ret = np.nanmean(pred_ret[-1], axis=0)
        
        # cross-sectional standard deviation of the latest predicted returns
        std_ret = np.nanstd(pred_ret[-1], axis=0)

        # normalized returns
        norm_ret = (pred_ret[-1] - avg_ret) / std_ret

        out[:] = norm_ret



START = '2016-01-01'
END = '2017-01-01'

MORNINGSTAR_SECTOR_CODES = {
     -1: 'Misc',
    101: 'Basic Materials',
    102: 'Consumer Cyclical',
    103: 'Financial Services',
    104: 'Real Estate',
    205: 'Consumer Defensive',
    206: 'Healthcare',
    207: 'Utilities',
    308: 'Communication Services',
    309: 'Energy',
    310: 'Industrials',
    311: 'Technology' ,    
}

We start by creating a universe of stocks that are included in the Q500 and also have a recent prediction in the Alpha Vertex data set.

Then, the PredictionQuality factor is used to create a filter so that normalized returns are calculated only for stocks whose prediction quality is above a given threshold.

Finally, this normalized return value is used to create a pipeline from the beginning of 2016 to the beginning of 2017.
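As a rough standalone sketch of the sign-agreement measure behind PredictionQuality (toy series for a single stock; the real factor aligns 5-day forward windows inside the pipeline), the quality score is one minus an exponentially weighted error rate:

```python
import numpy as np
import pandas as pd

# Toy realized and predicted 5-day log returns for one stock
realized = pd.Series([0.01, -0.02, 0.03, -0.01, 0.02, 0.015])
predicted = pd.Series([0.02, 0.01, 0.02, -0.03, 0.01, -0.005])

# 1.0 where the predicted sign is wrong, 0.0 where it is right
err = (np.sign(realized) - np.sign(predicted)).abs() / 2.0

# Exponentially weighted share of wrong signs; quality = 1 - error rate
n = len(err)
quality = 1.0 - err.ewm(com=n, min_periods=1).mean().iloc[-1]
```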

In [14]:
# get stocks covered in the Q500 that have recent prediction data in Alpha Vertex
covered_stocks = Q500US() & dataset_500.predicted_five_day_log_return.latest.notnull()

prediction_quality = PredictionQuality(mask=covered_stocks)
quality = prediction_quality > 0.65
normalized_return = NormalizedReturn(mask=quality)

# create a pipeline of only stocks that are covered above
pipe = Pipeline(
    columns={
        'predicted 5 day returns' : dataset_500.predicted_five_day_log_return.latest,
        'normalized returns': normalized_return,
        'sector' : Sector(mask=covered_stocks)
    },
    screen=covered_stocks
)

# run the pipeline
pipe_output = run_pipeline(pipe, start_date=START, end_date=END)

Below is a sample of the pipeline output. Also, note that there are 457 unique stocks in the entire pipeline.

In [15]:
pipe_output.head()
Out[15]:
                                              normalized returns  predicted 5 day returns  sector
2016-01-04 00:00:00+00:00  Equity(2 [ARNC])           -0.250973                    -0.016     101
                           Equity(24 [AAPL])                NaN                     0.038     311
                           Equity(62 [ABT])            0.252332                    -0.005     206
                           Equity(67 [ADSK])           0.618371                     0.003     311
                           Equity(76 [TAP])            0.206577                    -0.006     205
In [16]:
# this is the number of unique securities in the dataframe over the entire year that the pipeline is run
len(pipe_output.index.get_level_values(1).unique())
Out[16]:
457

The figure below shows how many stocks have a normalized return prediction (blue line), a non-normalized return prediction (green line), and a valid sector code (red line).

Notice that there are far fewer stocks with a normalized return prediction. My assumption is that, because normalized returns are calculated only for stocks whose predictions exceed the quality threshold, only a fraction of the stocks end up with normalized-returns predictions. Also, the upper and lower bounds for stocks with normalized-returns predictions tighten over time; I'm not sure why this happens.
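For reference, the normalization that NormalizedReturn applies reduces to a cross-sectional z-score of each day's predictions; a minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical predicted returns across the filtered universe on one day
pred = np.array([0.010, -0.004, 0.022, 0.001, -0.015])

# Cross-sectional z-score: subtract the universe mean, divide by its std
norm = (pred - np.nanmean(pred)) / np.nanstd(pred)
```

By construction the normalized values have mean 0 and standard deviation 1 across the universe each day, which matches the describe() output further below.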

In [17]:
# the green line shows the number of stocks at each time period with valid predicted returns
# the blue line shows the number of stocks at each time period with valid normalized returns

# the significant difference here is due to the prediction quality filter imposed on the normalized returns
pipe_output.groupby(pipe_output.index.get_level_values(0)).count().plot()
Out[17]:
[plot: number of stocks per day with valid normalized returns, predicted returns, and sector codes]

The figure below plots the range of non-normalized predicted returns over the pipeline's time span.

In [18]:
# this shows the range of predicted returns at each time period
pipe_output['predicted 5 day returns'].plot()
Out[18]:
[plot: predicted 5-day returns for all stocks over the pipeline period]

Because stocks whose predictions do not meet the quality threshold have no normalized-returns prediction, we need to drop the NaN rows from the dataframe.
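A minimal illustration of the effect on a hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "normalized returns": [0.5, np.nan, -0.3],
    "predicted 5 day returns": [0.01, 0.02, -0.005],
})

# Rows containing any NaN (here, the unfiltered stock) are dropped
clean = df.dropna()
```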

In [19]:
pipe_output.dropna().describe()
Out[19]:
       normalized returns  predicted 5 day returns        sector
count        3.002100e+04             30021.000000  30021.000000
mean        -4.519145e-18                 0.000633    207.040771
std          1.000017e+00                 0.034088     89.601828
min         -8.495784e+00                -1.010000     -1.000000
25%         -5.133776e-01                -0.013000    103.000000
50%         -2.140750e-02                 0.002000    206.000000
75%          4.977289e-01                 0.014000    310.000000
max          9.974787e+00                 1.249000    311.000000

Finally, the tear sheet is below.

In [20]:
assets = pipe_output.index.levels[1].unique()
pricing = get_pricing(assets, START, '2017-02-28', fields='open_price')
In [21]:
factor_data = alphalens.utils.get_clean_factor_and_forward_returns(
    pipe_output['predicted 5 day returns'],
    pricing,
    quantiles=5,
    groupby=pipe_output['sector'],
    periods=(1, 5, 10)
)
In [22]:
alphalens.tears.create_full_tear_sheet(factor_data)
Quantiles Statistics
                    min    max      mean       std  count    count %
factor_quantile
1                -1.135  0.012 -0.030385  0.036150  22519  20.961361
2                -0.080  0.022 -0.009182  0.014120  21969  20.449405
3                -0.048  0.035 -0.000307  0.011998  21492  20.005399
4                -0.027  0.150  0.008884  0.012644  20839  19.397567
5                -0.011  1.249  0.031947  0.041082  20612  19.186268

Returns Analysis
                                                    1       5       10
Ann. alpha                                      0.198   0.086   0.022
beta                                           -0.039  -0.043  -0.000
Mean Period Wise Return Top Quantile (bps)      8.166  25.111  26.954
Mean Period Wise Return Bottom Quantile (bps)  -6.513  -6.907   4.228
Mean Period Wise Spread (bps)                  14.926   6.439   2.239

Information Analysis
                   1       5       10
IC Mean        0.032   0.030   0.009
IC Std.        0.117   0.126   0.122
t-stat(IC)     4.299   3.817   1.203
p-value(IC)    0.000   0.000   0.230
IC Skew       -0.146  -0.103  -0.230
IC Kurtosis    0.823   0.379  -0.358
Ann. IR        4.290   3.809   1.201

Turnover Analysis
                              1      5      10
Quantile 1 Mean Turnover  0.361  0.713  0.785
Quantile 2 Mean Turnover  0.594  0.770  0.791
Quantile 3 Mean Turnover  0.633  0.763  0.773
Quantile 4 Mean Turnover  0.600  0.786  0.806
Quantile 5 Mean Turnover  0.363  0.714  0.781

                                      1      5       10
Mean Factor Rank Autocorrelation  0.702  0.097  -0.061
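The IC figures above are rank correlations between the factor and subsequent returns; a rough sketch of how such an information coefficient can be computed for a single date (toy data, pure NumPy):

```python
import numpy as np

def rank(x):
    # Rank transform (0 = smallest); ties ignored for this toy example
    r = np.empty_like(x, dtype=float)
    r[np.argsort(x)] = np.arange(len(x))
    return r

# Toy factor values and subsequent 5-day returns on one date
factor = np.array([0.03, -0.02, 0.01, 0.00, 0.02, -0.01])
fwd = np.array([0.012, -0.008, 0.002, -0.001, 0.009, 0.001])

# Information coefficient: Spearman rank correlation between factor and forward returns
ic = np.corrcoef(rank(factor), rank(fwd))[0, 1]
```

Alphalens computes this per date and then reports the mean, standard deviation, and t-statistic of the resulting IC series.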
[tear sheet plots omitted]