
Factor Combination Theory and Tools

This notebook provides a set of functions for exploring how combining uncorrelated, and possibly "interacting", factors can produce an enhanced signal. It also provides some tools for detecting interaction effects between factors.

Note: The following link does a good job of illustrating how to interpret factor interaction plots (although it is in a completely different context from the finance field). https://courses.washington.edu/smartpsy/interactions.htm

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.cm as cm

from scipy import stats
import alphalens as al

Factor Interaction Analysis Functions

These will be used later when analyzing the combination of factors.

In [2]:
def mean_return_by_quantile(factor_data,
                            by_date=False,
                            by_group=False,
                            demeaned=True,
                            group_adjust=False,
                            factor_groupers=['factor_quantile']):
    """
    Computes mean returns for factor quantiles across
    provided forward returns columns.
    Parameters
    ----------
    factor_data : pd.DataFrame - MultiIndex
        A MultiIndex DataFrame indexed by date (level 0) and asset (level 1),
        containing the values for a single alpha factor, forward returns for
        each period, the factor quantile/bin that factor value belongs to, and
        (optionally) the group the asset belongs to.
        - See full explanation in utils.get_clean_factor_and_forward_returns
    by_date : bool
        If True, compute quantile bucket returns separately for each date.
    by_group : bool
        If True, compute quantile bucket returns separately for each group.
    demeaned : bool
        Compute demeaned mean returns (long short portfolio)
    group_adjust : bool
        Returns demeaning will occur on the group level.
    factor_groupers: list
        list of column names (strings) for the factor quantiles to group by
    Returns
    -------
    mean_ret : pd.DataFrame
        Mean period wise returns by specified factor quantile.
    std_error_ret : pd.DataFrame
        Standard error of returns by specified quantile.
    """

    if group_adjust:
        grouper = [factor_data.index.get_level_values('date')] + ['group']
        factor_data = al.utils.demean_forward_returns(factor_data, grouper)
    elif demeaned:
        factor_data = al.utils.demean_forward_returns(factor_data)
    else:
        factor_data = factor_data.copy()

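    # Copy the list so the appends below do not mutate the caller's list
    # (or the shared default argument) across calls.
    grouper = list(factor_groupers)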
    if by_date:
        grouper.append(factor_data.index.get_level_values('date'))

    if by_group:
        grouper.append('group')

    group_stats = factor_data.groupby(grouper)[
        al.utils.get_forward_returns_columns(factor_data.columns)] \
        .agg(['mean', 'std', 'count'])

    mean_ret = group_stats.T.xs('mean', level=1).T

    std_error_ret = group_stats.T.xs('std', level=1).T \
        / np.sqrt(group_stats.T.xs('count', level=1).T)

    return mean_ret, std_error_ret

def plot_multi_factor_quantile_returns(mean_ret_by_quantile, period, ax=None):
    """
    Plots mean period wise returns for factor quantiles.
    Parameters
    ----------
    mean_ret_by_quantile : pd.DataFrame
        DataFrame with quantiles, (group) and mean period wise return values.
    period: pandas.Timedelta or string
        Length of period for which the returns are computed (e.g. 1 day)
        if 'period' is a string it must follow pandas.Timedelta constructor
        format (e.g. '1 days', '1D', '30m', '3h', '1D1h', etc)
    ax : matplotlib.Axes, optional
        Axes upon which to plot.
    Returns
    -------
    ax : matplotlib.Axes
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(6, 6))

    # Draw the heat map once, on either the new or the supplied axes.
    sns.heatmap(mean_ret_by_quantile[period].unstack(), annot=True,
                cmap=cm.coolwarm_r, ax=ax, center=0)
    ax.set(title="Mean {} Returns".format(period))

    return ax
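
As a small illustration of the standard-error calculation used in `mean_return_by_quantile`, the sketch below applies the same `agg(['mean', 'std', 'count'])` pattern to a toy frame. The column names `bucket` and `ret` and all of the values are made up for this example and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): four returns in each of two quantile buckets.
toy = pd.DataFrame({
    'bucket': [1, 1, 1, 1, 2, 2, 2, 2],
    'ret': [0.01, 0.03, -0.01, 0.01, 0.04, 0.06, 0.02, 0.04],
})

stats_by_bucket = toy.groupby('bucket')['ret'].agg(['mean', 'std', 'count'])

# Standard error of the mean = sample std / sqrt(number of observations).
std_error = stats_by_bucket['std'] / np.sqrt(stats_by_bucket['count'])

print(stats_by_bucket['mean'])
print(std_error)
```

A bucket with few observations gets a larger standard error, which is worth keeping in mind when the two-factor grid splits the data into 25 cells.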

Generate Simulated Factor Values and Returns

Let's generate some random, uncorrelated factor values, along with a return stream that is a function of the factor values and their interaction.

$ r = \beta_1f_1 + \beta_2f_2 + \beta_3f_1f_2 + \epsilon $
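
Grouping the terms that multiply $f_1$ shows why the interaction term matters. This is only a rearrangement of the model above, under the assumption that $\epsilon$ has mean zero and is independent of the factors (which holds for the simulated noise used later):

$ E[r \mid f_1, f_2] = (\beta_1 + \beta_3 f_2) f_1 + \beta_2 f_2 $

The slope of expected return with respect to $f_1$ is $\beta_1 + \beta_3 f_2$, so it depends on the level of $f_2$ whenever $\beta_3 \neq 0$. With $\beta_3 = 0$ the slope is the constant $\beta_1$ and the two factor effects simply add.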

In [3]:
def generate_factor(n_stocks, n_periods):
    """Generate random factor values for given number of stocks and periods
    
    Parameters
    -----------
    n_stocks: int
        Number of stocks in simulation
    n_periods: int
        Number of days
    
    Return
    -------
    pd.Series
        Multi-index series of factor values (index by date, then asset)
    """
    factor = np.random.normal(0, size=n_stocks * n_periods)
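    # pd.date_range builds the business-day index; passing start/periods/freq
    # to the DatetimeIndex constructor is not supported in recent pandas.
    date_idx = pd.date_range(start='2003-01-01', periods=n_periods, freq='B')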
    idx = pd.MultiIndex.from_product([date_idx, (range(n_stocks))])
    factor = pd.Series(factor, idx)
    factor.index.names=['date', 'asset']
    return factor
    
def generate_simulated_returns(factor_1, factor_2, factor_1_coef, factor_2_coef, interaction_coef):
    """Generate simulated returns as a function of the factor_1, factor_2, and factor_1*factor_2
    values.
    
    Parameters
    ----------
    factor_1, factor_2: pd.Series
        Series indexed by date, then asset containing factor values
    factor_1_coef, factor_2_coef, interaction_coef: float
        The "True" Factor loadings for the simulated return stream.
    
    Returns
    -------
    pd.Series:
        Daily return series indexed by date and then asset
    """
    ret = (factor_1_coef * factor_1) + (factor_2_coef * factor_2) + \
          (interaction_coef * factor_1*factor_2)
    noise = np.random.normal(0,0.02, size=len(factor_1))
    ret = ret + noise
    return pd.Series(ret, index=factor_1.index)
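
One way to sanity-check a return model like this is to regress the simulated returns on both factors and their product and see whether the "true" loadings come back. The sketch below is not part of the original notebook: it uses plain NumPy arrays instead of the MultiIndex series, and the seed and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable
n = 5000

# Two independent standard-normal factors, mirroring generate_factor.
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)

# "True" loadings (values chosen for this sketch).
beta_1, beta_2, beta_3 = 0.05, 0.0, 0.05
r = beta_1 * f1 + beta_2 * f2 + beta_3 * f1 * f2 + rng.normal(0, 0.02, size=n)

# Design matrix: intercept, f1, f2, and the interaction term f1 * f2.
X = np.column_stack([np.ones(n), f1, f2, f1 * f2])
coefs, _, _, _ = np.linalg.lstsq(X, r, rcond=None)

print(dict(zip(['intercept', 'f1', 'f2', 'f1*f2'], np.round(coefs, 4))))
```

If the coefficient on `f1*f2` is clearly different from zero while the one on `f2` is not, that matches the situation explored in the first example below, where factor 2 matters only through the interaction.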

The following function will be used at the end of the notebook to quickly generate additional examples.

In [4]:
def simulate_and_plot_results(n_stocks, n_periods, factor_1_coef, factor_2_coef, interaction_coef):
    """Perform Entire Simulation and Plot Results in One Step"""
    # Use the function arguments rather than the notebook-level globals.
    factor_1 = generate_factor(n_stocks, n_periods)
    factor_2 = generate_factor(n_stocks, n_periods)

    sim_returns = generate_simulated_returns(factor_1, factor_2, factor_1_coef, factor_2_coef, interaction_coef)
    
    factor_data_1 = al.utils.get_clean_factor(factor_1, pd.DataFrame({'1D': sim_returns}))
    factor_data_2 = al.utils.get_clean_factor(factor_2, pd.DataFrame({'1D': sim_returns}))
    factor_data_1.rename(columns={'factor': 'factor_1', 'factor_quantile': 'factor_1_quantile'}, inplace=True)
    factor_data_2.rename(columns={'factor': 'factor_2', 'factor_quantile': 'factor_2_quantile'}, inplace=True)
    multi_factor_data = factor_data_1.join(factor_data_2[['factor_2', 'factor_2_quantile']])
    
    mean_ret_by_q = mean_return_by_quantile(multi_factor_data, 
                                        factor_groupers=['factor_1_quantile','factor_2_quantile'])[0]
    # print() with a single argument behaves the same in Python 2 and 3.
    print("------------------------------------------------------------")
    print("Mean Return by Factor Quantile for Each Factor Individually")
    print("-----------------------------------------------------------")
    print(mean_ret_by_q.groupby(level=0).mean())
    print(mean_ret_by_q.groupby(level=1).mean())
    print("------------------------------------------------------------")
    print("Mean 1 Day Returns for Each Factor 1 and 2 Quantile Intersection")
    print("----------------------------------------------------------------")
    plot_multi_factor_quantile_returns(mean_ret_by_q, '1D', ax=None)
    mean_ret_by_q['1D'].unstack().plot(title='Factor Interaction Plot')
    plt.gca().set_ylabel('Return');
    

Set Parameters for the Simulation and Generate Data

Let's choose some parameters for our simulation and then generate the simulated factor values and stock returns. For this simulation, I am going to set the factor_2 coefficient to 0. In other words, the value of factor 2 will not be predictive of future returns. However, the interaction coefficient will have a positive loading.

In [5]:
N_STOCKS = 5
N_PERIODS = 1000
FACTOR_1_COEF = 0.05
FACTOR_2_COEF = 0 
INTERACTION_COEF = 0.05

factor_1 = generate_factor(N_STOCKS, N_PERIODS)
factor_2 = generate_factor(N_STOCKS, N_PERIODS)

sim_returns = generate_simulated_returns(factor_1, factor_2, FACTOR_1_COEF, FACTOR_2_COEF, INTERACTION_COEF)

Distribution of Simulated Returns

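This step is purely exploratory: it shows what kind of return distribution was generated. `stats.describe` reports excess (Fisher) kurtosis, and the value of about 3.4 in the output below indicates fatter tails than a normal distribution.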

In [6]:
fig, ax = plt.subplots(ncols=2)
ax[0].hist(sim_returns, bins=30);
ax[0].set(title='Distribution of Simulated Returns')
stats.probplot(sim_returns, plot=ax[1])
stats.describe(sim_returns)
Out[6]:
DescribeResult(nobs=5000, minmax=(-0.479193910090562, 0.43861417253908219), mean=-0.00073856949450228125, variance=0.0056630401883038008, skewness=-0.03691830849911114, kurtosis=3.377571351421362)
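
A plausible source of the fat tails is the product term $f_1 f_2$, since a product of two normal variables is not itself normal. The sketch below compares an additive return stream with an interacting one using the same coefficients as In [5]. The helper `excess_kurtosis` is defined here for illustration only; it is meant to follow the Fisher definition (zero for a normal distribution), and the seed and sample size are arbitrary.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (zero for a normal distribution)."""
    d = x - x.mean()
    return (d ** 4).mean() / (d ** 2).mean() ** 2 - 3.0

rng = np.random.RandomState(1)
n = 200000

f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
noise = rng.normal(0, 0.02, size=n)

additive = 0.05 * f1 + 0.05 * f2 + noise          # no interaction term
interacting = 0.05 * f1 + 0.05 * f1 * f2 + noise  # coefficients from In [5]

print(excess_kurtosis(additive), excess_kurtosis(interacting))
```

The additive stream is a sum of independent normals, so its excess kurtosis should sit near zero, while the interacting stream should show clearly positive excess kurtosis.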

Evaluate Correlation between Factor Values

Let's verify that the sample correlation between the factors is approximately zero.

In [7]:
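# Keyword x/y arguments work across seaborn versions; the title is set on the
# figure because annot_kws is not accepted by recent seaborn releases.
h = sns.jointplot(x=factor_1, y=factor_2)
h.set_axis_labels('factor_1', 'factor_2', fontsize=16)
h.fig.suptitle('Factor Correlation');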
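
The joint plot gives a visual check; a numeric check is also easy. The sketch below is an illustration on freshly drawn arrays (not the notebook's `factor_1` and `factor_2`), with an arbitrary seed. It compares the sample correlation to the rough rule that, for independent series, the sample correlation has a standard error of about $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.RandomState(2)
n = 5000  # same number of observations as 5 stocks x 1000 periods

f1 = rng.normal(size=n)
f2 = rng.normal(size=n)

corr = np.corrcoef(f1, f2)[0, 1]

# Rough yardstick: under independence the sample correlation has a
# standard error of about 1 / sqrt(n).
approx_se = 1.0 / np.sqrt(n)
print(corr, approx_se)
```

A sample correlation within a few multiples of `approx_se` of zero is consistent with uncorrelated factors; it will almost never be exactly zero.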

Plot Factor Values vs. Returns

Rather than generating alphalens output for these, a simple scatter plot with a fitted regression line is shown for each factor and for the interaction term, to indicate whether each one is predictive of future returns on its own.

In [8]:
fig, axes = plt.subplots(nrows=2, ncols=2)
for factor, ax, title in zip([factor_1, factor_2, factor_1*factor_2] ,axes.flat, 
                             ['Factor 1', 'Factor 2', 'Interaction']):
    sns.regplot(x=factor, y=sim_returns, ax=ax)
    ax.set(title=title, xlabel='Factor Value', ylabel='Return')
fig.tight_layout()

Generate Data Structure to Feed into Interaction Analysis Functions Created Above

In [9]:
factor_data_1 = al.utils.get_clean_factor(factor_1, pd.DataFrame({'1D': sim_returns}))
factor_data_2 = al.utils.get_clean_factor(factor_2, pd.DataFrame({'1D': sim_returns}))
factor_data_1.rename(columns={'factor': 'factor_1', 'factor_quantile': 'factor_1_quantile'}, inplace=True)
factor_data_2.rename(columns={'factor': 'factor_2', 'factor_quantile': 'factor_2_quantile'}, inplace=True)
multi_factor_data = factor_data_1.join(factor_data_2[['factor_2', 'factor_2_quantile']])
multi_factor_data.head()
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Out[9]:
                         1D   factor_1  factor_1_quantile   factor_2  factor_2_quantile
date       asset
2003-01-01 0       0.154096   1.479668                  5   0.941797                  4
           1       0.069903   0.993258                  3  -0.385766                  2
           2      -0.010461  -0.371420                  1   0.594015                  3
           3      -0.079000   0.995555                  4  -1.760521                  1
           4       0.024042   0.202703                  2   1.975039                  5
In [10]:
mean_ret_by_q = mean_return_by_quantile(multi_factor_data, 
                                        factor_groupers=['factor_1_quantile','factor_2_quantile'])[0]
In [11]:
print("Mean Return by Factor Quantile for Each Factor Individually")
print("-----------------------------------------------------------")
print(mean_ret_by_q.groupby(level=0).mean())
print(mean_ret_by_q.groupby(level=1).mean())
Mean Return by Factor Quantile for Each Factor Individually
-----------------------------------------------------------
                         1D
factor_1_quantile          
1                 -0.061164
2                 -0.024412
3                 -0.000660
4                  0.025489
5                  0.058594 
                         1D
factor_2_quantile          
1                 -0.000674
2                 -0.000961
3                  0.000577
4                 -0.002400
5                  0.001305
In [12]:
print("Mean 1 Day Returns for Each Factor 1 and 2 Quantile Intersection")
print("----------------------------------------------------------------")
plot_multi_factor_quantile_returns(mean_ret_by_q, '1D', ax=None)
Mean 1 Day Returns for Each Factor 1 and 2 Quantile Intersection
----------------------------------------------------------------

Factor Interaction Plot

This plot shows the same information as the heat map above in a different form. The key is to look at how the slope of each line changes as factor_2_quantile varies. Because the slope changes with factor_2_quantile, the plot suggests a "non-additive" interaction between factor 1 and factor 2. In this case, factor 2 had no predictive ability by itself, but when combined with factor 1 it can enhance the predictive power of the overall model.

In [13]:
mean_ret_by_q['1D'].unstack().plot(title='Factor Interaction Plot');
plt.ylabel('Return');
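
The "changing slope" reading of the interaction plot can also be checked numerically. The sketch below is an illustration, not the notebook's pipeline: it simulates returns with the coefficients from In [5], bins each factor into quintiles over the pooled sample (a simplification of alphalens's per-date binning), and fits one slope per factor 2 bucket. The helper name `quintile` and the seed are choices made for this example.

```python
import numpy as np

rng = np.random.RandomState(3)
n = 50000
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
r = 0.05 * f1 + 0.05 * f1 * f2 + rng.normal(0, 0.02, size=n)

def quintile(x):
    """Assign each value to a bucket 0..4 using pooled sample quintiles."""
    edges = np.quantile(x, [0.2, 0.4, 0.6, 0.8])
    return np.searchsorted(edges, x)

q1, q2 = quintile(f1), quintile(f2)

# grid[i, j] = mean return with factor 1 in bucket i and factor 2 in bucket j.
grid = np.array([[r[(q1 == i) & (q2 == j)].mean() for j in range(5)]
                 for i in range(5)])

# Slope of mean return across factor 1 buckets, one slope per factor 2 bucket.
slopes = [np.polyfit(np.arange(5), grid[:, j], 1)[0] for j in range(5)]
print(np.round(slopes, 4))
```

With a positive interaction coefficient the fitted slopes should rise from the lowest factor 2 bucket to the highest; with no interaction they would be roughly equal.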

Note: To eliminate the individual factor exposures and keep exposure only to the "interaction factor", it might make sense to go long the (Q5, Q5) and (Q1, Q1) bins while going short the (Q1, Q5) and (Q5, Q1) bins.
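
The corner portfolio described in the note can be sketched as a difference of bucket means. The function `corner_spread` below is hypothetical and written only for this illustration; it uses pooled quintile bins rather than alphalens's per-date bins, equal weights on the four corners, and an arbitrary seed.

```python
import numpy as np

def corner_spread(beta_1, beta_2, beta_3, n=50000, seed=4):
    """Mean return of long (Q5,Q5),(Q1,Q1) minus short (Q1,Q5),(Q5,Q1)."""
    rng = np.random.RandomState(seed)
    f1 = rng.normal(size=n)
    f2 = rng.normal(size=n)
    r = (beta_1 * f1 + beta_2 * f2 + beta_3 * f1 * f2
         + rng.normal(0, 0.02, size=n))

    # Pooled quintile buckets 0..4 (a simplification of per-date binning).
    q1 = np.searchsorted(np.quantile(f1, [0.2, 0.4, 0.6, 0.8]), f1)
    q2 = np.searchsorted(np.quantile(f2, [0.2, 0.4, 0.6, 0.8]), f2)

    def cell(i, j):
        return r[(q1 == i) & (q2 == j)].mean()

    long_leg = 0.5 * (cell(4, 4) + cell(0, 0))
    short_leg = 0.5 * (cell(0, 4) + cell(4, 0))
    return long_leg - short_leg

additive_spread = corner_spread(0.05, 0.05, 0.0)      # no interaction
interaction_spread = corner_spread(0.05, 0.05, 0.05)  # with interaction
print(additive_spread, interaction_spread)
```

In the additive case the linear exposures in the four corners cancel, so the spread should be close to zero; with a positive interaction coefficient the spread should be clearly positive.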

Example 2: Uncorrelated Factors with No Interaction

In [14]:
N_STOCKS = 5
N_PERIODS = 1000
FACTOR_1_COEF = 0.05
FACTOR_2_COEF = 0.05
INTERACTION_COEF = 0.


simulate_and_plot_results(N_STOCKS, N_PERIODS, FACTOR_1_COEF, FACTOR_2_COEF, INTERACTION_COEF)
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
------------------------------------------------------------
Mean Return by Factor Quantile for Each Factor Individually
-----------------------------------------------------------
                         1D
factor_1_quantile          
1                 -0.059193
2                 -0.023659
3                  0.000865
4                  0.023500
5                  0.058474 
                         1D
factor_2_quantile          
1                 -0.059858
2                 -0.025026
3                 -0.001189
4                  0.026909
5                  0.059151
------------------------------------------------------------
Mean 1 Day Returns for Each Factor 1 and 2 Quantile Intersection
----------------------------------------------------------------

Note how both factors seem to have an effect on return, but the effect is "additive". There is no change in slope when varying the factor_2_quantile.

Example 3: Both Factors Predictive with an Interaction Term

In [15]:
N_STOCKS = 5
N_PERIODS = 1000
FACTOR_1_COEF = 0.05
FACTOR_2_COEF = 0.05
INTERACTION_COEF = 0.05


simulate_and_plot_results(N_STOCKS, N_PERIODS, FACTOR_1_COEF, FACTOR_2_COEF, INTERACTION_COEF)
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
------------------------------------------------------------
Mean Return by Factor Quantile for Each Factor Individually
-----------------------------------------------------------
                         1D
factor_1_quantile          
1                 -0.056899
2                 -0.024452
3                 -0.000953
4                  0.027184
5                  0.054621 
                         1D
factor_2_quantile          
1                 -0.056111
2                 -0.025018
3                 -0.000684
4                  0.021922
5                  0.059389
------------------------------------------------------------
Mean 1 Day Returns for Each Factor 1 and 2 Quantile Intersection
----------------------------------------------------------------

The individual positive effects of both factors and the interaction effect are clearly visible.

Next Steps

  1. It may be educational to modify the factor generation function to allow for factor values that are correlated.
  2. Use an example of two real-life factors, and analyze potential combination and interaction effects.
  3. Suggest new standard charts/tools for analyzing combinations of factors that could be used in an alphalens factor combination/interaction tearsheet.
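
As a starting point for the first item, one simple way to draw two correlated factors is to mix independent normal draws. The function `generate_correlated_factors` below is a hypothetical sketch, not part of the notebook: it returns flat NumPy arrays rather than the date/asset-indexed series produced by `generate_factor`, and the parameter `rho` is the target correlation.

```python
import numpy as np

def generate_correlated_factors(n_obs, rho, seed=None):
    """Draw two standard-normal factor arrays with target correlation rho."""
    rng = np.random.RandomState(seed)
    z1 = rng.normal(size=n_obs)
    z2 = rng.normal(size=n_obs)
    # Mixing keeps unit variance for f2 and gives corr(f1, f2) = rho.
    f1 = z1
    f2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * z2
    return f1, f2

f1, f2 = generate_correlated_factors(5000, rho=0.5, seed=5)
sample_corr = np.corrcoef(f1, f2)[0, 1]
print(sample_corr)
```

The arrays could then be wrapped in a `pd.Series` with the same MultiIndex that `generate_factor` builds, so the rest of the analysis runs unchanged.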