In [1]:
import numpy as np
import statsmodels.api as sm
import scipy.stats as stats
import scipy.signal
from statsmodels import regression
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

Getting Data

We pull monthly values of research and development and total revenue for every equity in our universe.

In [2]:
start_date = '2012-01-01'
end_date = '2015-01-01'

# Get research and development and total revenue for all assets in universe.
fundamentals = init_fundamentals()

data = get_fundamentals(query(fundamentals.operation_ratios.revenue_growth, 
                              fundamentals.income_statement.research_and_development, 
                              fundamentals.income_statement.total_revenue, 
                              fundamentals.asset_classification.morningstar_sector_code)
                        .filter(fundamentals.asset_classification.morningstar_sector_code == 311), end_date, range_specifier='36m')
rnd_data = data['research_and_development'].T.dropna()
rev_data = data['total_revenue'].T.dropna()
rev_growth_data = data['revenue_growth'].T.dropna()

# Divide research and development by total revenue to get RnD as a percentage of total revenue.
rnd_to_rev_data = rnd_data / rev_data
rnd_to_rev_data = rnd_to_rev_data.dropna()
In [4]:
# Create a new factor that takes the average of revenue growth and RnD to Revenue values.
rnd_growth_data = (rev_growth_data + rnd_to_rev_data) / 2
rnd_growth_data = rnd_growth_data.dropna()
In [5]:
rnd_growth_data.head()
Out[5]:
2012-02-01 00:00:00+00:00 2012-03-01 00:00:00+00:00 2012-04-02 00:00:00+00:00 2012-05-01 00:00:00+00:00 2012-06-01 00:00:00+00:00 2012-07-02 00:00:00+00:00 2012-08-01 00:00:00+00:00 2012-09-04 00:00:00+00:00 2012-10-01 00:00:00+00:00 2012-11-01 00:00:00+00:00 ... 2014-04-01 00:00:00+00:00 2014-05-01 00:00:00+00:00 2014-06-02 00:00:00+00:00 2014-07-01 00:00:00+00:00 2014-08-01 00:00:00+00:00 2014-08-29 00:00:00+00:00 2014-10-01 00:00:00+00:00 2014-11-03 00:00:00+00:00 2014-11-26 00:00:00+00:00 2015-01-02 00:00:00+00:00
Equity(24 [AAPL]) 0.374509 0.374509 0.374509 0.305031 0.305031 0.305031 0.125418 0.125418 0.125418 0.148711 ... 0.039815 0.039004 0.039004 0.039004 -0.068563 -0.068563 -0.068563 0.082673 0.082673 0.082673
Equity(67 [ADSK]) 0.204106 0.204106 0.187486 0.187486 0.186785 0.186785 0.186785 0.147897 0.147897 0.147897 ... 0.121700 0.121700 0.148911 0.148911 0.148911 0.178353 0.178353 0.178353 0.133796 0.133796
Equity(114 [ADBE]) 0.156337 0.156337 0.093540 0.093540 0.093540 0.129929 0.129929 0.129929 0.120765 0.120765 ... 0.100904 0.100904 0.100904 0.131910 0.131910 0.131910 0.076060 0.076060 0.076060 0.133321
Equity(122 [ADI]) 0.051526 0.051526 0.051526 0.051526 0.051526 0.051526 0.051526 0.051526 0.051526 0.051526 ... 0.107292 0.107292 0.107292 0.124855 0.124855 0.135989 0.135989 0.135989 0.135989 0.195415
Equity(312 [ALOT]) 0.084339 0.084339 -0.341516 -0.341516 -0.341516 -0.341516 -0.341516 -0.109298 -0.109298 -0.109298 ... 0.088842 0.088842 0.118709 0.118709 0.118709 0.071381 0.071381 0.071381 0.071381 0.051035

5 rows × 36 columns

In [6]:
rnd_to_rev_data.head(5)
Out[6]:
2012-02-01 00:00:00+00:00 2012-03-01 00:00:00+00:00 2012-04-02 00:00:00+00:00 2012-05-01 00:00:00+00:00 2012-06-01 00:00:00+00:00 2012-07-02 00:00:00+00:00 2012-08-01 00:00:00+00:00 2012-09-04 00:00:00+00:00 2012-10-01 00:00:00+00:00 2012-11-01 00:00:00+00:00 ... 2014-04-01 00:00:00+00:00 2014-05-01 00:00:00+00:00 2014-06-02 00:00:00+00:00 2014-07-01 00:00:00+00:00 2014-08-01 00:00:00+00:00 2014-08-29 00:00:00+00:00 2014-10-01 00:00:00+00:00 2014-11-03 00:00:00+00:00 2014-11-26 00:00:00+00:00 2015-01-02 00:00:00+00:00
Equity(24 [AAPL]) 0.016360 0.016360 0.016360 0.021462 0.021462 0.021462 0.025012 0.025012 0.025012 0.025190 ... 0.023093 0.031153 0.031153 0.031153 0.042824 0.042824 0.042824 0.040026 0.040026 0.040026
Equity(67 [ADSK]) 0.257382 0.257382 0.252363 0.252363 0.259429 0.259429 0.259429 0.254792 0.254792 0.254792 ... 0.276850 0.276850 0.287764 0.287764 0.287764 0.281431 0.281431 0.281431 0.297573 0.297573
Equity(114 [ADBE]) 0.169597 0.169597 0.170039 0.170039 0.170039 0.160881 0.160881 0.160881 0.175040 0.175040 ... 0.209500 0.209500 0.209500 0.195741 0.195741 0.195741 0.210908 0.210908 0.210908 0.199088
Equity(122 [ADI]) 0.172997 0.172997 0.172997 0.172997 0.172997 0.172997 0.172997 0.172997 0.172997 0.172997 ... 0.204773 0.204773 0.204773 0.196186 0.196186 0.192504 0.192504 0.192504 0.192504 0.190111
Equity(312 [ALOT]) 0.071855 0.071855 0.207222 0.207222 0.207222 0.207222 0.207222 0.060356 0.060356 0.060356 ... 0.082046 0.082046 0.065996 0.065996 0.065996 0.066127 0.066127 0.066127 0.066127 0.067597

5 rows × 36 columns

The next step is to figure out which equities we have data for. Data sources are never perfect, and stocks go in and out of existence through mergers, acquisitions, and bankruptcies. We'll make a list of the stocks common to both of our factor data sets and then filter both down to just those stocks.

We'll also get the daily pricing of all stocks over the same time period. Finally, we'll filter down one more time to just look at stocks which have data from all three sources.

In [14]:
# There will be some equities for which we had no data, so look at the set for which we have data
common_equities = rnd_to_rev_data.index.intersection(rev_growth_data.index)

# Get the prices on only the common equities
# WARNING: The following line will take a while to run, as it is fetching a large amount of data.
prices = get_pricing(common_equities, start_date=start_date, end_date=end_date, frequency='daily')
prices = prices['price']

# Drop any that have no price data
prices = prices.T.dropna().T
common_equities = prices.T.index

# Filter the fundamental data down to only the equities that have price data
rnd_data_filtered = rnd_data.T[common_equities]
rev_data_filtered = rev_data.T[common_equities]
rnd_to_rev_data_filtered = rnd_to_rev_data.T[common_equities]
rnd_growth_data_filtered = rnd_growth_data.T[common_equities]

Now we want to compute the 30-day forward returns for each month. We do this by dividing the price on the last day of the month by the price on the first day of the month. Pandas' groupby function accomplishes this for us fairly elegantly.

In [15]:
monthly_prices = prices.groupby([prices.index.year, prices.index.month])
month_forward_returns = monthly_prices.last() / monthly_prices.first() - 1

Let's take a look to see what we have.

In [16]:
month_forward_returns.T.head(5)
Out[16]:
2012 ... 2014
1 2 3 4 5 6 7 8 9 10 ... 3 4 5 6 7 8 9 10 11 12
Equity(24 [AAPL]) 0.110387 0.188934 0.101422 -0.055982 -0.007912 0.040884 0.030446 0.096202 -0.011206 -0.097761 ... 0.017044 0.089155 0.070431 0.035328 0.022027 0.066192 -0.024685 0.088517 0.087310 -0.040671
Equity(67 [ADSK]) 0.168506 0.023237 0.113977 -0.069927 -0.213654 0.134890 -0.016529 -0.087276 0.062082 -0.046135 ... -0.048006 -0.040751 0.081371 0.073482 -0.066165 0.005624 0.030677 0.033956 0.073221 -0.007354
Equity(114 [ADBE]) 0.081790 0.053474 0.036556 -0.032853 -0.080569 0.084172 -0.040696 0.019557 0.032304 0.042932 ... -0.031167 -0.064462 0.031155 0.119604 -0.053417 0.042640 -0.040089 0.037525 0.053926 -0.014100
Equity(122 [ADI]) 0.085112 -0.003811 0.031945 -0.023063 -0.068025 0.054001 0.050249 0.010163 -0.004825 -0.009873 ... 0.050405 -0.051415 0.030703 0.032461 -0.089508 0.022196 -0.022126 0.027973 0.098954 0.015731
Equity(328 [ALTR]) 0.057120 -0.030028 0.047344 -0.097462 -0.051504 0.048327 0.050682 0.047112 -0.082861 -0.099941 ... 0.011725 -0.107084 0.032731 0.050136 -0.075314 0.077463 0.011306 -0.019960 0.082590 -0.002968

5 rows × 36 columns

Let's take a look at the RnD to Revenue data.

In [17]:
rnd_to_rev_data_filtered.head(5)
Out[17]:
Equity(24 [AAPL]) Equity(67 [ADSK]) Equity(114 [ADBE]) Equity(122 [ADI]) Equity(328 [ALTR]) Equity(337 [AMAT]) Equity(351 [AMD]) Equity(393 [AMSC]) Equity(397 [AMSW_A]) Equity(450 [CLFD]) ... Equity(40807 [NPTN]) Equity(41048 [MX]) Equity(41098 [CSOD]) Equity(41142 [SREV]) Equity(41243 [ELLI]) Equity(41491 [FSL]) Equity(41529 [WILN]) Equity(41601 [RATE]) Equity(41762 [TNGO]) Equity(42027 [UBNT])
2012-02-01 00:00:00+00:00 0.016360 0.257382 0.169597 0.172997 0.154593 0.122879 0.213609 0.349808 0.076139 0.02926 ... 0.109687 0.101208 0.184067 0.073335 0.208636 0.169205 0.139137 0.031606 0.129525 0.049383
2012-03-01 00:00:00+00:00 0.016360 0.257382 0.169597 0.172997 0.197235 0.138876 0.211709 0.349808 0.081899 0.02926 ... 0.193044 0.103149 0.113508 0.056319 0.219313 0.185587 0.104194 0.035581 0.107452 0.049383
2012-04-02 00:00:00+00:00 0.016360 0.252363 0.170039 0.172997 0.197235 0.138876 0.211709 0.349808 0.081899 0.02926 ... 0.193044 0.103149 0.113508 0.056319 0.219313 0.185587 0.104194 0.035581 0.107452 0.049383
2012-05-01 00:00:00+00:00 0.021462 0.252363 0.170039 0.172997 0.214452 0.138876 0.232177 0.349808 0.081899 0.02926 ... 0.193044 0.103149 0.113508 0.079567 0.219313 0.190526 0.104194 0.035581 0.107452 0.049383
2012-06-01 00:00:00+00:00 0.021462 0.259429 0.170039 0.172997 0.214452 0.126328 0.232177 0.207217 0.081899 0.02926 ... 0.194346 0.103149 0.113508 0.079567 0.197694 0.190526 0.104194 0.035386 0.108033 0.050390

5 rows × 367 columns

Because we're dealing with ranking systems, at several points we're going to want to rank our data. Let's check how our data looks when ranked to get a sense for this.

In [18]:
rnd_to_rev_data_filtered.rank().head(5)
Out[18]:
Equity(24 [AAPL]) Equity(67 [ADSK]) Equity(114 [ADBE]) Equity(122 [ADI]) Equity(328 [ALTR]) Equity(337 [AMAT]) Equity(351 [AMD]) Equity(393 [AMSC]) Equity(397 [AMSW_A]) Equity(450 [CLFD]) ... Equity(40807 [NPTN]) Equity(41048 [MX]) Equity(41098 [CSOD]) Equity(41142 [SREV]) Equity(41243 [ELLI]) Equity(41491 [FSL]) Equity(41529 [WILN]) Equity(41601 [RATE]) Equity(41762 [TNGO]) Equity(42027 [UBNT])
2012-02-01 00:00:00+00:00 2 9.5 7.5 6 1.0 1 18.0 34.5 3.0 25.5 ... 1 30.0 36.0 10.0 30 1.0 36 1.0 36 5.5
2012-03-01 00:00:00+00:00 2 9.5 7.5 6 5.5 10 16.5 34.5 10.5 25.5 ... 32 33.5 11.5 1.5 35 26.5 19 12.0 25 5.5
2012-04-02 00:00:00+00:00 2 4.5 10.0 6 5.5 10 16.5 34.5 10.5 25.5 ... 32 33.5 11.5 1.5 35 26.5 19 12.0 25 5.5
2012-05-01 00:00:00+00:00 8 4.5 10.0 6 26.0 10 20.0 34.5 10.5 25.5 ... 32 33.5 11.5 15.0 35 32.0 19 12.0 25 5.5
2012-06-01 00:00:00+00:00 8 12.0 10.0 6 26.0 3 20.0 22.0 10.5 25.5 ... 35 33.5 11.5 15.0 21 32.0 19 9.5 28 9.0

5 rows × 367 columns

Looking at Correlations Over Time

Now that we have the data, let's do something with it. Our first analysis will be to measure the monthly Spearman rank correlation coefficient between RnD to Revenue and month-forward returns. In other words, how predictive of 30-day forward returns is a ranking of our universe by RnD to Revenue?

In [20]:
scores = np.zeros(36)
pvalues = np.zeros(36)
for i in range(36):
    score, pvalue = stats.spearmanr(rnd_to_rev_data_filtered.iloc[i], month_forward_returns.iloc[i])
    pvalues[i] = pvalue
    scores[i] = score
    
plt.bar(range(1, 37), scores)
plt.hlines(np.mean(scores), 1, 37, colors='r', linestyles='dashed')
plt.xlabel('Month')
plt.xlim((1, 37))
plt.legend(['Monthly Rank Correlation', 'Mean Correlation over All Months'])
plt.ylabel('Rank correlation between RnD to Revenue and 30-day forward returns');

We can see that the average correlation is negative, and that the monthly correlation varies a great deal from month to month.

Let's look at the same analysis, but use the RnD growth factor we created above.

In [21]:
scores = np.zeros(36)
pvalues = np.zeros(36)
for i in range(36):
    score, pvalue = stats.spearmanr(rnd_growth_data_filtered.iloc[i], month_forward_returns.iloc[i])
    pvalues[i] = pvalue
    scores[i] = score
    
plt.bar(range(1, 37), scores)
plt.hlines(np.mean(scores), 1, 37, colors='r', linestyles='dashed')
plt.xlabel('Month')
plt.xlim((1, 37))
plt.legend(['Monthly Rank Correlation', 'Mean Correlation over All Months'])
plt.ylabel('Rank correlation between RnD Growth factor and 30-day forward returns');

Basket Returns

The next step is to compute the returns of baskets taken out of our ranking. If we rank all equities and then split them into $n$ groups, what would the mean return of each group be? We can answer this question in the following way. The first step is to create a function that gives us the mean return of each basket for a given month and ranking factor.

In [22]:
def compute_basket_returns(factor_data, forward_returns, number_of_baskets, month):
    data = pd.DataFrame(factor_data.iloc[month-1]).join(forward_returns.iloc[month-1])
    # Rank the equities on the factor values
    data.columns = ['Factor Value', 'Month Forward Returns']
    data.sort_values('Factor Value', inplace=True)
    
    # How many equities per basket (integer division; the remainder
    # is absorbed by the last basket)
    equities_per_basket = len(data.index) // number_of_baskets

    basket_returns = np.zeros(number_of_baskets)

    # Compute the returns of each basket
    for i in range(number_of_baskets):
        start = i * equities_per_basket
        if i == number_of_baskets - 1:
            # The last basket takes any extra equities when the count doesn't divide evenly
            end = len(data.index)
        else:
            end = start + equities_per_basket
        # Mean forward return of the equities in this basket
        basket_returns[i] = data.iloc[start:end]['Month Forward Returns'].mean()
        
    return basket_returns

The first thing we'll do with this function is compute the basket returns for each month and then average them over all 36 months. This should give us a sense of the relationship over a long timeframe.

In [23]:
number_of_baskets = 10
mean_basket_returns = np.zeros(number_of_baskets)
for m in range(1, 37):
    basket_returns = compute_basket_returns(rnd_to_rev_data_filtered, month_forward_returns, number_of_baskets, m)
    mean_basket_returns += basket_returns

mean_basket_returns /= 36    

# Plot the returns of each basket
plt.bar(range(number_of_baskets), mean_basket_returns)
plt.ylabel('Returns')
plt.xlabel('Basket')
plt.legend(['Returns of Each Basket']);

Spread Consistency

Of course, that's just the average relationship. To get a sense of how consistent this is, and whether or not we would want to trade on it, we should look at it over time. Here we'll look at the monthly spreads for the first year.

In [24]:
f, axarr = plt.subplots(3, 4)
for month in range(1, 13):
    basket_returns = compute_basket_returns(rnd_to_rev_data_filtered, month_forward_returns, 10, month)

    r = (month - 1) // 4
    c = (month - 1) % 4
    axarr[r, c].bar(range(number_of_baskets), basket_returns)
    axarr[r, c].xaxis.set_visible(False) # Hide the axis labels so the plots aren't super messy
    axarr[r, c].set_title('Month ' + str(month))

We'll repeat the same analysis for the RnD Growth factor.

In [25]:
number_of_baskets = 10
mean_basket_returns = np.zeros(number_of_baskets)
for m in range(1, 37):
    basket_returns = compute_basket_returns(rnd_growth_data_filtered, month_forward_returns, number_of_baskets, m)
    mean_basket_returns += basket_returns

mean_basket_returns /= 36    

# Plot the returns of each basket
plt.bar(range(number_of_baskets), mean_basket_returns)
plt.ylabel('Returns')
plt.xlabel('Basket')
plt.legend(['Returns of Each Basket']);
In [26]:
f, axarr = plt.subplots(3, 4)
for month in range(1, 13):
    basket_returns = compute_basket_returns(rnd_growth_data_filtered, month_forward_returns, 10, month)

    r = (month - 1) // 4
    c = (month - 1) % 4
    axarr[r, c].bar(range(10), basket_returns)
    axarr[r, c].xaxis.set_visible(False) # Hide the axis labels so the plots aren't super messy
    axarr[r, c].set_title('Month ' + str(month))

Sometimes Factors are Just Other Factors

Oftentimes a new factor will be discovered that seems to induce spread, but it turns out to be just a new, and potentially more complicated, way to compute a well-known factor. Suppose, for instance, that you have poured tons of resources into developing a new factor and it looks great; how do you know it's not just another factor in disguise?

To check for this, there are many analyses that can be done.

Correlation Analysis

One of the most intuitive ways is to check what the correlation of the factors is over time. We'll plot that here.

In [27]:
scores = np.zeros(36)
pvalues = np.zeros(36)
for i in range(36):
    score, pvalue = stats.spearmanr(rnd_to_rev_data_filtered.iloc[i], rnd_growth_data_filtered.iloc[i])
    pvalues[i] = pvalue
    scores[i] = score
    
plt.bar(range(1, 37), scores)
plt.hlines(np.mean(scores), 1, 37, colors='r', linestyles='dashed')
plt.xlabel('Month')
plt.xlim((1, 37))
plt.legend(['Monthly Rank Correlation', 'Mean Correlation over All Months'])
plt.ylabel('Rank correlation between RnD to Revenue and RnD Growth factor');

We'll also look at the p-values, since the correlations may not be that meaningful by themselves.

In [28]:
scores = np.zeros(36)
pvalues = np.zeros(36)
for i in range(36):
    score, pvalue = stats.spearmanr(rnd_to_rev_data_filtered.iloc[i], rnd_growth_data_filtered.iloc[i])
    pvalues[i] = pvalue
    scores[i] = score
    
plt.bar(range(1, 37), pvalues)
plt.xlabel('Month')
plt.xlim((1, 37))
plt.legend(['Monthly p-value'])
plt.ylabel('p-value of the rank correlation between the two factors');

There is interesting behavior here, and further analysis would be needed to determine whether a relationship exists.

In [29]:
rnd_dataframe = pd.DataFrame(rnd_to_rev_data_filtered.iloc[0])
rnd_dataframe.columns = ['F1']
rnd_growth_dataframe = pd.DataFrame(rnd_growth_data_filtered.iloc[0])
rnd_growth_dataframe.columns = ['F2']
returns_dataframe = pd.DataFrame(month_forward_returns.iloc[0])
returns_dataframe.columns = ['Returns']

data = rnd_dataframe.join(rnd_growth_dataframe).join(returns_dataframe)

data = data.rank(method='first')

heat = np.zeros((len(data), len(data)))

for e in data.index:
    F1 = data.loc[e]['F1']
    F2 = data.loc[e]['F2']
    R = data.loc[e]['Returns']
    heat[F1-1, F2-1] += R
    
heat = scipy.signal.decimate(heat, 40)
heat = scipy.signal.decimate(heat.T, 40).T

p = sns.heatmap(heat, xticklabels=[], yticklabels=[])
# p.xaxis.set_ticks([])
# p.yaxis.set_ticks([])
p.xaxis.set_label_text('F1 Rank')
p.yaxis.set_label_text('F2 Rank')
p.set_title('Sum Rank of Returns vs Factor Ranking');

How to Choose a Ranking System

The ranking system is the secret sauce of many strategies. Choosing a good ranking system, or factor, is not easy and is the subject of much research. We'll discuss a few starting points here.

Clone and Tweak

Choose a factor that is commonly discussed and see if you can modify it slightly to regain an edge. Oftentimes, factors that are public will have no signal left, as they have been completely arbitraged out of the market. However, they can sometimes lead you in the right direction, as in the sketch below.
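As a toy illustration (ours, not part of the original analysis), here is a minimal sketch that tweaks the public RnD to Revenue factor from above into a 12-month trend version, on the hypothesis that a change in research intensity might carry signal the raw level no longer does.

In [ ]:
# Hypothetical tweak of a public factor: use the 12-month change in
# RnD-to-revenue instead of the raw level. Assumes the monthly layout of
# rnd_to_rev_data_filtered (rows are months, columns are equities).
rnd_trend = rnd_to_rev_data_filtered - rnd_to_rev_data_filtered.shift(12)
rnd_trend = rnd_trend.dropna()
# Cross-sectional ranks for each month
rnd_trend_ranks = rnd_trend.rank(axis=1)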

Pricing Models

Any model that predicts future returns can be a factor. The predicted future return becomes the factor value, which can be used to rank your universe. In this way you can transform any complicated pricing model into a ranking.
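For concreteness, here is a minimal sketch of the mechanics. The "model" is a deliberately trivial stand-in, each equity's trailing mean daily return from the prices frame above; the point is the rank step, not the model.

In [ ]:
# Stand-in pricing model: trailing mean daily return per equity.
# Any model that outputs one predicted return per equity slots in the same way.
predicted_returns = prices.pct_change().mean()
# The predictions become the factor; rank 1 = highest predicted return
pricing_model_ranks = predicted_returns.rank(ascending=False)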

Price Based Factors (Technical Indicators)

Price based factors take information about the historical price of each equity and use it to generate the factor value. Examples include 30-day momentum and volatility measures, both sketched below.
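Both examples can be computed directly from the daily prices frame we loaded earlier; the 30-day windows are an arbitrary illustrative choice.

In [ ]:
# 30-day momentum: trailing 30-day return for each equity
momentum_30d = prices / prices.shift(30) - 1
# 30-day realized volatility: rolling standard deviation of daily returns
volatility_30d = prices.pct_change().rolling(window=30).std()
# Rank the cross-section on each factor as of the most recent day
momentum_ranks = momentum_30d.iloc[-1].rank()
volatility_ranks = volatility_30d.iloc[-1].rank()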

Reversion vs. Momentum

It's important to note that some factors bet that prices, once moving in a direction, will continue to do so. Some factors bet the opposite. Both are valid models on different time horizons and assets, and it's important to investigate whether the underlying behavior is momentum or reversion based.
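One simple diagnostic, sketched below on this notebook's monthly return data, is to rank-correlate each month's cross-section of returns with the next month's: a positive average hints at momentum over this horizon, a negative one at reversion.

In [ ]:
# Cross-sectional rank correlation between consecutive months' returns
lagged_scores = np.zeros(35)
for i in range(35):
    lagged_scores[i], _ = stats.spearmanr(month_forward_returns.iloc[i],
                                          month_forward_returns.iloc[i + 1])
np.mean(lagged_scores)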

Fundamental Factors (Value Based)

This means using combinations of fundamental values, as we discussed today. Fundamental values contain information tied to real-world facts about a company, so in many ways they can be more robust than prices.

The Arms Race

Ultimately, developing predictive factors is an arms race in which you are trying to stay one step ahead. Factors get arbitraged out of markets and have a lifespan, so it's important to constantly measure how much decay your factors are experiencing and to search for new factors to take their place.
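One way to measure that decay, sketched here with the RnD to Revenue factor from above, is to recompute the monthly rank correlation against forward returns and watch its trailing mean; a sustained drift toward zero is one warning sign.

In [ ]:
# Trailing 12-month mean of the monthly factor/forward-return rank correlation
decay_scores = np.zeros(36)
for i in range(36):
    decay_scores[i], _ = stats.spearmanr(rnd_to_rev_data_filtered.iloc[i],
                                         month_forward_returns.iloc[i])
rolling_mean_score = pd.Series(decay_scores).rolling(window=12).mean()
rolling_mean_score.plot()
plt.xlabel('Month')
plt.ylabel('Trailing 12-month mean rank correlation');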