Jonathan Larkin
August, 2017
In developing a Pairs Trading strategy, finding valid, eligible pairs which exhibit unconditional mean-reverting behavior is of critical importance. This notebook walks through an example implementation of finding eligible pairs. We show how popular algorithms from Machine Learning can help us navigate a very high-dimensional search space to find tradeable pairs.
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn import preprocessing
from statsmodels.tsa.stattools import coint
from scipy import stats
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.filters.morningstar import Q500US, Q1500US, Q3000US
from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
print "Numpy: %s " % np.__version__
print "Pandas: %s " % pd.__version__
study_date = "2016-12-31"
We start by specifying that we will constrain our search for pairs to a large and liquid single-stock universe.
universe = Q1500US()
In addition to pricing, let's use some fundamental and industry classification data. When we look for pairs (or model anything in quantitative finance), it is generally good to have an "economic prior", as this helps mitigate overfitting. I often see Quantopian users create strategies with a fixed set of pairs that they have likely chosen by some fundamental rationale ("KO and PEP should be related because..."). A purely fundamental approach is a fine way to search for pairs; however, breadth will likely be low. As discussed in The Foundation of Algo Success, you can maximize Sharpe by having high breadth (a high number of bets). With N stocks in the universe, there are N*(N-1)/2 pair-wise relationships. However, if we do a brute-force search over these, we will likely end up with many spurious results. As such, let's narrow down the search space in a reasonable way. In this study, I start with the following priors:
- Stocks that share loadings to common factors in the past should be related in the future.
- Stocks in the same industry group should be related.
- Stocks of similar size (market cap) should be related.
- Stocks of similar financial health should be related; for this we use Morningstar's financial_health_grade. A full description of this field is in the help docs.
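To get a sense of the scale involved, here is a quick back-of-the-envelope count (a sketch; the exact universe size after the filters below will differ):
# number of pair-wise relationships among N stocks
N = 1500
print "Possible pairs for N=%d: %d" % (N, N * (N - 1) / 2)
# Possible pairs for N=1500: 1124250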
pipe = Pipeline(
columns= {
'Market Cap': morningstar.valuation.market_cap.latest.quantiles(5),
'Industry': morningstar.asset_classification.morningstar_industry_group_code.latest,
'Financial Health': morningstar.asset_classification.financial_health_grade.latest
},
screen=universe
)
res = run_pipeline(pipe, study_date, study_date)
res.index = res.index.droplevel(0) # drop the single date from the multi-index
print res.shape
res.head()
# remove stocks in Industry "Conglomerates"
res = res[res['Industry']!=31055]
print res.shape
# remove stocks without a Financial Health grade
res = res[res['Financial Health'].notnull()]
print res.shape
# replace the categorical data with numerical scores per the docs
res['Financial Health'] = res['Financial Health'].astype('object')
health_dict = {u'A': 0.1,
u'B': 0.3,
u'C': 0.7,
u'D': 0.9,
u'F': 1.0}
res = res.replace({'Financial Health': health_dict})
res.describe()
We are going to work with a daily return horizon in this strategy.
pricing = get_pricing(
symbols=res.index,
fields='close_price',
start_date=pd.Timestamp(study_date) - pd.DateOffset(months=24),
end_date=pd.Timestamp(study_date)
)
pricing.shape
returns = pricing.pct_change()
returns[symbols(['AAPL'])].plot();
# we can only work with stocks that have the full return series
returns = returns.iloc[1:,:].dropna(axis=1)
print returns.shape
Given the pricing data and the fundamental and industry/sector data, we will first classify stocks into clusters and then, within clusters, look for strong mean-reverting pair relationships.
The first hypothesis above is that "Stocks that share loadings to common factors in the past should be related in the future". Common factors are things like sector/industry membership and widely known ranking schemes like momentum and value. We could specify the common factors a priori to well known factors, or alternatively, we could let the data speak for itself. In this post we take the latter approach. We use PCA to reduce the dimensionality of the returns data and extract the historical latent common factor loadings for each stock. For a nice visual introduction to what PCA is doing, take a look here (thanks to Gus Gordon for pointing out this site).
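To make "latent common factor loadings" concrete, here is a toy sketch (synthetic data, purely illustrative, separate from the fit we run below): with a returns matrix of shape (days, stocks), the rows of pca.components_ play the role of latent factors, and pca.components_.T holds one loading vector per stock.
# toy factor-model reading of PCA on synthetic returns
rng = np.random.RandomState(0)
n_days, n_stocks, k = 500, 20, 3
latent = rng.randn(n_days, k)            # unobserved common factor returns
betas = rng.randn(k, n_stocks)           # true factor exposures of each stock
toy_returns = latent.dot(betas) + 0.1 * rng.randn(n_days, n_stocks)
toy_pca = PCA(n_components=k).fit(toy_returns)
print toy_pca.components_.T.shape        # (20, 3): one k-vector of loadings per stock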
We will take these features, add in the fundamental features, and then use the DBSCAN unsupervised clustering algorithm, which is available in scikit-learn. Thanks to Thomas Wiecki for pointing me to this specific clustering technique and helping with implementation. Initially I looked at using KMeans, but DBSCAN has advantages in this use case, specifically:
- DBSCAN does not require us to choose the number of clusters in advance (KMeans does);
- DBSCAN does not cluster all stocks; it leaves out stocks which do not neatly fit into a cluster.
The clustering algorithm will give us sensible candidate pairs. We will need to do some validation in the next step.
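Both properties are easy to see on a toy example (synthetic 2d points, unrelated to our stock features; a sketch only): DBSCAN discovers the number of clusters itself and labels points that fit nowhere as -1.
# toy DBSCAN run: two dense blobs plus one far-away outlier
toy = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # blob 1
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # blob 2
    [20.0, 20.0]                          # isolated point -> noise label -1
])
print DBSCAN(eps=0.5, min_samples=2).fit(toy).labels_
# [ 0  0  0  1  1  1 -1]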
N_PRIN_COMPONENTS = 50
pca = PCA(n_components=N_PRIN_COMPONENTS)
pca.fit(returns)
pca.components_.T.shape
We now have the data reduced to each stock's loadings on the first N_PRIN_COMPONENTS principal components. Let's add some fundamental values as well to make the model more robust.
X = np.hstack(
(pca.components_.T,
res['Market Cap'][returns.columns].values[:, np.newaxis],
res['Financial Health'][returns.columns].values[:, np.newaxis])
)
print X.shape
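# standardize the features so no single dimension dominates DBSCAN's euclidean distances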
X = preprocessing.StandardScaler().fit_transform(X)
print X.shape
clf = DBSCAN(eps=1.9, min_samples=3)
print clf
clf.fit(X)
labels = clf.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print "\nClusters discovered: %d" % n_clusters_
clustered = clf.labels_
# for reference, the size of the original search space
ticker_count = len(returns.columns)
print "Total pairs possible in universe: %d " % (ticker_count*(ticker_count-1)/2)
clustered_series = pd.Series(index=returns.columns, data=clustered.flatten())
clustered_series_all = pd.Series(index=returns.columns, data=clustered.flatten())
clustered_series = clustered_series[clustered_series != -1]
CLUSTER_SIZE_LIMIT = 9999
counts = clustered_series.value_counts()
ticker_count_reduced = counts[(counts>1) & (counts<=CLUSTER_SIZE_LIMIT)]
print "Clusters formed: %d" % len(ticker_count_reduced)
print "Pairs to evaluate: %d" % (ticker_count_reduced*(ticker_count_reduced-1)).sum()
We have reduced the search space for pairs from over 1 million down to approximately 2,000.
We have found 11 clusters. The data are clustered in 52 dimensions (the 50 principal component loadings plus the two fundamental features). As an attempt to visualize what has happened in 2d, we can try T-SNE, an algorithm for visualizing very high-dimensional data in 2d, created in part by Geoff Hinton. We plot the discovered clusters to help us gain confidence that the DBSCAN output is sensible; i.e., we want to see that T-SNE and DBSCAN both find our clusters.
X_tsne = TSNE(learning_rate=1000, perplexity=25, random_state=1337).fit_transform(X)
plt.figure(1, facecolor='white')
plt.clf()
plt.axis('off')
plt.scatter(
X_tsne[(labels!=-1), 0],
X_tsne[(labels!=-1), 1],
s=100,
alpha=0.85,
c=labels[labels!=-1],
cmap=cm.Paired
)
plt.scatter(
X_tsne[(clustered_series_all==-1).values, 0],
X_tsne[(clustered_series_all==-1).values, 1],
s=100,
alpha=0.05
)
plt.title('T-SNE of all Stocks with DBSCAN Clusters Noted');
We can also see how many stocks we found in each cluster and then visualize the normalized time series of the members of a handful of the smaller clusters.
plt.barh(
xrange(len(clustered_series.value_counts())),
clustered_series.value_counts()
)
plt.title('Cluster Member Counts')
plt.xlabel('Stocks in Cluster')
plt.ylabel('Cluster Number');
To again visualize if our clustering is doing anything sensible, let's look at a few clusters (for reproducibility, keep all random state and dates the same in this notebook).
# get the number of stocks in each cluster
counts = clustered_series.value_counts()
# let's visualize some clusters
cluster_vis_list = list(counts[(counts<20) & (counts>1)].index)[::-1]
# plot a handful of the smallest clusters
for clust in cluster_vis_list[0:min(len(cluster_vis_list), 3)]:
tickers = list(clustered_series[clustered_series==clust].index)
means = np.log(pricing[tickers].mean())
data = np.log(pricing[tickers]).sub(means)
data.plot(title='Stock Time Series for Cluster %d' % clust)
We might be interested to see how a cluster looks for a particular stock. Large bank stocks share similarly strict regulatory oversight and are similarly sensitive to economic conditions and interest rates. We indeed see that our clustering has found a bank stock cluster.
which_cluster = clustered_series.loc[symbols('JPM')]
clustered_series[clustered_series == which_cluster]
tickers = list(clustered_series[clustered_series==which_cluster].index)
means = np.log(pricing[tickers].mean())
data = np.log(pricing[tickers]).sub(means)
data.plot(legend=False, title="Stock Time Series for Cluster %d" % which_cluster);
Now that we have sensible clusters of common stocks, we can validate the cointegration relationships.
def find_cointegrated_pairs(data, significance=0.05):
# This function is from https://www.quantopian.com/lectures/introduction-to-pairs-trading
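    # statsmodels' coint is the augmented Engle-Granger two-step test;
    # a low p-value rejects the null hypothesis of no cointegration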
n = data.shape[1]
score_matrix = np.zeros((n, n))
pvalue_matrix = np.ones((n, n))
keys = data.keys()
pairs = []
for i in range(n):
for j in range(i+1, n):
S1 = data[keys[i]]
S2 = data[keys[j]]
result = coint(S1, S2)
score = result[0]
pvalue = result[1]
score_matrix[i, j] = score
pvalue_matrix[i, j] = pvalue
if pvalue < significance:
pairs.append((keys[i], keys[j]))
return score_matrix, pvalue_matrix, pairs
cluster_dict = {}
for i, which_clust in enumerate(ticker_count_reduced.index):
tickers = clustered_series[clustered_series == which_clust].index
score_matrix, pvalue_matrix, pairs = find_cointegrated_pairs(
pricing[tickers]
)
cluster_dict[which_clust] = {}
cluster_dict[which_clust]['score_matrix'] = score_matrix
cluster_dict[which_clust]['pvalue_matrix'] = pvalue_matrix
cluster_dict[which_clust]['pairs'] = pairs
pairs = []
for clust in cluster_dict.keys():
pairs.extend(cluster_dict[clust]['pairs'])
pairs
print "We found %d pairs." % len(pairs)
print "In those pairs, there are %d unique tickers." % len(np.unique(pairs))
Lastly, for the pairs we found and validated, let's visualize them in 2d space with T-SNE again.
stocks = list(np.unique(pairs))
X_df = pd.DataFrame(index=returns.T.index, data=X)
in_pairs_series = clustered_series.loc[stocks]
X_pairs = X_df.loc[stocks]
X_tsne = TSNE(learning_rate=50, perplexity=3, random_state=1337).fit_transform(X_pairs)
plt.figure(1, facecolor='white')
plt.clf()
plt.axis('off')
for pair in pairs:
    # draw a line connecting the two stocks in each validated pair
    loc1 = X_pairs.index.get_loc(pair[0])
    x1, y1 = X_tsne[loc1, :]
    loc2 = X_pairs.index.get_loc(pair[1])
    x2, y2 = X_tsne[loc2, :]
    plt.plot([x1, x2], [y1, y2], '-', alpha=0.3, color='gray');
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=220, alpha=0.9, c=[in_pairs_series.values], cmap=cm.Paired)
plt.title('T-SNE Visualization of Validated Pairs');
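As a final diagnostic (a quick sketch), we can count how often each ticker appears across the validated pairs; any count above 1 flags a stock that shows up in multiple pairs.
# how many validated pairs each ticker participates in
ticker_freq = pd.Series([ticker for pair in pairs for ticker in pair]).value_counts()
print ticker_freq[ticker_freq > 1]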
We have found a nice number of pairs to use in a pairs trading strategy. Note that the number of unique stocks is less than the number of pairs; as the frequency check above makes explicit, the same stock, e.g., AEP, can appear in more than one pair. This is fine, but we will need to take some special precautions in the Portfolio Construction stage to avoid excessive concentration in any one stock. Happy hunting for pairs!
This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. ("Quantopian"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, Quantopian, Inc. has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to Quantopian, Inc. at the time of publication. Quantopian makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.