Jonathan Larkin
September 2017
This is the second post in a series on using Machine Learning in pairs trading. Pairs trading is perhaps the earliest form of relative value quantitative trading in equities, and this series attempts to bring some modern Machine Learning tools to bear on the pairs trading investment process. In the first post of this series, I used DBSCAN clustering on latent statistical factors in price returns, the Morningstar financial_health_grade, and each company's market capitalization to find stocks with a high likelihood of being valid eligible pairs in a pairs trading strategy. In this post, I take a very different path, driven by the following question:
Is it possible to find valid eligible pairs without using any price data at all?
If we could do that, perhaps the process would be highly robust. From first principles, why do certain stocks have highly related price series (i.e., why could valid pairs exist)? I conjecture that it is because certain stocks:
- operate in the same industry or sector,
- sell similar products or services to similar customers, and
- are consequently exposed to the same underlying economic drivers.
Therefore, if we could read about and understand the business of each company and then link up companies based on this understanding, we should have a robust set of potential eligible pairs. Human analysts are good at this kind of task, but can a machine do as well, if not better? Well, this is a perfect task for Machine Learning and, specifically, for its sub-field of Natural Language Processing.
In this post, I:
- gather a text description of each company's business,
- use the scikit-learn natural language processing functionality CountVectorizer and TfidfTransformer to "read" these descriptions and extract important and novel concept features across all companies,
- use a clustering algorithm, DBSCAN, to find stocks that have similar profiles,
- use WordCloud to get some intuition on what the ML model is learning, and
- visually inspect the price series of the discovered pairs.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
study_date = '2017-09-06'
I have already gathered business profiles on thousands of stocks, which we can bring into Quantopian with the local_csv() Research function. For reference, I used the pandas_finance Python library to query the Profile tab on Yahoo Finance. The Python code to gather the profiles is located here; you don't need to run that notebook, however, as the data file itself is located here. To use this data in Q Research, drag and drop it into your data directory. You must have this file in your data directory to run this notebook.
profiles_df = local_csv('profiles_20170907.csv', symbol_column='mstr_sym')
profiles_df.index = profiles_df['mstr_sym']
profiles_df = profiles_df[['profile']]
del profiles_df.index.name
For example, let's look at KO and PEP.
print profiles_df.loc[symbols('KO')]['profile']
print profiles_df.loc[symbols('PEP')]['profile']
from quantopian.pipeline.data import Fundamentals
from quantopian.pipeline.filters import Q1500US, Q3000US
from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
universe = Q1500US()
pipe = Pipeline(
    screen=universe
)
res = run_pipeline(pipe, study_date, study_date)
res.index = res.index.droplevel(0)
res.head()
profiles_df = pd.merge(res, profiles_df, left_index=True, right_index=True)
print "We have %d stocks in the universe with profiles." % len(profiles_df)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.cluster import DBSCAN
Natural Language Processing (NLP) itself has a lot of sub-fields: named entity recognition, natural language understanding, machine translation, sentiment analysis, etc. For our purposes we are interested in what I think is the simplest sub-field: document clustering. Document clustering is used in apps like Google News to show you articles related to the one you are reading. We have many stock profiles and we want to group together those which are very similar.
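To make the goal concrete, here is a minimal sketch of document similarity on three made-up one-line "profiles". It uses scikit-learn's TfidfVectorizer and cosine_similarity as a shortcut; below we build the same machinery piece by piece with CountVectorizer and TfidfTransformer.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_profiles = [
    "The company manufactures and distributes nonalcoholic beverages worldwide.",
    "The company manufactures and markets beverages and snack foods worldwide.",
    "The company explores for and produces crude oil and natural gas."
]
# each profile becomes a tf-idf vector; similar businesses -> similar vectors
toy_vectors = TfidfVectorizer(stop_words='english').fit_transform(toy_profiles)
# the two beverage profiles should score much higher with each other
# than either does with the oil producer
print cosine_similarity(toy_vectors).round(2)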
We will use two scikit-learn classes, CountVectorizer and TfidfTransformer, to "read" the profiles. This is the so-called bag of words approach: we give the CountVectorizer a block of text, and we get back a word count matrix with the individual words on the column axis and each document along the rows. Actually, it doesn't count words; it counts ngrams. An ngram is a text fragment. For this analysis I use complete words but also bigrams and trigrams (sequences of two and three words respectively).
Let's see what CountVectorizer does on two "documents" to extract words and bigrams:
text = [
    "Quantopian inspires talented people everywhere to write investment algorithms.",
    "Select authors may license their algorithms to us and get paid based on performance."
]
vectorizer = CountVectorizer(
    analyzer='word',   # a single ngram is a "word": characters separated by spaces
    ngram_range=(1,2)  # we care about ngrams with min length 1 and max length 2
)
# Transform the text
count_mat = vectorizer.fit_transform(text)
# Let's see the counts and all the ngrams across each document
zip(count_mat.toarray()[0], count_mat.toarray()[1], vectorizer.get_feature_names())
Generally we don't care about "common" words like "the", "a", "and", "on", etc. These words are not going to indicate anything novel in the corpus. The term for these words in NLP is stop words. scikit-learn contains a bunch of built-in stop words that we can use.
vectorizer = CountVectorizer(
    analyzer='word',    # a single ngram is a "word": characters separated by spaces
    ngram_range=(1,2),  # we care about ngrams with min length 1 and max length 2
    stop_words='english'
)
# Transform the text
count_mat = vectorizer.fit_transform(text)
# Let's see the counts and all the ngrams across each document
zip(count_mat.toarray()[0], count_mat.toarray()[1], vectorizer.get_feature_names())
That looks better. Now we have extracted the words and bigrams which are significant in the corpus. The only token which appears more than once in this corpus is "algorithms". Algorithms must be important to Quantopian...
Now let's run it on the company profiles.
# extract stop words so we can append our own
vect = CountVectorizer(stop_words='english')
stop_words = list(vect.get_stop_words())
# I am adding stop words which I do not expect to uniquely help determine stock similarity
# there are probably more to add
stop_words.extend(['founded', 'firm', 'company', 'llc', 'inc', 'incorporated'])
vect = CountVectorizer(
    analyzer='word',
    ngram_range=(1,3),
    strip_accents='unicode',
    max_features=3000,  # we limit the generation of tokens to the top 3000
    stop_words=stop_words
)
X = vect.fit_transform(profiles_df['profile'])
X
# let's see some features extracted
vect.get_feature_names()[1000:1030]
The count frequency itself isn't likely to be enough to find pairs. Why? In document classification, words that appear frequently across all documents are probably not meaningful. For example, in our case, it is likely that the word "headquartered" appears frequently. Let's see.
test_word = 'headquartered'
occurrences = sum(X[:, vect.get_feature_names().index(test_word)].toarray().flatten() > 0)
total = X.shape[0]
print '%d of %d profiles contain the token "%s"!' % (occurrences, total, test_word)
plt.hist(X[:, vect.get_feature_names().index(test_word)].todense());
The word "headquartered" is in many profiles. Clearly this is not going to be a good word to cluster against. Enter TF-IDF, which stands for "Term-Frequency times Inverse Document-Frequency".
Using the TfidfTransformer, the individual token count frequency per document (the "Term Frequency") is multiplied by the weighting
$$\log \frac{1+n}{1+df(t)}+1$$
where $n$ is the total number of documents and $df(t)$ is the number of documents which contain the term $t$.
np.log((1.0 + total) / (1 + occurrences)) + 1
Hmmmmm. So for a token this common, the multiplier is close to 1 and makes little difference; the key is that every token is multiplied by this function, which overweights tokens that occur infrequently across documents. The idea is that words which occur infrequently across all documents are signposts of novelty for the documents in which they do occur. We can see the impact of this function:
plt.plot(np.log((1.0 + total) / (1 + np.linspace(1, 50))) + 1);
plt.xlabel("The number of documents containing term t");
plt.ylabel("Each term t in each document gets multiplied by this");
This should not be confused with scaling based on the counts within documents. That feature is controlled by the flag sublinear_tf. If this flag is True, then each count within a document is scaled by $1+\log(c)$, where $c$ is the count within the document. For example, if one profile said "...Energy company...energy markets...energy demand", the count of "energy" in that profile would be dampened from 3 to $1+\log(3)\approx 2.1$. We don't use this scaling in this case.
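To illustrate what we are opting out of, here is a small sketch isolating that term-frequency scaling on hypothetical counts, with the idf weighting and normalization turned off:

# with sublinear_tf=True, each nonzero count c is replaced by 1 + log(c)
raw_tf = TfidfTransformer(use_idf=False, norm=None)
sub_tf = TfidfTransformer(use_idf=False, norm=None, sublinear_tf=True)
counts = np.array([[1, 3, 10]])
print raw_tf.fit_transform(counts).toarray()  # [[ 1.  3. 10.]]
print sub_tf.fit_transform(counts).toarray()  # approx [[1.  2.1  3.3]]
print 1 + np.log([1, 3, 10])                  # same values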
One might ask, if this word is meaningless, why not just add it to the stop words? We don't want to do that because, although "headquartered" alone is not meaningful, maybe the phrase "headquartered in New York" is meaningful.
test_word = 'headquartered new york' # "in" is a stop word, so we ignore it
occurrences = sum(X[:, vect.get_feature_names().index(test_word)].toarray().ravel() > 0)
total = X.shape[0]
print '%d of %d profiles contain the token "%s"!' % (occurrences, total, test_word)
# transform the count matrix
tfidf = TfidfTransformer()
X_idf = tfidf.fit_transform(X)
X_idf
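One detail worth noting before we cluster: TfidfTransformer L2-normalizes each row by default (norm='l2'), so every profile is a unit vector and the Euclidean distance DBSCAN uses is a monotone function of cosine similarity, since $\|a-b\| = \sqrt{2(1-\cos(a,b))}$ for unit vectors. One way to read the eps=1.05 used below, then, is as requiring a cosine similarity of roughly $1 - 1.05^2/2 \approx 0.45$ between neighbors.

# each tf-idf row is a unit vector thanks to the default norm='l2'
print np.linalg.norm(X_idf[0].toarray())  # ~1.0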
As in the previous post, I use the DBSCAN clustering algorithm. This is an ideal algorithm to cluster pairs because:
- it does not require us to specify the number of clusters in advance, and
- it does not force every point into a cluster; stocks which do not clearly belong anywhere are simply labeled as outliers.

After DBSCAN, I use a second pass to only include clusters that have two stocks; i.e., we only take clusters that are tight enough to have just two stocks.
clf = DBSCAN(eps=1.05, min_samples=2)
labels = clf.fit_predict(X_idf)
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print "\nTotal custers discovered: %d" % n_clusters_
clustered = labels
clustered_series = pd.Series(index=profiles_df.index, data=clustered.flatten())
clustered_series = clustered_series[clustered_series != -1]
plt.barh(
    xrange(len(clustered_series.value_counts())),
    clustered_series.value_counts(),
    alpha=0.625
)
plt.title('Cluster Member Counts')
plt.xlabel('Stocks in Cluster');
pair_clusters = clustered_series.value_counts()[clustered_series.value_counts()<3].index.values
print pair_clusters
print "\nTotal pair clusters discovered: %d" % len(pair_clusters)
Let's look at the profile descriptions of a pair that we have discovered.
cluster = clustered_series[symbols('CDNS')]
profiles_df.iloc[clustered==cluster]
print profiles_df.iloc[clustered==cluster].iloc[0,0]
print profiles_df.iloc[clustered==cluster].iloc[1,0]
def plot_cluster(which_cluster, plot_mean=False):
    pricing = get_pricing(
        symbols=[profiles_df.iloc[clustered==which_cluster].index],
        fields='close_price',
        start_date=pd.Timestamp(study_date) - pd.DateOffset(months=24),
        end_date=pd.Timestamp(study_date)
    )
    # demean each stock's log price series so members share a common scale
    means = np.log(pricing).mean()
    data = np.log(pricing).sub(means)
    if plot_mean:
        # rolling cluster mean, shifted to avoid look-ahead bias
        means = data.mean(axis=1).rolling(window=21).mean().shift(1)
        data.sub(means, axis=0).plot()
        plt.axhline(0, lw=3, ls='--', label='mean', color='k')
        plt.legend(loc=0)
    else:
        data.plot()
plot_cluster(cluster)
That's pretty amazing to me. Just by looking at a text description of companies, 1) the ML model was able to find related companies, and 2) related companies do indeed seem to have related price series. What is the ML model learning? I generated some word clouds on the results. My code to do this is here.
The word cloud shows the text features that the ML model thinks are important; the size of a feature indicates the importance of that feature, as returned by the TfidfTransformer, across all stocks. Since we are using ngrams of up to three words, I added a "+" to join the words of multi-token features.
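For reference, a minimal sketch of how such a cloud could be built from the fitted vectorizer; this is a simplified stand-in for the linked code, assuming the wordcloud package's WordCloud class and its generate_from_frequencies method are available:

from wordcloud import WordCloud

def plot_wordcloud(which_cluster):
    # average tf-idf weight of each token across the cluster's members
    mask = clustered == which_cluster
    weights = np.asarray(X_idf[mask].mean(axis=0)).ravel()
    freqs = dict(zip(vect.get_feature_names(), weights))
    # join the words of multi-token features with '+', as in the clouds above
    freqs = dict(('+'.join(k.split()), v) for k, v in freqs.items() if v > 0)
    wc = WordCloud(background_color='white').generate_from_frequencies(freqs)
    plt.imshow(wc)
    plt.axis('off')

plot_wordcloud(cluster)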
I pulled out a sample of others which look to have very strong pair relationships as well.
cluster = clustered_series[symbols('JLL')]
plot_cluster(cluster)
cluster = clustered_series[symbols('SLGN')]
plot_cluster(cluster)
cluster = clustered_series[symbols('LNG')]
plot_cluster(cluster)
cluster = clustered_series[symbols('HRL')]
plot_cluster(cluster)
This post was initially intended only to be about pairs trading, narrowly defined. Thomas Wiecki pointed out to me, though, that the larger clusters may exhibit a tight relationship as well and thus could be used in a less restricted algorithm setting.
Above we only looked at clusters of size 2, but what about the larger ones? They could be co-integrated too. In fact, identifying a larger number of stocks that are co-integrated allows us to extend the idea of pairs trading: rather than assuming that the prices of two stocks will converge again after some time, we can instead assume that the price of any stock in the cluster will converge back to the rolling mean price of the whole cluster.
def plot_cluster_relative_to_mean(which_cluster):
    pricing = get_pricing(
        symbols=[profiles_df.iloc[clustered==which_cluster].index],
        fields='close_price',
        start_date=pd.Timestamp(study_date) - pd.DateOffset(months=24),
        end_date=pd.Timestamp(study_date)
    )
    means = np.log(pricing).mean()
    data = np.log(pricing).sub(means)
    means = data.mean(axis=1).rolling(window=21).mean().shift(1)  # shift to avoid look-ahead bias
    data.sub(means, axis=0).plot()
    plt.axhline(0, lw=3, ls='--', label='mean', color='k')
    plt.ylabel('Price divergence from group mean')
    plt.legend(loc=0)
large_clusters = clustered_series.value_counts()[clustered_series.value_counts()>=3].index.values
print large_clusters
print "\nTotal large clusters discovered: %d" % len(pair_clusters)
cluster = clustered_series[symbols('DHI')]
plot_cluster_relative_to_mean(cluster)
cluster = clustered_series[symbols('LCI')]
plot_cluster_relative_to_mean(cluster)
The search space for valid pairs in a pairs trading strategy is vast, and as researchers we can add a lot of value by intelligently reducing it. "Machine Learning plus data" has a lot to offer in this search. The majority of Machine Learning examples in finance that I see posted across the web attempt to predict future stock prices by training on past prices. This is unlikely to be successful out-of-sample: the signal-to-noise ratio in price data is very low and financial time series are non-stationary. Machine Learning can, however, make an investment process significantly smarter and faster and, as this example shows, uncover relationships embedded in unstructured data. This post also demonstrates how you can bring your own data into Quantopian Research to tackle the pairs search problem.
Thanks to Thomas Wiecki for contributing to the "Larger Clusters" section and to Max Margenot for his review.
This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. ("Quantopian"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, Quantopian, Inc. has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to Quantopian, Inc. at the time of publication. Quantopian makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.