Pairs Trading with Machine Learning

One popular and successful algo type I see on Quantopian is Pairs Trading. Though this category of strategies can exhibit attractive performance characteristics, I often see community algorithms which have a very small set of eligible pairs. As in any quant strategy, the breadth of bets is proportional to the quality of returns. As such, as the creator of a pairs trading strategy, you always prefer more (valid) pairs rather than fewer. The notebook below shows a concrete example of using machine learning techniques, readily available in scikit-learn, to find pairs. The presumption here is that these pairs have a high likelihood of continuing to exhibit mean-reverting behavior out-of-sample -- a goal which we can validate in due time.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

45 responses

@Jonathan. Thank you very much for this. This is a really neat notebook to illustrate the use of ML to assist in analytics and relationship discovery. Enough of direct application of ML to market data (OHLCV). Thank you for advancing the frontier. Again.

Hi @Jonathan, this is extremely good research and I will definitely be going through each step and doing my own detailed analysis.

I have a couple of tangential questions about implementing a pairs trading strategy, or a variant of it, in a long-short portfolio for the purpose of writing an algorithm that can fit the requirements for an allocation.

a) Does the alpha from the strategy originate from the dataset (financial_health_grade)? It appeared to me that the assumption that stocks with the same financial_health_grade behave similarly is made at the beginning of the process, but the clustering appears to use the equity pricing to arrive at clusters. Would the alpha be considered to come from the dataset or from the clustering algorithm? I am asking because there has been a push recently to find alpha in the datasets that Quantopian provides, and I am not sure an economic rationale can be made that the alpha was derived from the financial_health_grade directly.

b) I also have a general question about pairs trading as a strategy for a Q fund allocation. Does pairs trading fit the Q fund allocation criteria, where we need to keep beta, sector risk, and position concentration risk within certain bounds, which is normally achieved through order_optimal_portfolio and the underlying cvxopt? If we know which pairs we are going to trade, it might not always be possible to fit in well with the underlying algorithm in order_optimal_portfolio for stock selection with controlled risk exposures.

Regardless of my Q allocation questions, definitely seems a very interesting area to explore further.

Please advise.

Hi Jonathan -

Thanks for sharing this work. One of my goals with Quantopian is to learn some ML techniques; I'll benefit from looking over your notebook in detail.

A few initial comments:

  • You do the clustering, followed by the pairs analysis. But how does this compare with finding pairs by randomly picking pairs from the entire universe and testing them? In other words, if I am willing to wait N minutes/hours/days to find M pairs, how much benefit does the clustering provide over a brute force technique? And would I expect the trade-ability to be better for pairs from the clustering technique, over brute force?
  • Given that the same stock is in more than one pair, this perhaps suggests that some kind of "basket" trading (versus "pair" trading) approach might be more effective. Qualitatively, maybe there is a multi-dimensional expanding and shrinking of a glob of stocks, versus a collection of discrete two-stock tangos?
  • I posted this example, which may be of interest.
  • Generally, I'd think that Quantopian would be ripe for generating various point-in-time universes derived via ML techniques, that would be made available to the crowd. You are kinda-sorta doing this with the Alpha Vertex PreCog data sets, but presumably have no control over the ML details. For the present example you've shared, if it turns out to be a viable technique, one could imagine a Pipeline-compatible universe of pairs, updated nightly, with the underlying code made available to users, so that there would be complete transparency.

I can address a few of these points.

  • Just looking through all pairs will yield a massive amount of multiple comparisons bias (https://www.quantopian.com/lectures/p-hacking-and-multiple-comparisons-bias), which would render the results not useful. The advantage of a clustering technique is that it lets you do a first pass that can suggest related baskets, effectively doing dimensionality reduction on the features. Within this new, smaller set you can pick likely candidates and do out-of-sample testing (see the sketch after this list).
  • Absolutely. The Johansen test (unfortunately not available in Python) checks for N-way cointegration: https://en.wikipedia.org/wiki/Johansen_test. The intuition here is that cointegration just means that some linear combination of the time series sums to a stationary time series. This stationary series is mean-reverting and can be traded as such. To get more intuition, see https://www.quantopian.com/lectures/integration-cointegration-and-stationarity. I don't know of any examples that trade baskets of cointegrated stocks, but it's certainly feasible.
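
To make both bullets above concrete, here is a small synthetic sketch (an editorial illustration, not from the lecture notebooks): it runs the pairwise cointegration test across independent random walks to show how many spurious "pairs" multiple comparisons produce, then contrasts that with a pair constructed to share a stationary linear combination.

import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.RandomState(42)
n_series, n_days = 30, 504

# 30 independent random walks: by construction, no pair is truly cointegrated.
prices = 100 + np.cumsum(rng.randn(n_days, n_series), axis=0)

spurious = 0
n_tests = n_series * (n_series - 1) // 2   # 435 pairwise comparisons
for i in range(n_series):
    for j in range(i + 1, n_series):
        if coint(prices[:, i], prices[:, j])[1] < 0.05:
            spurious += 1
# Roughly 5% of the 435 tests will look "significant" purely by chance.
print('spurious pairs: %d of %d tests' % (spurious, n_tests))

# Contrast: B is A plus stationary noise, so B - A mean-reverts and the
# pair is genuinely cointegrated.
a = 50 + np.cumsum(rng.randn(n_days))
b = a + rng.randn(n_days)
print('cointegrated pair p-value: %.4f' % coint(a, b)[1])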

@Anthony, To your point, I think there are many applications of ML which can be valuable to an investment process beyond trying to fit and predict prices/returns directly. This is just one example. Signal combination, as described in this popular series is another great application.

@Leo,

a) Does the alpha from the strategy originate from the dataset (financial_health_grade)? It appeared to me that the assumption that stocks with the same financial_health_grade behave similarly is made at the beginning of the process, but the clustering appears to use the equity pricing to arrive at clusters. Would the alpha be considered to come from the dataset or from the clustering algorithm? I am asking because there has been a push recently to find alpha in the datasets that Quantopian provides, and I am not sure an economic rationale can be made that the alpha was derived from the financial_health_grade directly.

My analysis does not purport to say anything about any alpha embedded in the financial_health_grade. There very well may be some alpha there, but this analysis does not require that. I'm simply saying that stocks with similar health grades should have similar economic reactions and stock price behavior. I use equity prices, translated to returns and then reduced to 50 PCA loadings, to do the clustering along with market cap and financial_health_grade. The feature matrix for the clustering has 52 columns: 50 are the PCA loadings, one is the financial_health_grade, and one is market cap. A completely separate analysis could be done (via alphalens) to see if the financial_health_grade is in fact an alpha factor, but, again, that's unrelated to my post.
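
For readers following along, a rough sketch of how such a 52-column feature matrix could be assembled (synthetic stand-ins for the notebook's inputs; the variable names and scales here are assumptions, not the notebook's exact code):

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

N_PRIN_COMPONENTS = 50

# Synthetic stand-ins: a (days x stocks) returns frame and per-stock
# fundamentals. Shapes mirror the discussion (504 days), scaled down to
# 200 stocks so the sketch runs quickly.
rng = np.random.RandomState(0)
n_days, n_stocks = 504, 200
tickers = ['S%03d' % i for i in range(n_stocks)]
returns = pd.DataFrame(rng.randn(n_days, n_stocks) * 0.01, columns=tickers)
fundamentals = pd.DataFrame({
    'Market Cap': rng.lognormal(22, 1, n_stocks),
    'Financial Health': rng.randint(1, 6, n_stocks),
}, index=tickers)

pca = PCA(n_components=N_PRIN_COMPONENTS)
pca.fit(returns)

# pca.components_ has shape (50, n_stocks); its transpose gives one row per
# stock -- that stock's loadings ("betas") on the 50 statistical factors.
X = np.hstack((
    pca.components_.T,
    fundamentals['Market Cap'][returns.columns].values[:, np.newaxis],
    fundamentals['Financial Health'][returns.columns].values[:, np.newaxis],
))

# Standardize so market cap (in dollars) doesn't swamp the distance metric.
X = StandardScaler().fit_transform(X)
print(X.shape)  # (200, 52): 50 PCA loadings + 2 fundamental columns per stock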

b) I also have a general question about pairs trading as a strategy for a Q fund allocation. Does pairs trading fit the Q fund allocation criteria, where we need to keep beta, sector risk, and position concentration risk within certain bounds, which is normally achieved through order_optimal_portfolio and the underlying cvxopt? If we know which pairs we are going to trade, it might not always be possible to fit in well with the underlying algorithm in order_optimal_portfolio for stock selection with controlled risk exposures.

Yes! A pairs algo is very well suited to a Q fund allocation. One feature of pairs algos which is very nice is that the long and short sides are naturally balanced with self-similar assets. Thus a pairs book, often without any additional massaging, is naturally beta and sector neutral. The challenge with pairs algos is often coming up with enough valid pairs and, relatedly, achieving consistent capital usage. There is a sample algo here which uses order_optimal_portfolio within a pairs algo.

@Grant,

I concur with Delaney's response. Regarding the Johansen test, I noticed this lonely post which points to a Python implementation you can copy/paste. I have not tried it.

@Jonathan, thanks for explaining in detail and providing useful pointers, including the order_optimal_portfolio changes that are needed to go along with a pairs trading strategy. I see now how the pairs trading strategy itself can reduce the risk exposures and why we don't need order_optimal_portfolio to do that for us in this strategy.

I had stayed away from what would appear to be pure price-based strategies because, after reading some forum posts, I had developed the misconception that mean-reversion-type algorithms were not being favored for Q allocations. Thanks for clarifying the situation.

This is a very interesting topic to explore and I will definitely be spending some time in this area soon, expanding upon your excellent research work.

Regards,
Leo

Jonathan,

I don't get the PCA decomposition that you use:

Your code:
N_PRIN_COMPONENTS = 50
pca = PCA(n_components=N_PRIN_COMPONENTS)
pca.fit(returns)

Then, you use:
pca.components_.T as the new data.

I would have done:
X = np.hstack(
    (PCA(n_components=N_PRIN_COMPONENTS).fit_transform(returns.T),
     res['Market Cap'][returns.columns].values[:, np.newaxis],
     res['Financial Health'][returns.columns].values[:, np.newaxis])
)

Essentially, you have 504 data samples with some 1400 features that you want to reduce to 50, i.e., 504 samples of 50 features. The fit_transform finds the new axes and transforms the data. You just use the axes.

Thanks,

Luc

Regarding the multiple comparisons bias, I don't yet have an intuitive feel for it. Say I have a box of 1000 toothpicks and I want to find pairs that have nearly the same weight and length, to within specified tolerances. I have a weighing scale and a measurement microscope. I roll up my sleeves and start making random pairwise comparisons. After 1000*999/2 = 499,500 comparisons, I'm done, and have a list of pairs of toothpicks (I forgot to mention, prior to my evaluation, I had each toothpick laser-engraved with a serial number). So where does the bias come in?

I guess the idea here is that if I have some additional information about the toothpicks, aside from weight and length, then I might be able to use it to improve my assessment. For example, if the toothpicks are colored (for the holidays), red and green, then I might hypothesize that the red toothpicks were made in a red toothpick factory, and the green ones in a separate factory. In this case, clustering by color might help improve my assessment, and reduce the number of spurious pairs. But then, if the toothpicks are all made in the same factory, and the coloring process does not affect the weight and/or length differently by color, then I could be claiming I'd discovered an improved technique for toothpick pairing, but done nothing substantive.

For Jonathan's analysis, I'm thinking that the hypothesis is something like "All of this fancy analysis works, and reduces the number of spurious cointegrated pairs, over a brute-force approach." So, it would seem that one would actually need to do the brute-force analysis, and then figure out how to compare it to the proposed technique, to determine if it is beneficial. It may be "the bomb" but to me the advantage is unclear (other than perhaps reducing the problem down to being more computationally tractable with the resources available).

A pairs algo is very well suited to a Q fund allocation.

Intuitively, like other well-known quant techniques using readily available data (e.g. price mean reversion), I'd think that there would be pretty slim pickings out there. My assumption is that for decades now, hedge funds have had big honkin' computers churning away at the problem, squeezing out the alpha from pairs trading. I don't have any industry experience, but it doesn't seem like Quantopian would have any edge in this area. That said, perhaps as an incremental alpha factor in the multi-factor grand scheme described in Jonathan's blog post, one could roll it into the mix, just for yucks. Would it be feasible to formulate pairs trading as a general pipeline alpha factor?

As a general comment, I'd concur that the messaging on what is most likely to get an allocation is muddled. On the one hand, we have the directive to use "alternate data sets" and on the other, guidance that pairs trading, potentially only using company fundamentals and price-volume data, would be attractive. And also guidance that multi-factor algos are desirable (but perhaps only if all of the factors are based on alternate data sets?). Personally, I don't want to spend the next 6 months developing a pairs trading algo (or alpha factor, if that is feasible), and then another 6 months paper trading it, only to hear that my odds of getting an allocation are slim due to the strategic intent not matching the requirements (this is the message I think I got regarding price mean reversion strategies and possibly even price mean reversion alpha factors...I'm not sure). In some sense, an advantage Quantopian has over traditional hedge funds is that they could ignore the strategic intent altogether, and just base their assessments and allocations on the black-box algo performance. This would eliminate the risk of herd mentality or personal biases on the part of the fund team. Just go with the data.

The notebook is great. I think everyone understands that the sample algo (lecture 24) referenced in passing was merely provided as a starting point for pairs trading back then. I'll just temper it a little bit; not to over-dissect, but since I took a look, here are a couple of things to know for those who might travel that path. If cloning it, the end date is automatically adjusted forward to today, and since it originated quite a while ago, 3 of its 4 stocks have delisted, so these comments apply only when running to the end of 2015, before the delists (the first was in Feb 2016). It does hit a leverage of 1.36 at one point, and as always that is something for us to keep an eye on. Its FixedSlippage results in no partial fills, which can be useful for testing sometimes; from Help: naive use of fixed slippage models will lead to unrealistic fills, particularly with large orders and/or illiquid securities. Using the same FixedSlippage, returns are nearly identical at 1M vs 10M, although capital usage becomes just 9%. The slippage line can be commented out for default slippage, and then at 1M there are an average of [integer edited later because there may have been a bug in the count at this time] minutes of partial fills per stock; the effect takes returns from +29 to -16 and lowers max leverage to .52. Both legs of the ABGB and FSLR pair took losses. Not a big deal, since it is just a starting point to build upon; there's also the other, more basic pairs-trading algo (lecture 23) in the lecture series worth knowing about.

@luc,

Essentially, you have 504 data samples with some 1400 features that you want to reduce to 50, i.e., 504 samples of 50 features. The fit_transform finds the new axes and transforms the data. You just use the axes.

It is intentional that I don't do the transform of the data. The financial interpretation is that the pca.components_.T are per stock betas to hidden statistical factors. When you go through with fit_transform(...) you get the time series of each hidden factor. I don't want those. I simply want to cluster on the betas.
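
To make the distinction concrete, a small synthetic sketch (not notebook code): with a (days x stocks) returns matrix, pca.components_.T gives one row of factor betas per stock, while fit_transform gives the daily time series of the hidden factors themselves.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
n_days, n_stocks, n_factors = 504, 200, 50
returns = rng.randn(n_days, n_stocks) * 0.01   # rows = days, columns = stocks

pca = PCA(n_components=n_factors)
factor_series = pca.fit_transform(returns)     # (504, 50): daily values of each hidden factor
betas = pca.components_.T                      # (200, 50): each stock's exposure to each factor

print(factor_series.shape, betas.shape)
# The notebook clusters on `betas` (plus the fundamentals); `factor_series`
# is the hidden factors' time series, which Jonathan notes he does not want.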

@Grant, Olive Coyote,

I don't agree that this work is for the sole benefit of computation time. The key difference between finding pairs and your toothpick analogy is that the result of your toothpick search will be the ground truth but this is not so in statistical work. The shortest toothpick today doesn't stop being the shortest toothpick tomorrow. A brute force pairs search will surely come up with many spurious results. To protect against data mining you should condition your search.

@Grant,

I posted this example, which may be of interest.

That example is based on an example in the scikit-learn docs here. If you think about this work as a three step process: 1) define features, 2) cluster, 3) visualize, I chose 1) PCA + financial_health_grade + market cap, 2) DBSCAN, 3) T-SNE; the example cited chose 1) covariance.GraphLassoCV on intraday price changes, 2) Affinity Propagation, and 3) Locally Linear Embedding.

I had reproduced this in Quantopian Research awhile back. We don't expose some of the fancy matplotlib classes and methods he uses, so my actual visualization code is a little different.

It's fun to see how many different ways one can cluster stocks.
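
For reference, a condensed sketch along the lines of the scikit-learn example described above (synthetic data standing in for the intraday price changes; GraphLassoCV was the class name in scikit-learn releases of that era, later renamed GraphicalLassoCV):

import numpy as np
from sklearn import cluster, covariance, manifold

rng = np.random.RandomState(0)
n_days, n_stocks = 252, 60
# Synthetic open-to-close variations with a few latent group factors baked in,
# standing in for real intraday price changes.
groups = rng.randint(0, 5, n_stocks)
group_factors = rng.randn(n_days, 5)
variation = group_factors[:, groups] + 0.5 * rng.randn(n_days, n_stocks)

# 1) Features: sparse inverse covariance (graph lasso) fit on standardized series
X = variation / variation.std(axis=0)
edge_model = covariance.GraphLassoCV()
edge_model.fit(X)

# 2) Cluster: affinity propagation on the learned covariance structure
_, labels = cluster.affinity_propagation(edge_model.covariance_)
print('clusters found:', labels.max() + 1)

# 3) Visualize: embed the stocks in 2-D for plotting
embedding = manifold.LocallyLinearEmbedding(
    n_components=2, eigen_solver='dense', n_neighbors=6).fit_transform(X.T)
print(embedding.shape)  # (n_stocks, 2)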

@ Jonathan -

Thanks for the additional example.

For the toothpicks, there is a measurement error of the weight and length, so just picking at random will result in some pairings that aren't correct. But if clustering by color results in more uniformity in each cluster, then it should help. For example, if the differences were slight and were not well-resolved by the measurements, then the color factor could provide a boost. At least that's my intuition, but I need to sleep on it, and read a bit more about multiple comparisons bias (if that is actually what is at play here).

I'm not saying that the clustering, etc. doesn't help for stocks, but just that if you don't actually compare it to brute force, then there is no way to tell. So if you are trying to show the benefit over brute force, you'll need to do a bit more work.

After reading this notebook, I needed only one experiment to ascertain whether there was something in this clustering of correlated stocks. Since the notebook is looking at one stock universe from its leet (1337) point of view, I opted to simply change its random seed, without changing anything else. This way, all that is changed is this peephole into the past.

Here are the outcomes of the four tests (the attached charts are not reproduced here):

The clustering, which should have been only slightly different, is all over the place. The point being that out of the gazillion possibilities, gazillions were possible. Therefore, whatever trading strategy might be constructed from any of those perceptions might not carry over that well going forward.

My way of saying: I do not see anything there. Or am I missing something?

@Guy,

T-SNE is a visualization step; it does not affect the formation of the clusters. T-SNE is meant to visually validate that the clusters are distinct. To me, your experiment validates the soundness of these clusters, not the other way around. With the various random state parameters you chose, in all cases, the clusters remain intact and clearly separable in space. The individual cluster shapes are in many cases even the same (just rotated in space). The location of the clusters in the plane is not important. The random state parameter serves to simply inform the artistic rendering of the plot.
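
A small synthetic check of that separation between clustering and visualization (an editorial sketch, not the notebook's code): DBSCAN's labels do not change across runs, while T-SNE's 2-D coordinates move with its random_state.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
# Three tight blobs plus background noise, standing in for the feature matrix X.
blobs = np.vstack([rng.randn(20, 5) * 0.1 + c for c in (0, 5, 10)])
noise = rng.uniform(-2, 12, size=(40, 5))
X = np.vstack([blobs, noise])

labels_a = DBSCAN(eps=1.9, min_samples=3).fit_predict(X)
labels_b = DBSCAN(eps=1.9, min_samples=3).fit_predict(X)
print('identical cluster labels:', np.array_equal(labels_a, labels_b))  # True

# The T-SNE coordinates (used only for plotting) do move with the seed ...
emb_1337 = TSNE(n_components=2, random_state=1337).fit_transform(X)
emb_42 = TSNE(n_components=2, random_state=42).fit_transform(X)
print('embeddings differ:', not np.allclose(emb_1337, emb_42))  # True
# ... but the cluster membership they are colored by does not.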

@Grant

I think you bring up a great example of one of the fundamental differences between systems studied in engineering and systems studied in finance. Generally, in engineering the past is representative of the future. A toothpick's weight will not change over time (by much, at least), and its other properties are also the same over time. We refer to this more generally as stationarity. In engineering, the systems studied are similar and the laws of physics hold over time. A bridge built one way that stands up now will likely stand up tomorrow. The opposite is true in finance and many other more volatile and less stationary systems.

The implicit assumption in many engineering approaches is that the past implies the future. In finance that assumption is not a given and must in itself be checked.

Consider what happens if the weights of the toothpicks were changing over time and drawn from some unknown stochastic process. This is effectively the case here. When non-stationarity exists, you cannot just assume the past implies the future. You need to understand more of the system and effectively model what is going on. At the end of the day you do want to produce stationarity in your residuals (model error), as this means you have explained any non-stationarity in the system. If you just searched through your data from yesterday's toothpicks, there may be no relationship to tomorrow's toothpicks.

What the clustering is doing is searching for toothpicks that have properties which we believe are more likely to make them similar, effectively the driving factors of the relationship. As far as multiple comparisons, I recommend you actually go through the lecture and get your hands dirty. There are real experiments in there that you can muck with and see for yourself if you don't believe me.

Hi Delaney -

I know about stationarity. In manufacturing, it is a key concept in statistical process control (e.g. see the Shewhart control chart described on the nice NIST site). In trading, Bollinger bands are basically control charts with a bet on stationarity.

As far as I can tell, the multiple comparisons problem has nothing to do with stationarity. It is a general pitfall associated with making repeated statistical tests and finding an increasing number of false positives with the number of tests. In the case of toothpick weight, suppose my question is "Do I have a bunch of toothpicks in excess of 101 g?" and in my box of 1000 toothpicks, there are none above 100 g (but a lot right at 100 g). If I use a cheapo "noisy" weight scale, the probability of answering incorrectly goes up, with the number of toothpicks I measure (asymptotically approaching an answer of "yes" when the correct answer is "no"). The solution to the problem is to get a better weight scale. In the case of stock pairs, if the question is "Do I have a bunch of stock pairs that I can trade profitably?" and I search brute force with a marginal technique, then I'll answer "yes" (and lose money) when the correct answer is "no" (and not lose money). The brute force approach for finding stock pairs would work if one had a better test.

The issue I see here is that the analysis needs to test for both cointegration and its stationarity. The latter is what really matters for trading, so to show that clustering is an improvement over brute-force pair selection, one would need to look at the persistence of the cointegration of each pair versus time. If I'm thinking about this correctly, one really needs to find techniques that spit out pairs with persistent cointegration. Presumably clustering helps, but it is still an open question.

One suggestion is to consider if using minutely data would help. In science/engineering, one common technique to improve the signal to noise ratio (SNR) is to increase the sampling rate, smooth, and then down-sample (for data storage, computations, etc). For example, rather than using daily closing prices, estimate the daily VWAP from the minute bars, and use it instead (unless there is something special about the last trade price of the day). The "error bars" will be much smaller.
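
For what it's worth, a pandas sketch of that idea (synthetic minute bars; the column names and sizes are illustrative assumptions, not a Quantopian API):

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Synthetic minute bars for one stock: 5 trading days x 390 minutes, starting 09:31.
days = pd.bdate_range('2017-01-02', periods=5)
idx = pd.DatetimeIndex([d + pd.Timedelta(minutes=9 * 60 + 31 + m)
                        for d in days for m in range(390)])
minute = pd.DataFrame({
    'price': 100 + np.cumsum(rng.randn(len(idx)) * 0.02),
    'volume': rng.randint(1000, 10000, len(idx)),
}, index=idx)

# Daily VWAP = sum(price * volume) / sum(volume) over each day's minutes --
# a smoother daily series than the last trade of the day.
daily_vwap = (
    (minute['price'] * minute['volume']).resample('D').sum()
    / minute['volume'].resample('D').sum()
).dropna()
print(daily_vwap)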

Also, analyses can take into account the error bars. For example, for x-y data that have unequal error bars, one can do a straight-line fit, applying less weight to points with larger error bars. Is there a way to account for error bars when searching for pairs?

@Jonathan, even if I can see the artistry of the cluster separation, it does not mean I can extract worthwhile profits when they move from quadrant to quadrant shifting their center of mass, their axis, and size. It is like hitting a moving target that can change shape and direction from second to second, almost like shooting in the dark. An illusion: oops, it was there, and now it's not. Where did it go? Oops, it moved again.

To me, you are more providing a "do not do this" for your trading strategy, since overall profit might be more than elusive. For instance, what kind of pair trading strategy could even extract a reasonable profit from cluster 4? I do not see how anyone could ever extract more than 1% over those 2 years, no matter the nature of his/her pair trading strategy.

There might be something to do playing cluster against cluster, but they too are moving all over the place. They positively correlate one second, and the reverse the next. The image I get from peeping at that past data is more than just a blur.

All I see from cluster 4 is this: take any one of those 3 stocks; they were about equivalent in the past. If someone forced the inclusion of one of those 3 stocks into another kind of simulation, they might not see much of a difference whichever stock was used.

Has anyone on Quantopian posted a cloneable pair trading strategy using these clustering methods? Let me see how productive they would have been, or could be.

Hi Grant,

You said something which is exactly right: "one would need to look at the persistence of the cointegration of each pair versus time. If I'm thinking about this correctly, one really needs to find techniques that spit out pairs with persistent cointegration." That's the core of this: we're looking for cointegration that persists.

The noisy scale analogy works as well; just keep in mind that it can not only incorrectly categorize 99g picks as 101g, but it could also categorize 102g picks as 97g. Getting a more accurate scale/test is absolutely the solution. That is the subject of a lot of research in statistics, and generally the only real way is to increase sample size and check underlying assumptions. Otherwise I think you're on the right track here. The other thing about clustering is just that we're hoping it will increase your likelihood of finding pairs. If you brute forced it you might end up with way too many pair candidates to wade through. There are many ways to find pairs; this is just one potential technique to be investigated.

Thanks Delaney -

I'm trying to picture what this whole thing might look like, using the Q research platform, to sort out if it at all makes sense to then pursue an algo. It seems like we need some sort of rolling analysis, going back as far as we can (e.g. 2002). I guess every day, one would find a new set of clusters, and then find the pairs in those clusters? And then find pairs that persist long enough to make money on them? And then sort out a model that would reliably identify pairs that will persist long enough to make money on them (since on any given day, we'd want to know how to do a drop/add of our universe of pairs of stocks)?

I'm also wondering how to deal with the fact that we don't really have discrete pairs, but baskets of stocks (since some stocks form pairs with more than a single other stock)?

And I guess we'd want to have the pairs trading strategy as an alpha factor in a multi-factor algo, not as a single-factor algo, right?

The whole thing feels kinda gnarly at this point...whew! Or am I making it harder than it needs to be?

What's the game plan on your end? Are you going to work this through to the end, as a public domain effort? Or are you looking for the crowd to carry it through?

Using the white/grey/black edge lingo of the book "Black Edge," pairs trading of this sort would be firmly in the white edge category. The information is readily accessible (although not necessarily free) and the analysis tools (both software and hardware) are commercial off-the-shelf (again, not necessarily free, but we aren't talking about nation-state class supercomputers coded by a team of wizards here). For a universe of stocks comparable to the Q1500US, it would be interesting to see what kind of "alpha decay" pairs trading has undergone since, say, 1990 (or even back to 2002, using the Q data). If the trend is asymptotically approaching zero alpha, then it might be wise to move on to something else.

Another angle is that although reportedly the hedge fund world is highly secretive, I suspect not so much. I gather that Quantopian is now pretty well plugged into the hedge fund scene. What's the word on the street regarding the potential viability of equity pairs trading, using Quantopian in its present form? I don't work in the field, and so I have no idea if this is more of an illustration, a pure academic exercise, or a short-putt to a Q fund allocation (I sorta doubt the latter, since if it were, then it wouldn't be put in the public domain). I have no idea if this is akin to "Hey, check this out: we have a new-fangled contraption called the automobile powered by an internal combustion engine. Just pour in some gasoline and away she goes!" More background and justification would be helpful. I'm happy to donate some time to Quantopian, but it is a lot more fun if, intuitively, I have a sense that I'm vectored in the right direction.

On a separate note, I'd be careful not to dismiss the brute force approach out of hand due to the computational effort. For the Q1500US, there are 1,124,250 pairs to analyze. On the surface, it doesn't sound so bad, since the computation can be parallelized to generate the base list of pairs versus time. There are about 2x10^6 pixels in a 1080p HDTV, so it is like one HDTV frame per trading day--back-of-the-envelope, data storage and transmission should be no problemo. What am I missing? I guess where I'm heading is that Q could put out a database of all the pairs, back to 2002, and then users could try to understand how to filter out the spurious pairs (the implication of Jonathan's post is that they aren't all spurious; it is just a matter of developing a model to find the trade-able pairs point-in-time).

Here's a potentially relevant paper:

https://www.diva-portal.org/smash/get/diva2:818972/FULLTEXT01.pdf

Grant,

In general you need a hypothesis to start off with as to why two time-series would affect one another; otherwise you select pairs which have no relationship at all (Delaney's link to lecture 16 is effective).

For example, if I give you the following data for a time-series...

9
6.5
6.7
6.9
13.5
16.4
20.3
27
24.3
19.7
17
10.4
10.2
10
10.6
14.1
16.1
18
22.1
25.8
21.7
21.5
17.6
12.5
9.2
8.8
8
11.6

...how confident are you that the trend will repeat?

(The above is a real time-series and has a real answer.)

Well, my hypothesis could be that the Q1500US has no spurious pairs; any pair found by brute force will be profitable. Personally, I don't know if the hypothesis is true or false (but it should be easy to show; perhaps Delaney shows this in his notebook, if I can wade through all of the Python and stats). My point is that one way to go at the problem is to pre-compute all of the pairs, and then try to sort out a means to filter out the spurious ones. In some sense, that's what Jonathan is doing. Any pairs not within clusters would be filtered out. There is a multitude of filters that could be applied, so having the entire universe of point-in-time candidate pairs to filter on could be useful (it would seem to be a nice fit for Pipeline). Then testing simple hypotheses would be easy (e.g. "Filtering out all pairs with tickers not starting with the same letter of the alphabet will result in a profitable portfolio of remaining pairs").

Regarding the time series you shared, I'll bet $5 US it will repeat. How long do I have to wait for the answer?

"Horses are highly social herd animals that prefer to live in a group. There also is a linear dominance hierarchy in any herd. They will establish a "pecking order" for the purpose of determining which herd member directs the behavior of others, eats and drinks first, and so on."

Ref.: https://en.wikipedia.org/wiki/Horse_behavior

Grant,

So before knowing any more, you were willing to bet $5.

In fact the time-series is the monthly average temperature recorded in London in degrees C, starting from December 2012.

With that new knowledge, how much would you bet now? A lot more I would assume. Of course this scenario is academic but the point remains: we want a structural reason for a pattern to continue because, without it, you can invest very little.

A fun example would be to take a year of temperatures following that period and do some stats. What if I gave you a p-value of 0.06 that the following year's data came from the same population? Would you assume the pattern had ended?

The set of pairs which satisfies the statistical mumbo jumbo (out of your total of 1,124,250 pairs) will definitely contain spurious results. Your argument is then to filter them down more...

...one option is to take each pair and decide whether there is any reason why two entities should be so related. That would take a while.

...another option is to wait 2 years, and use the out-of-sample data to "filter your filter". Some of those which had real relationships will be filtered out (false negatives) and some of those with no relationship will continue (false positives). That's the same problem you had before.

@Grant, I agree with your point of view. Let me clarify mine.

Let's take 1,000 Q members running Jonathan's notebook. They all either replace the leet (1337) seed with none (no random seed), or each take a different seed. This would make the pinhole from which they look at this Q1500US ball of variance unique. The outcome would be related to the very microsecond they launch the pairing clustering. What would be the answer?

Every one of them would get a different pairing chart. They might get some look-alike clusters, which is cute, but these clusters would appear in different quadrants, having positive and negative inter-correlations, spread more or less around their erratic center of mass.

Technically, you would have 1,000 charts like the ones I've posted. Some in the group, if they apply some pair trading strategy to their found clusters, might win something, not much, but still win. However, we would have no way of saying: members 13, 37, and 42 will be the best, or even worse, that they will win at all.

And if those 1,000 members did the test again, they each would get a different answer. What would their trading strategies do with that? Designing a trading strategy using clustered pairing that could be profitable would be a daunting task. However, anyone succeeding over some 100 simulations, meaning profiting, might have a robust system even if profits are low.

The tool, nonetheless, might be useful in detecting price movement similarities. A way of finding substitute replacement candidates for your portfolio, if you ever needed some. But then again, why select another poorly behaving stock to replace the one you had if you could go right out and simply select a better one, with maybe no relation to the one you had or its cluster?

The cluster pairing might have its use for someone wishing for stable returns, even if below market averages. I think I will be looking for greener pastures.

I found Jonathan's notebook very interesting.

Another random thought: one could either do the fancy ML as Jonathan has shown and then find the pairs, or first trim down the Q1500US to contain only stocks that belong to at least one pair (or only one pair?), and then run the ML and pair identification per Jonathan's recipe.

It seems like this would give the ML a better shot since irrelevant stocks would not be adding noise/confusion.

First, @Jonathan, thanks for the instructive notebook. I appreciate the effort and know that it always takes a few long days to complete something like what you did!

Second, I've been trying to learn and apply some stat arb that uses pairs or copula trading, and came upon this interesting final report from a recent class project. Any comments about how one would apply the clustering used in this post, using ideas in the linked-to paper below, would be appreciated.

https://web.stanford.edu/class/msande448/2017/Final/Presentations/gr6.pdf
https://web.stanford.edu/class/msande448/2017/Final/Reports/gr6.pdf

Third, there were some Q forum posts relating to cointegration and pairs trading, both notebook and algo, in the past that would seem useful to create an algo. I'm looking at them for ideas on how to get pairs that are in a productive mean-reversion state. Any help here appreciated!

https://www.quantopian.com/posts/pipeline-pair-cointegration-notebook-3700-cointegrated-pairs

Personally, I have been having trouble with time-outs with respect to the limits Q sets when trying to compute factors based on the above concepts. I end up having to put timers all over the place and adjust the problem-size parameters based on how much time is allotted for that run.

Great post @Jonathan. How did you come up with the number of 1.9 in DBSCAN(eps=1.9, min_samples=3)? Eps is an important parameter in DBSCAN and can influence the number of pairs. Was it a random guess? I see it creates the maximum number of pairs at 1.9.

I have modified Jonathan's notebook to use monthly returns of stocks as clustering features instead of a PCA decomposition of the daily returns. My assumption is that co-integrated stocks move in a similar fashion over longer periods of time. So instead of reducing 504 daily returns to 50 components, I directly use 24 months of returns as features. The results seem correct to me.

Comments please.
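
A rough sketch of that substitution (synthetic daily prices; ordinary pandas resampling, not the modified notebook's exact code): compound daily returns into 24 monthly returns per stock and use those directly as the clustering features.

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n_stocks = 100
dates = pd.bdate_range('2015-01-01', '2016-12-30')
tickers = ['S%03d' % i for i in range(n_stocks)]

# Synthetic daily returns, roughly two years (about 504 trading days).
daily_returns = pd.DataFrame(rng.randn(len(dates), n_stocks) * 0.01,
                             index=dates, columns=tickers)

# Compound daily returns within each calendar month: 24 monthly returns per stock.
monthly_returns = (1 + daily_returns).resample('M').prod() - 1

# One row per stock, one column per month, used directly as features in place
# of the 50 PCA loadings. (With purely random data and this eps, DBSCAN will
# typically label everything as noise; the point here is the feature construction.)
X = StandardScaler().fit_transform(monthly_returns.T.values)
labels = DBSCAN(eps=1.9, min_samples=3).fit_predict(X)
print(X.shape, 'clusters found:', labels.max() + 1)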

@Olive,

It's a big stretch, in my opinion, to say that this (large) number of principal components has any underlying economic meaning. Maybe the first 3 can proxy for latent factors -- the yield curve can be decomposed into parallel shifts, tilts, and twists, using PCA, for example -- but 50?
This still feels like a data mining exercise, with the main benefit being computational efficiency.

I made an affirmative choice to use 50 factors. Commercial risk models for the US equity market typically have between 20 and 75 factors. For example, the Barra US Total Market Model has 75 factors (all with direct economic meaning); the Northfield US Short Term Equity Risk Model has 20 factors (purely statistical). The US treasury yield curve or major swap curves can be well explained by a very small number of factors, but the equity market is much more complex.

@Guy,

...even if I can see the artistry of the cluster separation, it does not mean I can extract worthwhile profits when they move from quadrant to quadrant...

...replace the leet with none (no random seed), or each take a different seed. This would make the pinhole from which they will look at this Q1500US ball of variance unique. The outcome would be related to the very microsecond they launch the pairing clustering...

The clusters are deterministic and do not move. They are chosen by the DBSCAN algorithm which is deterministic. You are conflating the clustering with the visualization.

@Ashish,

How did you come up with the number of 1.9 in DBSCAN(eps=1.9, min_samples=3)?

I chose DBSCAN because, unlike KMeans, Agglomerative Clustering, etc., you do not need to specify the number of clusters and not all samples get clustered. The latter feature is attractive to me because it makes sense to me that not every stock will be so closely related to at least one other stock that a viable pair relationship would be sustained. It is likely that most stocks are "noise" with respect to this analysis and DBSCAN handles that well. I chose 3 as the min_samples parameter because I wanted to be able to find small clusters. In this application, it seems reasonable to me that tightly related stocks would exist in small clusters. Per the docs, "Epsilon is a distance value, so you can survey the distribution of distances in your dataset to attempt to get an idea of where it should lie." That's essentially what I did -- I did not tune the parameter programmatically; rather, I chose a few values and settled on this one once I found a setting that produced a handful of tight clusters. The result might be improved with a different approach. I was careful, though, not to spend a lot of time on tuning this parameter.
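
One way to do that survey of distances programmatically (a hedged sketch of a common heuristic, not what the notebook did): sort each point's distance to its k-th nearest neighbor and look for the "knee" of the curve when choosing eps.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance_profile(X, k=3):
    """Sorted distance of every point to its k-th nearest neighbor.

    Plotting this curve and choosing eps near the knee is a common DBSCAN
    heuristic; k is typically set to min_samples.
    """
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because the nearest "neighbor" is the point itself
    dists, _ = nbrs.kneighbors(X)
    return np.sort(dists[:, -1])

# Example on synthetic standardized features (a stand-in for the 52-column X).
rng = np.random.RandomState(0)
X = rng.randn(500, 52)
profile = k_distance_profile(X, k=3)
print(profile[:5], '...', profile[-5:])
# In the notebook's case one would eyeball (or plot) this curve and check where
# a chosen eps such as 1.9 sits relative to the sharp rise in k-distances.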

@ Grant,

On a separate note, I'd be careful not to dismiss the brute force approach out of hand due to the computational effort.

This example workflow is not meant to reduce the computational effort; rather it is meant to mitigate data mining and spurious results.

@ Jonathan, Delaney -

There's a lot for my little pea brain to take in here, but there appears to be a straightforward recipe for avoiding this multiple comparisons pitfall, called the Bonferroni correction. You are using a fixed p-value:

find_cointegrated_pairs(data, significance=0.05)

Unless there is a reason to think that the clustering has completely eliminated the multiple comparisons bias, shouldn't the p-value be divided by the number of candidate pairs within each cluster?

Trying the Bonferroni correction should be a minor tweak to your code:

def find_cointegrated_pairs(data, significance=0.05):  
    # This function is from https://www.quantopian.com/lectures/introduction-to-pairs-trading  
    n = data.shape[1]  
    score_matrix = np.zeros((n, n))  
    pvalue_matrix = np.ones((n, n))  
    keys = data.keys()  
    pairs = []  
    for i in range(n):  
        for j in range(i+1, n):  
            S1 = data[keys[i]]  
            S2 = data[keys[j]]  
            result = coint(S1, S2)  
            score = result[0]  
            pvalue = result[1]  
            score_matrix[i, j] = score  
            pvalue_matrix[i, j] = pvalue  
            if pvalue < significance:  
                pairs.append((keys[i], keys[j]))  
    return score_matrix, pvalue_matrix, pairs  

What needs to change?

As is stated here, "The correction comes at the cost of increasing the probability of producing false negatives, i.e., reducing statistical power." Elsewhere, I've read that for a large number of comparisons (e.g. brute-force pairs search), the p-value ends up so small that the correction tends to reduce the statistical power to zero. But here we have a more modest number of pairs, and your hypothesis is that some of them should be the real deal, so maybe the Bonferroni correction would be appropriate (perhaps multiplying the resulting p-value by a hyperparameter scale value, if too few pairs result). Generally, I'd think that if the clusters are of unequal size, then you shouldn't be using the same p-value for each (since the risk of multiple comparison bias goes up with cluster size).

it is meant to mitigate data mining and spurious results

If that is the premise, then you'll need to compare the brute-force method to this one. If both methods yield zilch in terms of profitability (or if the new method provides negligible benefit), then no progress has been made. Since you have data going back to 2002, you can make the comparison without waiting for out-of-sample data to roll in (if you are careful in your methodology). My basic point is that if you are wanting to show that the proposed method is better, you have to say better than what, and quantify the improvement.

Bonferroni is a great way to correct your p-values when running a smaller number of tests. It tends to be over-conservative when running many tests and can quickly reduce a process to one that is way too under-sensitive and never triggers.

If you were to run it on 1M pairs you'd be dividing the p-values by 1M, which would result in an incredibly low chance a good pair ever made it through. This is the fundamental problem with multiple comparisons. Generally it's best to intelligently reduce the number of tests that are being done.

Generally it's best to intelligently reduce the number of tests that are being done.

O.K. Sounds good. But if the number of tests is more than one, then a correction for multiple comparisons would need to be applied, right? So I'm wondering if the p = 0.05 used above is too high.

Here's my attempt to apply the Bonferroni correction to Jonathan's notebook, accounting for the within-cluster multiple comparisons bias, at a significance of p = 0.05. Did I apply it correctly?

The result, as one might expect, is that we end up with fewer pairs--only two! And interestingly, there are 4 unique tickers. All 4 are in the energy business, so I guess at some level, it makes sense. Is there something unique about the energy business that would tend to result in statistically significant pairs (over other businesses)?

We have to distinguish between exploratory analysis and definitive testing. The analysis above is exploratory and intended to suggest hypotheses. As such the certainty doesn't have to be as rigorous. After you had a specific set of pairs you had validated, then you would want to definitively test using all the proper correction factors.

In exploratory analysis we face higher risk of the test saying no to a good pair, so we are okay with it being overly sensitive.
In definitive testing we face risk of trading something not really a pair, so we want to make sure the test is very specific and apply all correction factors.

Hi Delaney -

It is potentially an incremental improvement to Jonathan's code:

def find_cointegrated_pairs(data, significance=0.05):  
    # This function is from https://www.quantopian.com/lectures/introduction-to-pairs-trading  
    n = data.shape[1]  
    c = n*(n-1)/2 # Bonferroni correction  
    score_matrix = np.zeros((n, n))  
    pvalue_matrix = np.ones((n, n))  
    keys = data.keys()  
    pairs = []  
    for i in range(n):  
        for j in range(i+1, n):  
            S1 = data[keys[i]]  
            S2 = data[keys[j]]  
            result = coint(S1, S2)  
            score = result[0]  
            pvalue = result[1]  
            score_matrix[i, j] = score  
            pvalue_matrix[i, j] = pvalue  
            if pvalue < significance/c:  
                pairs.append((keys[i], keys[j]))  
    return score_matrix, pvalue_matrix, pairs  

If I've done it correctly, the Bonferroni correction is applied to each cluster. To get more pairs, one can simply increase the p-value. For example, I've increased it to p = 0.2 in the attached notebook, and now get 6 pairs:

[(Equity(1665 [CMS]), Equity(21964 [XEL])), (Equity(5792 [PCG]), Equity(18584 [LNT])),
(Equity(1023 [BOH]), Equity(11215 [UMBF])),
(Equity(3675 [EQC]), Equity(11478 [FR])),
(Equity(3675 [EQC]), Equity(33026 [DCT])),
(Equity(3675 [EQC]), Equity(39204 [PDM]))]

Generally, my point is that the pair search algorithm should account for cluster size, since if I understand this multiple comparisons jazz correctly, the probability of spitting out spurious pairs goes up with cluster size.

I'm still getting up the learning curve, but I gather that one would run so-called ROC curves, to compare various approaches (e.g. brute force, clustering with no Bonferroni correction, clustering with Bonferroni correction, etc.). Basically, the notebook is spitting out pairs that are hypothetically trade-able. So, with a way to determine the true trade-ability of the pairs, one could do a set of ROC curves, to compare the power of detecting pairs. How would one score the trade-ability of a pair?

Said -

Note that Jonathan is no longer with Quantopian (although I guess he could still be active as a user...haven't seen any activity since he left, however).

I didn't quite get why Jonathan would apply a PCA to the returns series. Isn't this wrapping up too much valuable information, largely reducing the significance of the sample? I mean, the sample is large in the cross-sectional dimension, but really small in time (the quantity of returns involved in the analysis).

What a pity Jonathan Larkin is no longer around to field questions! I really regret missing that window of opportunity.

I find it fascinating that he chose to use the Principal Components as factors!! But it is also tantalizing, because I only have conjectures as to why this might work and the author of the algo is not here to explain.

My conjecture: a Principal Component eigenvector can be thought of as a vector of weights that indicates how much each of the original features contributes to the principal component, so I'm guessing that the eigenvector identifies which stocks covary along its axis in the reduced-dimension space. This is because the eigenvectors are aligned along the axes of greatest variance, and so, presumably, two stocks covarying along the new axis would both have large weights for the features that most contribute to defining the new axis. Is my conjecture remotely correct? Do the principal components make good factors because they identify covarying stocks? Of course, covariance is not enough. We must also test for cointegration, but stocks that covary a lot along a principal component axis might make good candidates for a cointegration test. Am I hot, cold or lukewarm?

The following sentences by Jonathan Larkin were extremely rich. He said:

The financial interpretation is that the pca.components_.T are per stock betas to hidden statistical factors. When you go through with fit_transform you get the time series of each hidden factor. I don't want those. I simply want to cluster on the betas.

Mr. Delaney MacKenzie, could you possibly be persuaded to unpack this statement by Jonathan Larkin?

What does Jonathan mean when he refers to the principal component eigenvectors as "per stock betas"?

Hello John,

My understanding is also not perfect, as I haven't reviewed the above in a while, but let me know if this is helpful.

You already unpacked most of Jonathan's intent in your correct diagnosis of why PCA was chosen here. By finding common synthetic underlying factors which explain maximal variance of all the stocks, we're effectively hypothesizing that there are some hidden causal elements of stock movements. For instance, one would hope that in doing a PCA of all stocks in the US market, one would get a factor very similar to the SPY ETF.

By looking at how each real pricing series is expressed as a linear combination of the underlying factors, we're looking at how much each stock has beta exposure to each hidden factor. This is no different from any other beta, as it's just how dependent series Y is on explanatory series X (https://www.quantopian.com/lectures/factor-risk-exposure).

In this case we're looking for stocks with similar factor exposure, as that would imply that there are similar underlying causal factors, and therefore likelihood of cointegration.

Apologies if this isn't helpful.
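
A quick synthetic check of the "first factor looks like the market" intuition above (an assumed construction for illustration, not a claim about real data): bake a common market factor into the returns, fit PCA, and correlate the first extracted factor with it.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
n_days, n_stocks = 504, 300

market = rng.randn(n_days) * 0.01                 # hidden "market" factor
betas = 0.5 + rng.rand(n_stocks)                  # each stock's beta to it
returns = np.outer(market, betas) + rng.randn(n_days, n_stocks) * 0.01

pca = PCA(n_components=5).fit(returns)
first_factor = pca.transform(returns)[:, 0]       # time series of PC1

corr = np.corrcoef(first_factor, market)[0, 1]
print('|corr(PC1, market)| = %.2f' % abs(corr))   # typically well above 0.9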

Very helpful, Delaney. I'm quite grateful that you took time to respond.

This sentence nailed it for me: "By looking at how each real pricing series is expressed as a linear combination of the underlying factors, we're looking at how much each stock has beta exposure to each hidden factor."

Very interesting.

Hi Quantopian community, fairly new to the platform as well as quantitative research. Really excellent notebook to work through and really interesting to explore the concepts used.

Over the past few weeks, I've been exploring sklearn.decomposition.PCA and sklearn.cluster.DBSCAN source in an attempt to fully understand what's going on in the notebook. I have a variety of questions, which are summarised below.

  1. For this application, should we necessarily be concerned with reducing dimensionality? Could we potentially be neglecting factors with low, but significant variance?
  2. My understanding is that PCA assumes the interrelated variables are ‘related’ due to common exposures to important factors, and that these factors are diluted with noise? But if we assume that stock pricing data has a low Signal to Noise ratio, by omitting low variance components, are we clustering primarily on volatility/noise?
  3. Using n_components = 50, the model manages to explain roughly 60% of the variance of the original sample. How do we know 60% is sufficient? (See the sketch after this post for one way to inspect this.)
  4. If these components are projected back into the original space, (shown in the attached notebook) does n_components = 50 sufficiently describe the original data? Is this important and how would you go about assessing this statistically?

Thanks in advance.
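
On question 3, a hedged sketch of the standard diagnostic: inspect the cumulative explained variance ratio and see how many components a given threshold would require (synthetic returns here, so the numbers are only illustrative).

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
n_days, n_stocks = 504, 400
# Synthetic returns with a handful of strong common factors plus noise.
factors = rng.randn(n_days, 10)
loadings = rng.randn(10, n_stocks)
returns = factors.dot(loadings) * 0.005 + rng.randn(n_days, n_stocks) * 0.01

pca = PCA(n_components=50).fit(returns)
cum_var = np.cumsum(pca.explained_variance_ratio_)

print('variance explained by 50 components: %.0f%%' % (100 * cum_var[-1]))
print('components needed for 60%%: %d' % (np.searchsorted(cum_var, 0.60) + 1))
# There is no universally "sufficient" level; one common alternative is to pass
# a float to PCA (e.g. PCA(n_components=0.90)) and let scikit-learn keep as many
# components as needed to reach that fraction of the variance.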

Hi Jonathan,

Excellent notebook and really interesting to apply machine learning technology to explore the old concept.

I have one question after reading through your notebook:

In your DBSCAN method, you set "eps = 1.9". Could you explain a little more about this?

Thanks in advance,

Hi Jonathan,

Can this be applied to the IDE? As in, how can one use this method of identifying pairs in the IDE to rebalance and recreate a new portfolio of pairs on a monthly rolling basis? Is it possible to do this?

In my opinion the notebook is explained very unclearly. I read sentences that try to justify the procedure more philosophically than mathematically.
When you mix the "scores matrix" with "Market Cap" and "Financial Health", it seems to me like mixing wine with Red Bull.
"Let's add some fundamental values as well to make the model more robust."? You can do it, yes. You can add any other variable that has nothing to do with it, still apply DBSCAN afterwards, and get clusters; you can even paint them and confirm that the points appear to be together. You can even obtain an algorithm with some gain in this way during a certain time frame. But without clarifying the decisions and the background of the procedure, it does not contribute much.
I am not saying whether the algorithm is correct or not; I don't know. I am saying that I do not understand the procedure, and that what is done should be explained and justified.
Is there anyone in this group who can clarify and elaborate on the notebook steps further, please?