
Introduction to Pairs Trading

By Manuel Morales, Delaney Mackenzie and Maxwell Margenot

Part of the Quantopian Lecture Series, presented at the Formfintech/Quantopian Montreal Workshop.

Pairs trading is a classic example of a strategy based on mathematical analysis. The principle is as follows. Let's say you have a pair of securities X and Y that have some underlying economic link. An example might be two companies that manufacture the same product, or two companies in one supply chain. If we can model this economic link with a mathematical model, we can make trades on it. We'll start by constructing a toy example.

Before we proceed, note that the content in this lecture depends heavily on the Stationarity, Integration, and Cointegration lecture in order to properly understand the mathematical basis for the methodology that we employ here. It is recommended that you go through that lecture before continuing with this one.

In [34]:
import numpy as np
import pandas as pd

import statsmodels
import statsmodels.api as sm
from statsmodels.tsa.stattools import coint, adfuller
# just set the seed for the random number generator
np.random.seed(107)

import matplotlib.pyplot as plt

Generating Two Fake Securities

We model X's daily returns by drawing from a normal distribution. Then we perform a cumulative sum to get the value of X on each day.

In [35]:
X_returns = np.random.normal(0, 1, 100) # Generate the daily returns
# sum them and shift all the prices up into a reasonable range
X = pd.Series(np.cumsum(X_returns), name='X') + 50
#print X
X.plot();

Now we generate Y. Remember that Y is supposed to have a deep economic link to X, so the price of Y should vary pretty similarly. We model this by scaling X slightly, shifting it up, and adding some random noise drawn from a normal distribution.

In [40]:
some_noise = np.random.normal(0, 1, 100)
Y = 0.95*X + 5 + some_noise
Y.name = 'Y'
#Y.plot()
pd.concat([X, Y], axis=1).plot();

Cointegration

We've constructed an example of two cointegrated series. Cointegration is a more subtle relationship than correlation. If two time series are cointegrated, there is some linear combination of them that varies around a fixed mean.

For more details on how we formally define cointegration and how to understand it, please see the Integration, Cointegration, and Stationarity lecture from the Quantopian Lecture Series.

We'll plot the difference between the two now so we can see how this looks.

In [4]:
(Y - X).plot() # Plot the spread
plt.axhline((Y - X).mean(), color='red', linestyle='--') # Add the mean
plt.xlabel('Time')
plt.legend(['Price Spread', 'Mean']);

Stationarity

Formally, let ${\displaystyle \left\{X_{t}\right\}}$ be a stochastic process and let ${\displaystyle F_{X}(x_{t_{1}+\tau },\ldots ,x_{t_{k}+\tau })}$ represent the cumulative distribution function of the joint distribution of ${\displaystyle \left\{X_{t}\right\}}$ at times ${\displaystyle t_{1}+\tau ,\ldots ,t_{k}+\tau }$. Then, ${\displaystyle \left\{X_{t}\right\}}$ is said to be strictly (or strongly) stationary if, for all ${\displaystyle k}$, for all ${\displaystyle \tau }$, and for all ${\displaystyle t_{1},\ldots ,t_{k}}$,

$${\displaystyle F_{X}(x_{t_{1}+\tau },\ldots ,x_{t_{k}+\tau })=F_{X}(x_{t_{1}},\ldots ,x_{t_{k}}).}$$

Since ${\displaystyle \tau }$ does not affect ${\displaystyle F_{X}(\cdot )}$, ${\displaystyle F_{X}}$ is not a function of time.

A commonly untested assumption in time series analysis is the stationarity of the data. Data are stationary when the parameters of the data generating process do not change over time.

Unit Root Series

Consider a discrete-time stochastic process ${\displaystyle \{y_{t},t=1,\ldots ,\infty \}}$, and suppose that it can be written as an autoregressive process of order $p$:

$${\displaystyle y_{t}=a_{1}y_{t-1}+a_{2}y_{t-2}+\cdots +a_{p}y_{t-p}+\varepsilon_{t}.}$$

Here, ${\displaystyle \{\varepsilon_{t}, t=0,\ldots ,\infty \}}$ is a serially uncorrelated, zero-mean stochastic process with constant variance ${\displaystyle \sigma ^{2}}$. For convenience, assume ${\displaystyle y_{0}=0}$. If ${\displaystyle m=1}$ is a root of the characteristic equation:

$${\displaystyle m^{p}-m^{p-1}a_{1}-m^{p-2}a_{2}-\cdots -a_{p}=0}$$

then the stochastic process has a unit root or, alternatively, is integrated of order one, denoted ${\displaystyle I(1)}$. If $m = 1$ is a root of multiplicity $r$, then the stochastic process is integrated of order $r$, denoted $I(r)$.

Such a process is non-stationary but does not always have a trend.
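To make this concrete, here is a minimal sketch (not part of the original notebook's flow) comparing a stationary AR(1) process with $a_1 = 0.5$ against a unit root process with $a_1 = 1$ (a random walk), using the adfuller test imported above. The coefficient values are assumed for illustration.

# Simulate an AR(1) process y_t = a * y_{t-1} + eps_t for two values of a
def simulate_ar1(a, n=500):
    y = np.zeros(n)
    eps = np.random.normal(0, 1, n)
    for t in range(1, n):
        y[t] = a * y[t-1] + eps[t]
    return pd.Series(y)

stationary = simulate_ar1(0.5)   # root of m - 0.5 = 0 is m = 0.5: stationary
unit_root = simulate_ar1(1.0)    # m = 1 is a root: a random walk, I(1)

print('AR(1) with a=0.5, ADF p-value: ' + str(adfuller(stationary)[1]))  # small: reject unit root
print('Random walk, ADF p-value: ' + str(adfuller(unit_root)[1]))        # large: cannot reject unit root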

Testing for Stationarity

Now we want to check for stationarity using a statistical test.

In [41]:
def check_for_stationarity(X, cutoff=0.01):
    # H_0 in adfuller is that a unit root exists (non-stationary)
    # We must observe a significant p-value to convince ourselves that the series is stationary
    pvalue = adfuller(X)[1]
    if pvalue < cutoff:
        print('p-value = ' + str(pvalue) + ' The series is likely stationary.')
        return True
    else:
        print('p-value = ' + str(pvalue) + ' The series is likely non-stationary.')
        return False
In [42]:
adfuller?
In [43]:
check_for_stationarity(Y-X);
p-value = 6.86003265272e-15 The series is likely stationary.

Order of Integration

Moving Average Representation/Wold's Theorem

An important concept in time series analysis is moving average representation. We will discuss this briefly here, but a more complete explanation is available in the AR, MA, and ARMA Models lectures of the Quantopian Lecture Series. Also check Wikipedia as listed below.

This representation expresses any time series $Y_t$ as

$$Y_t = \sum_{j=0}^\infty b_j \epsilon_{t-j} + \eta_t$$

  • $\epsilon$ is the 'innovation' series
  • $b_j$ are the moving average weights of the innovation series
  • $\eta$ is a deterministic series

The key here is as follows. $\eta$ is deterministic, such as a sine wave. Therefore we could model it perfectly. The innovation process is stochastic and is there to simulate new information occurring over time. Specifically, $\epsilon_t = Y_t - \hat Y_t$, where $\hat Y_t$ is the optimal forecast of $Y_t$ using only information from before time $t$. In other words, the best prediction you can make at time $t-1$ cannot account for the randomness in $\epsilon$.

Each $b_j$ just says how much previous values of $\epsilon$ influence $Y_t$.
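As a minimal sketch (with assumed illustrative weights $b_j = 0.5^j$, truncated at 20 lags, and a sine wave standing in for $\eta$), such a series can be simulated directly:

# Sketch: build Y_t from a truncated moving average representation.
# b_j = 0.5**j are assumed illustrative weights; eta is a deterministic sine wave.
n = 300
innovations = np.random.normal(0, 1, n)
b = 0.5 ** np.arange(20)                   # truncate the infinite sum at 20 lags
eta = np.sin(np.linspace(0, 10*np.pi, n))  # deterministic component

ma_part = np.convolve(innovations, b)[:n]  # sum_j b_j * epsilon_{t-j}
Y_wold = pd.Series(ma_part + eta, name='Y_wold')
Y_wold.plot();

Because the weights decay geometrically, $\sum_k |b_k|^2$ is finite here, which is exactly the $I(0)$ condition discussed next.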

Back to Order of Integration

We will denote a time series that is integrated of order $i$ as $I(i)$.

A time series is said to be $I(0)$ if the following condition holds in its moving average representation; in hand-wavy English, this means that the autocorrelation of the series decays sufficiently quickly:

$$\sum_{k=0}^\infty |b_k|^2 < \infty$$

This property turns out to be true of all stationary series, but by itself is not enough for stationarity to hold. This means that stationarity implies $I(0)$, but $I(0)$ does not imply stationarity. For more on orders of integration, please see the following links.

  • https://en.wikipedia.org/wiki/Order_of_integration
  • https://en.wikipedia.org/wiki/Wold%27s_theorem

Testing for $I(0)$

In practice testing whether the sum of the autocorrelations is finite may not be possible. It is possible in a mathematical derivation, but when we have a finite set of data and a finite number of estimated autocorrelations, the sum will always be finite. Given this difficulty, tests for $I(0)$ rely on stationarity implying the property. If we find that a series is stationary, then it must also be $I(0)$.

Integration

A time series is integrated of order $d$ if

$${\displaystyle (1-L)^{d}X_{t}}$$

is a stationary process, where ${\displaystyle L}$ is the lag operator and ${\displaystyle 1-L}$ is the first difference, i.e.

$${\displaystyle (1-L)X_{t}=X_{t}-X_{t-1}=\Delta X_{t}.}$$

In other words, a process is integrated of order $d$ if differencing it $d$ times yields a stationary process.
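As a quick sketch using the function defined above: a random walk (the cumulative sum of white noise) is $I(1)$; its level fails the stationarity test, but a single first difference passes it.

# A random walk is I(1): its level is non-stationary, but one first
# difference, (1-L)X_t = X_t - X_{t-1}, recovers a stationary series.
random_walk = pd.Series(np.cumsum(np.random.normal(0, 1, 500)))
check_for_stationarity(random_walk)
check_for_stationarity(random_walk.diff().dropna())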

Cointegration

Finally, now that we've discussed stationarity and order of integration, we can discuss cointegration.

Def: Linear Combination

A linear combination of the time series ($X_1$, $X_2$, $\dots$, $X_k$) is a new time series $Y$ constructed as follows for any set of real numbers $b_1 \dots b_k$

$$Y = b_1X_1 + b_2X_2 + \dots + b_kX_k$$

Formal Definition

The formal definition of cointegration is as follows.

For some set of time series ($X_1$, $X_2$, $\dots$, $X_k$), if all series are $I(1)$, and some linear combination of them is $I(0)$, we say the set of time series is cointegrated.

Example

$X_1$, $X_2$, and $X_3$ are all $I(1)$, and $2X_1 + X_2 + 0X_3 = 2X_1 + X_2$ is $I(0)$. In this case the time series are cointegrated.

Intuition

The intuition here is that for some linear combination of the series, the result lacks much auto-covariance and is mostly noise. This is useful for cases such as pairs trading, in which we find two assets whose prices are cointegrated. Since the linear combination of their prices $b_1A_1 + b_2A_2$ is noise, we can bet on the relationship $b_1A_1 + b_2A_2$ mean reverting and place trades accordingly. See the Pairs Trading Lecture in the Quantopian Lecture Series for more information.

Testing for Cointegration

That's an intuitive definition, but how do we test for this statistically?

There are a bunch of ways to test for cointegration; this Wikipedia article describes some. In general, we're just trying to solve for the coefficients $b_1, \dots, b_k$ that will produce an $I(0)$ linear combination. If our best guess for these coefficients does not pass a stationarity check, then we reject the hypothesis that the set is cointegrated. This leads to a risk of Type II errors (false negatives), as we will not exhaustively test for stationarity on all coefficient combinations. However, Type II errors are generally okay here, as they are safe and do not lead us to make any wrong forecasts.

In practice a common way to do this for pairs of time series is to use linear regression to estimate $\beta$ in the following model.

$$Y = \alpha + \beta X + \epsilon$$

The idea is that if the two are cointegrated, we can remove $Y$'s dependency on $X$, leaving behind stationary noise. The combination $Y - \beta X = \alpha + \epsilon$ should be stationary.

There is a convenient cointegration test that lives in statsmodels.tsa.stattools.

Let's say that our confidence level is $0.05$. We should see a p-value below our cutoff, as we've artificially created two series that are the textbook definition of cointegration.

In [8]:
# compute the p-value of the cointegration test
# will inform us as to whether the spread between the 2 timeseries is stationary
# around its mean
score, pvalue, _ = coint(X, Y)
print(pvalue)
4.6942241488e-16
In [9]:
def check_for_cointegration(X, Y, cutoff=0.05):
    # Uses a unit-root test on the residuals to test for a cointegrated relationship
    # See Hamilton (1994) 19.2
    # H_0 is that there is no cointegration, i.e. that the residuals are a unit root series (non-stationary)
    # We must observe a significant p-value to convince ourselves that the series are cointegrated
    pvalue = coint(X, Y)[1]
    if pvalue < cutoff:
        print('p-value = ' + str(pvalue) + ' The series are likely cointegrated.')
        return True
    else:
        print('p-value = ' + str(pvalue) + ' The series are likely not cointegrated.')
        return False
In [10]:
check_for_cointegration(X,Y);
p-value = 4.6942241488e-16 The series are likely cointegrated.

Correlation vs. Cointegration

Correlation and cointegration, while theoretically similar, are not the same. To demonstrate this, we'll show examples of series that are correlated, but not cointegrated, and vice versa. To start let's check the correlation of the series we just generated.

In [11]:
X.corr(Y)
Out[11]:
0.94448171575816409

That's very high, as we would expect. But how would two series that are correlated but not cointegrated look?

Correlation Without Cointegration

A simple example is two series that just diverge.

In [12]:
X_returns = np.random.normal(1, 1, 100)
Y_returns = np.random.normal(2, 1, 100)

X_diverging = pd.Series(np.cumsum(X_returns), name='X')
Y_diverging = pd.Series(np.cumsum(Y_returns), name='Y')

pd.concat([X_diverging, Y_diverging], axis=1).plot();
In [13]:
print('Correlation: ' + str(X_diverging.corr(Y_diverging)))
score, pvalue, _ = coint(X_diverging, Y_diverging)
print('Cointegration test p-value: ' + str(pvalue))
Correlation: 0.993134380128
Cointegration test p-value: 0.884633444839

Cointegration Without Correlation

A simple example of this case is a normally distributed series and a square wave.

In [14]:
Y2 = pd.Series(np.random.normal(0, 1, 1000), name='Y2') + 20
Y3 = Y2.copy()
In [15]:
# Square wave: alternate between 30 and 10 every 100 points
for i in range(10):
    Y3[i*100:(i+1)*100] = 30 if i % 2 == 0 else 10
In [16]:
Y2.plot()
Y3.plot()
plt.ylim([0, 40]);
In [17]:
# correlation is nearly zero
print('Correlation: ' + str(Y2.corr(Y3)))
score, pvalue, _ = coint(Y2, Y3)
print('Cointegration test p-value: ' + str(pvalue))
Correlation: -0.0413040695809
Cointegration test p-value: 0.0

Sure enough, the correlation is incredibly low, but the p-value shows that these are cointegrated.

Hedging

Because you'd like to protect yourself from bad markets, short sales are often used to hedge long investments. Because a short sale makes money if the security sold loses value, and a long purchase makes money if a security gains value, one can go long parts of the market and short others. That way, if the entire market falls off a cliff, we'll still make money on the shorted securities and hopefully break even. In the case of two securities, we call it a hedged position when we are long one security and short the other.

The Trick: Where it all comes together

Because the securities drift toward and away from each other, there will be times when the distance between them is high and times when it is low. The trick of pairs trading comes from maintaining a hedged position across X and Y. If both securities go down, we neither make nor lose money, and likewise if both go up. We make money on the spread of the two reverting to the mean. To do this we'll watch for when X and Y are far apart, then short Y and long X. Similarly, we'll watch for when they're close together, and long Y and short X.

Going Long the Spread

This is when the spread is small and we expect it to become larger. We place a bet on this by longing Y and shorting X.

Going Short the Spread

This is when the spread is large and we expect it to become smaller. We place a bet on this by shorting Y and longing X.

Specific Bets

One important concept here is that we are placing a bet on one specific thing, and trying to reduce our bet's dependency on other factors such as the market.

Finding real securities that behave like this

The best way to do this is to start with securities you suspect may be cointegrated and perform a statistical test. If you just run statistical tests over all pairs, you'll fall prey to multiple comparison bias.

Here's a method to look through a list of securities and test for cointegration between all pairs. It returns a cointegration test score matrix, a p-value matrix, and any pairs for which the p-value was less than $0.05$.

WARNING: This will incur a large amount of multiple comparisons bias.

The methods for finding viable pairs all live on a spectrum. At one end there is the formation of an economic hypothesis for an individual pair. You have some extra knowledge about an economic link that leads you to believe that the pair is cointegrated, so you go out and test for the presence of cointegration. In this case you will incur no multiple comparisons bias. At the other end of the spectrum, you perform a search through hundreds of different securities for any viable pairs according to your test. In this case you will incur a very large amount of multiple comparisons bias.

Multiple comparisons bias is the increased chance of incorrectly generating a significant p-value when many tests are run. If 100 tests are run on random data, we should expect to see about 5 p-values below $0.05$. Because we will perform $n(n-1)/2$ comparisons, we should expect to see many incorrectly significant p-values. For the sake of example we will ignore this and continue. In practice a second verification step would be needed if looking for pairs this way. Another approach is to pick a small number of pairs you have reason to suspect might be cointegrated and test each individually. This will result in less exposure to multiple comparisons bias. You can read more about multiple comparisons bias here.
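To see this bias in action, here is a minimal sketch (counts will vary from run to run) that applies the cointegration test to every pair of independent random walks; any "significant" result is a false positive by construction.

# Sketch: coint() on all pairs of independent random walks. No pair is truly
# cointegrated, so every p-value below 0.05 is a false positive produced
# purely by running many comparisons.
n_series, n_obs = 10, 250
random_walks = pd.DataFrame(np.cumsum(np.random.normal(0, 1, (n_obs, n_series)), axis=0))

n_tests, false_positives = 0, 0
for i in range(n_series):
    for j in range(i+1, n_series):
        n_tests += 1
        if coint(random_walks[i], random_walks[j])[1] < 0.05:
            false_positives += 1
print('False positives: ' + str(false_positives) + ' out of ' + str(n_tests) + ' tests')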

In [44]:
def find_cointegrated_pairs(data):
    n = data.shape[1]
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    keys = data.keys()
    pairs = []
    for i in range(n):
        for j in range(i+1, n):
            S1 = data[keys[i]]
            S2 = data[keys[j]]
            result = coint(S1, S2)
            score = result[0]
            pvalue = result[1]
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < 0.05:
                pairs.append((keys[i], keys[j]))
    return score_matrix, pvalue_matrix, pairs

Looking for Cointegrated Pairs of Energy Securities

We are looking through a set of securities to see if any of them are cointegrated. We'll start by defining the list of securities we want to look through. Then we'll get the pricing data for each security for 2015.

Our approach here is somewhere in the middle of the spectrum that we mentioned before. We have formulated an economic hypothesis that there is some sort of link between a subset of securities within the energy sector and we want to test whether there are any cointegrated pairs. This incurs significantly less multiple comparisons bias than searching through hundreds of securities and slightly more than forming a hypothesis for an individual test.

NOTE: We include the market in our data. This is because the market drives the movement of so many securities that you often might find two seemingly cointegrated securities; in reality they are not cointegrated with each other, but are both cointegrated with the market. This is known as a confounding variable, and it is important to check for market involvement in any relationship you find.

get_pricing() is a Quantopian method that pulls in stock data and loads it into a pandas Panel object. Available fields are 'price', 'open_price', 'high', 'low', and 'volume'. For this example we will just use 'price', which is the daily closing price of the stock.

In [13]:
symbol_list = ['C','ABGB', 'AXP', 'INTC', 'IBM', 'FSLR', 'MMM', 'JNJ', 'XOM', 'CAT','SPY']
prices_df = get_pricing(symbol_list, fields=['price']
                               , start_date='2015-01-01', end_date='2016-01-01')['price']
prices_df.columns = map(lambda x: x.symbol, prices_df.columns)

Here is how to view the prices of all the stocks loaded using get_pricing() above in one pandas DataFrame object.

In [14]:
prices_df.head()
Out[14]:
C ABGB AXP INTC IBM FSLR MMM JNJ XOM CAT SPY
2015-01-02 00:00:00+00:00 54.082 11.375 91.766 35.288 156.954 44.540 159.803 101.483 89.732 88.596 201.248
2015-01-05 00:00:00+00:00 52.367 10.812 89.350 34.910 154.465 41.850 156.275 100.861 87.344 83.901 197.720
2015-01-06 00:00:00+00:00 50.532 11.860 87.397 34.240 151.105 40.843 154.541 100.278 86.802 83.323 195.790
2015-01-07 00:00:00+00:00 51.021 12.782 89.266 34.978 150.079 41.750 155.720 102.522 87.711 84.672 198.240
2015-01-08 00:00:00+00:00 51.768 12.229 90.552 35.618 153.410 43.630 159.296 103.318 89.142 85.511 201.747

Here is how to view the prices of a single stock loaded using get_pricing() above.

In [54]:
prices_df['SPY'].head()
Out[54]:
2015-01-02 00:00:00+00:00    201.248
2015-01-05 00:00:00+00:00    197.720
2015-01-06 00:00:00+00:00    195.790
2015-01-07 00:00:00+00:00    198.240
2015-01-08 00:00:00+00:00    201.747
Freq: C, Name: SPY, dtype: float64

Now we'll run our method on the list and see if any pairs are cointegrated.

In [15]:
# Heatmap to show the p-values of the cointegration test between each pair of
# stocks. Only show the value in the upper-diagonal of the heatmap
scores, pvalues, pairs = find_cointegrated_pairs(prices_df)
import seaborn
seaborn.heatmap(pvalues, xticklabels=symbol_list, yticklabels=symbol_list, cmap='RdYlGn_r' 
                #, mask = (pvalues >= 0.05)
                )
print(pairs)
[(u'ABGB', u'CAT'), (u'AXP', u'IBM'), (u'AXP', u'CAT'), (u'IBM', u'CAT'), (u'MMM', u'XOM')]

Looks like 'MMM' and 'XOM' are cointegrated. Let's take a look at the prices to make sure there's nothing weird going on.

In [16]:
S1 = prices_df['MMM']
S2 = prices_df['XOM']
In [17]:
S1.head()
Out[17]:
2015-01-02 00:00:00+00:00    159.803
2015-01-05 00:00:00+00:00    156.275
2015-01-06 00:00:00+00:00    154.541
2015-01-07 00:00:00+00:00    155.720
2015-01-08 00:00:00+00:00    159.296
Freq: C, Name: MMM, dtype: float64
In [18]:
score, pvalue, _ = coint(S1, S2)
pvalue
Out[18]:
0.025653049938503331
In [19]:
check_for_cointegration(S1,S2);
p-value = 0.0256530499385 The series are likely cointegrated.
In [20]:
plt.plot(S1.index, S1.values)
plt.plot(S2.index, S2.values)
plt.ylabel('Price')
plt.legend([S1.name, S2.name]);

Calculating the Spread

Now we will plot the spread of the two series. In order to actually calculate the spread, we use linear regression to estimate the coefficient of the linear combination between our two securities, as shown in the stationarity lecture. Using linear regression to estimate this coefficient is known as the Engle-Granger method.

In [21]:
S1 = sm.add_constant(S1)
results = sm.OLS(S2, S1).fit()
S1 = S1['MMM']
b = results.params['MMM']
print(b)
0.558660696914
In [22]:
XX = sm.add_constant(X)
results = sm.OLS(Y, XX).fit()
b = results.params['X']
print(b)
0.907697424231
In [23]:
S1 = sm.add_constant(S1)
results = sm.OLS(S2, S1).fit()
S1 = S1['MMM']
b = results.params['MMM']

spread = S2 - b * S1
spread.plot()
plt.axhline(spread.mean(), color='black')
plt.legend(['Spread']);

Alternatively, we could examine the ratio between the two series.

In [41]:
ratio = S1/S2
ratio.plot()
plt.axhline(ratio.mean(), color='black')
plt.legend(['Price Ratio']);

Examining the price ratio of a trading pair is a traditional way to handle pairs trading. Part of why this works as a signal is based in our assumptions of how stock prices move, specifically because stock prices are typically assumed to be log-normally distributed. What this implies is that by taking a ratio of the prices, we are taking a linear combination of the returns associated with them (the log of the price ratio is the difference of the log prices, each of which is a cumulative sum of log returns).

This can be a little irritating to deal with for our purposes, as purchasing the precisely correct ratio of a trading pair may not be practical. We choose instead to move forward by simply calculating the spread between the cointegrated stocks using linear regression. This is a very simple way to handle the relationship, however, and is likely not feasible for non-toy examples. There are other potential methods for estimating the spread listed at the bottom of this lecture. If you want to get more into the theory of why having cointegrated stocks matters for pairs trading, again, please see the Integration, Cointegration, and Stationarity lecture from the Quantopian Lecture Series.
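As a hedged variant (a sketch, not the approach this notebook continues with), the same Engle-Granger regression step can be run on log prices, which lines up with the log-normal view above:

# Sketch: Engle-Granger step on log prices. The resulting spread is a
# linear combination of log prices, i.e. of cumulative log returns.
log_S1, log_S2 = np.log(S1), np.log(S2)
log_results = sm.OLS(log_S2, sm.add_constant(log_S1)).fit()
b_log = log_results.params['MMM']
log_spread = log_S2 - b_log * log_S1
log_spread.plot()
plt.axhline(log_spread.mean(), color='black')
plt.legend(['Log-Price Spread']);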

So, back to our example. The absolute spread isn't very useful in statistical terms. It is more helpful to normalize our signal by treating it as a z-score.

WARNING

In practice this is usually done to try to give some scale to the data, but this assumes an underlying distribution, usually a normal one. Under a normal distribution, a z-score of 1 would tell us that approximately 84% of all spread values are smaller. However, much financial data is not normally distributed, and one must be very careful not to assume normality, nor any specific distribution, when generating statistics. It could be the case that the true distribution of spreads is very fat-tailed and prone to extreme values. This could mess up our model and result in large losses.
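One possible mitigation, shown here as a sketch rather than used in the rest of this notebook, is a median/MAD-based score that is much less sensitive to fat tails:

def robust_zscore(series):
    # Median/MAD alternative to the mean/std z-score defined below. The
    # factor 1.4826 rescales the MAD to match a standard deviation under
    # normality, while staying robust to extreme values.
    med = series.median()
    mad = (series - med).abs().median()
    return (series - med) / (1.4826 * mad)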

In [24]:
def zscore(series):
    return (series - series.mean()) / np.std(series)
In [25]:
zscore(spread).plot()
plt.axhline(zscore(spread).mean(), color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--')
plt.legend(['Spread z-score', 'Mean', '+1', '-1']);

Simple Strategy:

  • Go "Long" the spread whenever the z-score is below -1.0
  • Go "Short" the spread when the z-score is above 1.0
  • Exit positions when the z-score approaches zero

This is just the tip of the iceberg, and only a very simplistic example to illustrate the concepts. In practice you would want to compute a more optimal weighting for how many shares to hold for S1 and S2. Some additional resources on pairs trading are listed at the end of this notebook.
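As a minimal sketch of those rules (equal-notional legs, no transaction costs, and an assumed exit band of 0.5 standard deviations):

# Sketch of the entry/exit rules above: +1 = long the spread (long S2,
# short b*S1), -1 = short the spread, 0 = flat. The 0.5 exit band is an
# assumed parameter for illustration.
z = zscore(spread)
signal = pd.Series(np.nan, index=z.index)
signal[z < -1.0] = 1
signal[z > 1.0] = -1
signal[abs(z) < 0.5] = 0
position = signal.ffill().fillna(0)   # hold each position until an exit or flip
position.plot();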

Trading using constantly updating statistics

In general, taking a statistic over your whole sample can be misleading. For example, if the market is moving up, and both securities with it, then your average price over the last 3 years may not be representative of today. For this reason traders often use statistics that rely on rolling windows of the most recent data.

Moving Averages

A moving average is just an average over the last $n$ datapoints for each given time. It will be undefined for the first $n$ datapoints in our series. Shorter moving averages will be more jumpy and less reliable, but respond to new information quickly. Longer moving averages will be smoother, but take more time to incorporate new information.

We also need to use a rolling beta, a rolling estimate of how our spread should be calculated, in order to keep all of our parameters up to date.

In [30]:
from statsmodels.regression.rolling import RollingOLS
RollingOLS?
In [26]:
# pd.ols has been removed from pandas; statsmodels' RollingOLS is the
# modern equivalent. Calculate a rolling 30-day beta of S2 on S1
# (with an intercept) and use it to construct the spread.
rolling_beta = RollingOLS(S2, sm.add_constant(S1), window=30).fit().params['MMM']
spread = S2 - rolling_beta * S1
spread.name = 'spread'
In [27]:
# Get the spread between the 2 stocks
# Calculate the rolling 30-day beta coefficient of S2 on S1
rolling_beta = RollingOLS(S2, sm.add_constant(S1), window=30).fit().params['MMM']
spread = S2 - rolling_beta * S1
spread.name = 'spread'

# Get the 1 day moving average of the price spread
spread_mavg1 = spread.rolling(window=1).mean()
spread_mavg1.name = 'spread 1d mavg'

# Get the 30 day moving average
spread_mavg30 = spread.rolling(window=30).mean()
spread_mavg30.name = 'spread 30d mavg'

plt.plot(spread_mavg1.index, spread_mavg1.values)
plt.plot(spread_mavg30.index, spread_mavg30.values)

plt.legend(['1 Day Spread MAVG', '30 Day Spread MAVG'])

plt.ylabel('Spread');

We can use the moving averages to compute the z-score of the spread at each given time. This will tell us how extreme the spread is and whether it's a good idea to enter a position at this time. Let's take a look at the z-score now.

In [46]:
# Take a rolling 30 day standard deviation
std_30 = spread.rolling(window=30).std()
std_30.name = 'std 30d'

# Compute the z-score for each day
zscore_30_1 = (spread_mavg1 - spread_mavg30)/std_30
zscore_30_1.name = 'z-score'
zscore_30_1.plot()
plt.axhline(0, color='black')
plt.axhline(-1.0, color='green', linestyle='--')
plt.axhline(1.0, color='red', linestyle='--');

The z-score doesn't mean much out of context, so let's plot it next to the prices to get an idea of what it looks like.

In [47]:
# Plot the prices scaled down along with the rolling z-score
# just divide the stock prices by 10 to make viewing them on the plot easier
plt.plot(S1.index, S1.values/10)
plt.plot(S2.index, S2.values/10)
plt.plot(zscore_30_1.index, zscore_30_1.values)
plt.legend(['S1 Price / 10', 'S2 Price / 10', 'Price Spread Rolling z-Score']);

Out of Sample Test

Now that we have constructed our spread appropriately and have an idea of how we will go about making trades, it is time to conduct some out of sample testing. Our whole model is based on the premise that these securities are cointegrated, but we built it on information from a certain time period. If we actually want to implement this model, we need to conduct an out of sample test to confirm that the principles of our model are still valid going forward.

Since we initially built the model on data from 2015, let's see if this cointegrated relationship holds for 2016. Historical results do not guarantee future results, so this is a sanity check to see if the work we have done holds up going forward.

In [31]:
symbol_list = ['MMM', 'XOM']
prices_df = get_pricing(symbol_list, fields=['price']
                               , start_date='2016-01-01', end_date='2017-01-01')['price']
prices_df.columns = map(lambda x: x.symbol, prices_df.columns)
In [32]:
S1 = prices_df['MMM']
S2 = prices_df['XOM']
In [33]:
score, pvalue, _ = coint(S1, S2)
print('p-value: ' + str(pvalue))
p-value:  0.337737244664

Unfortunately, since our p-value is above the cutoff of $0.05$, we conclude that our model will no longer be valid due to the lack of cointegration between our chosen securities. If we tried to deploy this model without the underlying assumptions holding, we would have no reason to believe that it would actually work. Out of sample testing is a vital step to make sure that our work will actually be viable in the market.

Implementation

When actually implementing a pairs trading strategy you would normally want to be trading many different pairs at once. If you find a good pair relationship by analyzing data, there is no guarantee that that relationship will continue into the future. Trading many different pairs creates a diversified portfolio to mitigate the risk of individual pairs "falling out of" cointegration.

There is a template algorithm attached to this lecture that shows an example of how you would implement pairs trading on our platform. Feel free to check it out and modify it with your own pairs to see if you can improve it.

Further Research

This notebook contained some simple introductory approaches. In practice one should use more sophisticated statistics, some of which are listed here.

  • Augmented Dickey-Fuller (ADF) test
  • Hurst exponent
  • Half-life of mean reversion inferred from an Ornstein–Uhlenbeck process (a sketch follows this list)
  • Kalman filters
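As one example from that list, here is a hedged sketch of the half-life calculation: regress the daily change of the spread on its lagged level, $\Delta s_t = c + \lambda s_{t-1} + \varepsilon_t$, and take $-\ln(2)/\lambda$, under the assumption that the spread follows a discretized Ornstein–Uhlenbeck process.

# Sketch: half-life of mean reversion, assuming the spread follows a
# discretized Ornstein-Uhlenbeck process. A negative lambda indicates
# mean reversion; the half-life is roughly how long a deviation takes
# to decay halfway back to the mean.
lagged_spread = spread.shift(1).dropna()
delta_spread = spread.diff().dropna()
ou_fit = sm.OLS(delta_spread, sm.add_constant(lagged_spread)).fit()
lam = ou_fit.params.iloc[1]        # coefficient on the lagged spread level
half_life = -np.log(2) / lam
print('Estimated half-life (days): ' + str(half_life))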

While this is not an endorsement, a very good practical resource for learning more about pairs trading is Dr. Ernie Chan's book, Algorithmic Trading: Winning Strategies and Their Rationale.

This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. ("Quantopian"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, Quantopian, Inc. has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to Quantopian, Inc. at the time of publication. Quantopian makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.