Notebook

Price correlation vs return correlation

Generally correlations between equity prices are meaningless and can actually be misleading. Check for correlations between log returns instead. One way to think of this is that one typically looks for securities with high or low returns. One typically doesn't care about high or low prices. Generally check for correlations between the variables one is looking for.

A slightly more technical reason is that prices are 'non-stationary'. They have trends. One requirement for practically all statistics functions (including correlation) is that the data is 'stationary' otherwise the results are meaningless. However, luckily, the 'first difference' (ie returns) are generally close to being stationary and can therefore yield potentially meaningful results.

So, let's check it out...

Let' create two 'random walk' sets of stock prices. These will simulate the prices of two stocks A and B. Below is a function that returns a series of prices based upon random Brownian motion.

In [316]:
import pandas as pd
import numpy as np

def brownian_motion_series(start_date, end_date, mu = 0.1, sigma = 0.01, start_price = 20):        
    dates = pd.bdate_range(start_date, end_date)
    days = dates.size
    
    # Approx number of years
    years = days/252.0
    
    # Express things in percentage of a trading year
    years_per_day = years/days
    
    # Get a time series of days expressed as percent of year
    t = np.linspace(0, years, days)
    
    log_returns = np.random.standard_normal(size = days) 
    cum_returns = np.cumsum(log_returns)*np.sqrt(years_per_day)
    
    # Normalized returns
    norm_price_change = ((mu-0.5 * sigma**2) * t) + (sigma * cum_returns)
    
    # geometric brownian motion
    prices = start_price * np.exp(norm_price_change)  
    return pd.Series(prices, index=dates)

Don't worry about the code. Let's just see it in action and use the function to make a pandas dataframe with our two sets of random prices.

In [317]:
prices_A = brownian_motion_series(
            start_date='2015-01-01', 
            end_date='2017-01-01', 
            mu=0.1, sigma=0.1, 
            start_price=90)

prices_B = brownian_motion_series(
            start_date='2015-01-01', 
            end_date='2017-01-01', 
            mu=0.1, sigma=0.1, 
            start_price=80)

prices_df = pd.DataFrame(data={'prices_A': prices_A, 'prices_B': prices_B}, index=prices_A.index)
prices_df.plot()
Out[317]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe1c6693fd0>

OK. Those look like stock charts we all are used to seeing. By re-running the cell above, one can generate different sets each time (they are random). Note that generally there will be trends. Prices will be going up or going down but often exhibit a trend. This is a feature of 'random walks'. This is also what makes this data 'non-stationary' and not suitable for most statistical calculations.

We'll look at the correlation between these prices a few cells down.

However, forst lets create a dataframe with the log returns of those prices and plot them.

In [318]:
returns_A = np.log(prices_df.prices_A).diff()
returns_B = np.log(prices_df.prices_B).diff()

log_return_df = pd.DataFrame(data={'returns_A': returns_A, 'returns_B': returns_B}, index=returns_A.index)
log_return_df.plot()
Out[318]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe1c6493250>

Again, this looks like the typical returns from typical stock prices. Notice this data seems more 'stationary'. It doesn't move up or down in a trend but rather looks like 'noise' centered around zero. Stationary data like this is more suited to typical statistical analysis.

Let's get right to the original question of how correlated prices are vs how correlated returns are. It may seem like they should be similar? If one is correlated then the other should be?

Wrong.

Below we calculate the spearman correlations (what one generally thinks of as correlation) by using the built in Pandas '.corr' method.

In [319]:
prices_corr_df = prices_df.corr()
prices_corr_df
Out[319]:
prices_A prices_B
prices_A 1.00000 0.90613
prices_B 0.90613 1.00000
In [320]:
log_return_corr_df = log_return_df.corr()
log_return_corr_df
Out[320]:
returns_A returns_B
returns_A 1.000000 0.035602
returns_B 0.035602 1.000000

The '.corr' method returns the cross correlations of all the columns. A column always has a perfect 1.0 correlation with itself so you see the diagonal with all ones. The other values are the correlatons we are looking for.

Let's make the output a little prettier...

Re-run this notebook several times by selecting 'run all' from the run menu. Each time this will calculate a new set of random stock prices. The correlation between prices and betwen returns will print below.

Notice that every time the return correlation is small (generally under .1) while the price correlation can vary widely and often shows a high correlation (perhaps above .75).

The returns were generated completely at random so it's no surprise they should show virtually no correlation after each run.

The prices generated from these random returns however, show varying (and misleading) correlations. This is a feature of their non-stationary 'random walk' nature. This is also why one should look at log return correlations and not price corelations.

In [321]:
#prices_corr = prices_df.corr()[0,1]
prices_corr = prices_corr_df.get_value('prices_A', 'prices_B')
returns_corr = log_return_corr_df.get_value('returns_A', 'returns_B')

print "price correlation: {:.2f} / return correlation: {:.2f}".format(prices_corr, returns_corr)
price correlation: 0.91 / return correlation: 0.04