
Strategy Research on Quantopian

A demonstration of one of several strategy development workflows using Quantopian's research and backtesting platform.

Overview

1. Look for an idea 
    - academia / intuition

2. Replicate or create results

3. Plausibility check
    - Holding period / signal decay
    - costs / slippage
    - feasibility at capital level
    - capacity
    - accept or go to 1.

4. Design an algorithm & Backtest
    - Repeat step 3

5. Make it your own
    - Apply your bread & butter
    - Repeat step 3

6. Paper trade
    - length of paper trading depends on turnover
    - Repeat step 3

7. Live trade with low stakes
    - Repeat step 3
        - Compare live results with backtested estimations
        - slippage / transactions
        - Non-fills: are they an issue?

8. Add to the portfolio and allocate capital

9. Go to 1.

Trading multiple strategies is how algorithmic traders diversify and scale their operation. Algorithmic execution allows traders to focus on research.

Looking For Ideas

For this demo I looked to an academic paper from MIT:

"Developing high-frequency equities trading models," by Leandro Rafael Infantino & Savion Itzhaki

Abstract of the Abstract:

Purpose

  • Show evidence that there are opportunities to generate alpha in the high frequency environment of the US equity market

How

  • Use Principal Component Analysis (PCA) on log returns as a basis for short-term valuation and prediction of market movements.

Trading Frequency / Holding Period

  • Approximately one second to five minutes.

Assumptions

  • Instant fills at mid-price

Replicating The Theoretical Model
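
A compact restatement of the model, in my own notation, mirroring the implementation in get_PCA_spreads below: let $r_t$ be the vector of one-minute log returns across the funds and $\bar{r}$ its sample mean. Fit PCA on the demeaned returns $y_t = r_t - \bar{r}$ and keep the top $k$ components $\Psi = [\psi_1, \ldots, \psi_k]$. The model's estimate of the returns is the rank-$k$ reconstruction

$\hat{r}_t = \bar{r} + \Psi \Psi^{\top} y_t$

and the spread is the part the common factors do not explain,

$S_t = r_t - \hat{r}_t$

The trading rule applied later fades this spread: short an asset whose latest return sits above the model estimate, and go long one that sits below it.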

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import itertools
import statsmodels as sm
from sklearn.decomposition import PCA

def pplot(df, **plot_kwargs):
    """Plot a DataFrame/Series by integer timestep instead of its
    DatetimeIndex (the resulting x-axes are labeled 'Timesteps')."""
    try:
        return pd.DataFrame(df.values, columns=df.columns).plot(**plot_kwargs)
    except AttributeError:  # a Series has no .columns
        return pd.Series(df.values).plot(**plot_kwargs)

Define a universe

Rather than exactly replicating the results from the paper, I will use a basket of liquid ETFs that span the market.

In [2]:
FUNDS = symbols([
    'SPY', # SPDR S&P 500,                             'Large Cap Blend Equities'
    'MDY', # SPDR MidCap Trust Series I,               'Mid Cap Blend Equities'
    'EWI', # iShares MSCI Italy Capped ETF,            'Europe Equities'
    'EWS', # iShares MSCI Singapore ETF,               'Asia Pacific Equities'
    'EWO', # iShares MSCI Austria Capped ETF,          'Europe Equities'
    'EWD', # iShares MSCI Sweden ETF,                  'Europe Equities'
    'EWL', # iShares MSCI Switzerland Capped ETF,      'Europe Equities'
    'EWQ', # iShares MSCI France ETF,                  'Europe Equities'
    'EWP', # iShares MSCI Spain Capped ETF,            'Europe Equities'
    'EWG', # iShares MSCI Germany ETF,                 'Europe Equities'
    'EWM', # iShares MSCI Malaysia ETF,                'Emerging Markets Equities'
    'EWK', # iShares MSCI Belgium Capped ETF,          'Europe Equities'
    'EWW', # iShares MSCI Mexico Capped ETF,           'Latin America Equities'
    'EWN', # iShares MSCI Netherlands ETF,             'Europe Equities'
    'EWC', # iShares MSCI Canada ETF,                  'Foreign Large Cap Equities'
    'EWH', # iShares MSCI Hong Kong ETF,               'China Equities'
    'EWJ', # iShares MSCI Japan ETF,                   'Japan Equities'
    'EWU', # iShares MSCI United Kingdom ETF,          'Europe Equities'
    'EWA', # MSCI Australia ETF,                       'Asia Pacific Equities'
    'DIA', # Dow Jones Industrial Average ETF,         'Large Cap Value Equities'
    'XLP', # Consumer Staples Select Sector SPDR,      'Consumer Staples Equities'
    'XLY', # Consumer Discretionary Select Sector SPDR,'Consumer Discretionary Equities'
    'XLB', # Materials Select Sector SPDR,             'Materials'
    'XLU', # Utilities Select Sector SPDR,             'Utilities Equities'
    'XLF', # Financial Select Sector SPDR,             'Financials Equities'
    'XLE', # Energy Select Sector SPDR,                'Energy Equities'
    'XLK', # Technology Select Sector SPDR,            'Technology Equities'
    'XLV', # Health Care Select Sector SPDR,           'Health & Biotech Equities'
    'XLI', # Industrial Select Sector SPDR,            'Industrials Equities'
    'QQQ', # QQQ,                                      'Large Cap Growth Equities'
    'EWY', # iShares MSCI South Korea Capped ETF,      'Asia Pacific Equities'
    'IVV', # Core S&P 500 ETF,                         'Large Cap Blend Equities'
    'IYW', # iShares U.S. Technology ETF,              'Technology Equities'
    'IWB', # iShares Russell 1000 ETF,                 'Large Cap Blend Equities'
    'IWD', # iShares Russell 1000 Value ETF,           'Large Cap Value Equities'
    'IVW', # iShares S&P 500 Growth ETF,               'Large Cap Growth Equities'
    'IYF', # iShares U.S. Financials ETF,              'Financials Equities'
    'IWV', # iShares Russell 3000 ETF,                 'All Cap Equities'
    'IYZ', # iShares U.S. Telecommunications ETF,      'Communications Equities'
    'IWF', # iShares Russell 1000 Growth ETF,          'Large Cap Growth Equities'
    'IWM', # iShares Russell 2000 ETF,                 'Small Cap Blend Equities'
    'IVE', # iShares S&P 500 Value ETF,                'Large Cap Value Equities'
    'IJH', # Core S&P Mid-Cap ETF,                     'Mid Cap Blend Equities'
    'IJR', # Core S&P Small-Cap ETF,                   'Small Cap Blend Equities'
    'IDU', # iShares U.S. Utilities ETF,               'Utilities Equities'
    'IYE', # iShares U.S. Energy ETF,                  'Energy Equities'
    'IYC', # iShares U.S. Consumer Services ETF,       'Consumer Discretionary Equities'
    'IYY', # iShares Dow Jones U.S. ETF,               'All Cap Equities'
    'IYM', # iShares U.S. Basic Materials ETF,         'Materials'
    'IYG', # iShares U.S. Financial Services ETF,      'Financials Equities'
    'IYK', # Dow Jones U.S. Consumer Goods Index Fund, 'Consumer Staples Equities'
    'IYH', # iShares U.S. Healthcare ETF,              'Health & Biotech Equities'
    'IYJ', # iShares U.S. Industrials ETF,             'Industrials Equities'
    'EWT', # iShares MSCI Taiwan ETF,                  'Asia Pacific Equities'
    'EWZ', # iShares MSCI Brazil Capped ETF,           'Latin America Equities'
    'IWN', # iShares Russell 2000 Value ETF,           'Small Cap Value Equities'
    'IJK', # iShares S&P Mid-Cap 400 Growth ETF,      'Mid Cap Growth Equities'
    'IJS', # iShares S&P Small-Cap 600 Value ETF,      'Small Cap Value Equities'
    'IJT', # iShares S&P Small-Cap 600 Growth ETF,     'Small Cap Growth Equities'
    'IJJ', # iShares S&P Mid-Cap 400 Value ETF,        'Mid Cap Value Equities'
    'IWO', # iShares Russell 2000 Growth ETF,          'Small Cap Growth Equities' 
    'IBB', # Nasdaq Biotechnology,                     'Health & Biotech Equities'
    
    'SHY', # 1-3 Year Treasury Bond ETF                'Government Bonds'
    'IEF', # iShares 7-10 Year Treasury Bond ETF,      'Government Bonds'
    'TLT', # 20+ Year Treasury Bond ETF,               'Government Bonds'
    'LQD', # iShares iBoxx Investment Grade            'Corporate Bonds'
    
    'IYR', # iShares U.S. Real Estate ETF              'Real Estate'
    'ICF', # iShares Cohen & Steers REIT ETF           'Real Estate'
    'RWR', # SPDR DJ Wilshire REIT ETF                 'Real Estate'
])

prices = get_pricing(FUNDS, 
                     start_date='2010-12-10', 
                     end_date='2011-01-04', 
                     frequency='minute', 
                     fields='price', 
                     handle_missing='ignore').ffill()

prices = prices.dropna()
log_prices = np.log(prices)
log_rets = log_prices.diff().dropna()
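
Log returns are time-additive, a property the compound_periods option in the model code below relies on. A quick sanity check with the data just loaded (a small sketch, not part of the original analysis):

k = 5
# the sum of the first k one-minute log returns equals the k-minute log return
print np.isclose(log_rets.iloc[:k, 0].sum(),
                 log_prices.iloc[k, 0] - log_prices.iloc[0, 0])  # -> True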

Visualize the Data

In [3]:
fig = pplot(
    log_prices / log_prices.iloc[0],
    title='Normalized Log Asset Prices',
    legend=False,
)
fig.set_xlabel("Timesteps")
Out[3]:
<matplotlib.text.Text at 0x7f23d8153cd0>

Do the Math!!

Thank god for numpy/scipy/pandas/sklearn
In [4]:
def get_PCA_spreads(prices, n_components=10, compound_periods=1):
    """
    Returns the spread between the log returns  
    and the model estimate.
    
    :Params:
        prices: pandas.DataFrame
            pricing data used to fit the model.
        n_components: integer
            number of principal components used
        compound_periods: integer
            log returns are summed over this many periods.
            Uses the fact that log returns are time additive.
            e.g. compound_periods=5 would correspond to using
                 5 minute compounded returns.
    :Returns:
        pandas.DataFrame
            actual returns - model estimations 
    """
    pca = PCA(n_components=n_components)
    X = np.log(prices).diff().iloc[1:]
    if compound_periods > 1:
        # pd.rolling_sum is the old pandas rolling API available on Quantopian
        X = pd.rolling_sum(X, compound_periods).iloc[compound_periods:]
    M = X.mean()
    Y = X - M
    model = pca.fit(Y)
    psi = model.components_.T
    D = pd.DataFrame(np.dot(psi.T, Y.T).T, index=Y.index) # same as model.transform(Y)
    # Regress each demeaned return series on the (orthogonal) component scores;
    # pd.ols is the old pandas OLS that was available on Quantopian.
    regressions = {}
    for asset in Y:
        y = Y[asset]
        x = D
        regressions[asset] = pd.ols(y=y, x=x)
    B = pd.DataFrame({s: regressions[s].beta 
                      for s in regressions}).drop('intercept')
    # Fitted values = rank-k reconstruction of the returns; the spread is
    # the portion of each return the factor model does not explain.
    S = D.dot(B)
    S_est = (S + M).dropna()
    spreads = (X - S_est).dropna()
    return spreads

n_components = 5
compound_periods = 0
spreads = get_PCA_spreads(prices, 
                          n_components=n_components,
                          compound_periods=compound_periods)
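
The function above leans on old pandas APIs (pd.ols, pd.rolling_sum) that the Quantopian research environment provided at the time. For reference, a rough modern-pandas/scikit-learn equivalent might look like the sketch below: since regressing each demeaned return series on the orthogonal component scores just recovers the PCA loadings, the fitted values equal the rank-k reconstruction, and pca.inverse_transform(pca.transform(...)) can replace the per-asset regression loop.

def get_pca_spreads_modern(prices, n_components=10, compound_periods=1):
    """Sketch: same spreads as get_PCA_spreads, written for current pandas/sklearn."""
    X = np.log(prices).diff().iloc[1:]
    if compound_periods > 1:
        # .rolling(...).sum() replaces the deprecated pd.rolling_sum
        X = X.rolling(compound_periods).sum().iloc[compound_periods:]
    Y = X - X.mean()
    pca = PCA(n_components=n_components).fit(Y)
    # rank-k reconstruction of the demeaned returns (pca.mean_ is ~0 here)
    Y_hat = pd.DataFrame(pca.inverse_transform(pca.transform(Y)),
                         index=Y.index, columns=Y.columns)
    # spread = actual return - model estimate
    return (Y - Y_hat).dropna()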

Visualize Output

In [5]:
fig = pplot(
    spreads.cumsum(),
    legend=False,
    title="Cumulative Spread Return: (Return - model estimate)"
)
fig.set_xlabel('Timesteps (minutes)')
fig.set_ylabel('Cumulative returns')
Out[5]:
<matplotlib.text.Text at 0x7f23d86d4690>
In [6]:
spreads.T.sum().hist(bins=150)
print sm.stats.stattools.jarque_bera(spreads.T.sum())  # (JB statistic, p-value, skew, kurtosis)

# Not even close to normally distributed, p-value = 0.0
(315541.18304497265, 0.0, 0.6413328134137101, 36.84933164317511)

Apply Prediction Transformation

$-\operatorname{sign}(S_{t-1}) \cdot R_t$

where $S_t$ is the spread at time $t$ and $R_t$ is the asset return at time $t$.

In [8]:
def make_sign_frame(signals, stepsize=1, idx0=0):
    """
    Returns a DataFrame with the same shape as signals,
    where the sign of each element has been copied.
    
    :Params:
        signals: Series/DataFrame
            signals centered around 0
        stepsize: integer
            Using stepsize=N copies the sign of every Nth
            time step; the remaining values are forward-filled.
        idx0: int
            the starting index used; previous elements
            will be NaNs.
    """
    signs = np.nan * signals
    signs.iloc[idx0::stepsize] = np.copysign(1, signals.iloc[idx0::stepsize])
    signs = signs.ffill()
    return signs

directions = -make_sign_frame(spreads)
predictions = (log_rets * directions.shift(1))

fig = pplot(
    predictions.cumsum(),
    legend=False,
    title="Asset Returns Using One Step Ahead Prediction"
)
fig.set_xlabel('Timesteps (minutes)')
fig.set_ylabel('Cumulative Asset Return')
Out[8]:
<matplotlib.text.Text at 0x7f23a110d250>
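
To make the stepsize argument concrete (it is used later to simulate longer holding periods), here is a tiny example with hypothetical signal values:

sig = pd.Series([0.2, -0.5, 0.1, -0.3, 0.4, -0.1])  # hypothetical signals
print make_sign_frame(sig, stepsize=3).tolist()
# -> [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]: only every 3rd sign is sampled, then forward-filled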
In [9]:
fig = pplot(
    predictions.T.sum().cumsum(),
    legend=False,
    title="Strategy Returns Using One Step Ahead Prediction"
)
fig.set_xlabel('Timesteps (minutes)')
fig.set_ylabel('Cumulative Strategy Return')
Out[9]:
<matplotlib.text.Text at 0x7f23a13ace90>

This looks good, but it's guilty of look-ahead bias: the PCA model was fit on the entire sample, so each signal uses information from the future.

We can run a naive backtest over a rolling window to get a less biased result.

In [10]:
# Re-fit the model each minute on a trailing 500-minute window and keep
# only the newest spread, so each signal uses only past data.
rolling_spreads = {}
for i in xrange(500, len(prices)):
    _prices = prices.iloc[i-500: i]
    rolling_spreads[prices.index[i]] = get_PCA_spreads(_prices).iloc[-1]
    
rolling_spreads = pd.DataFrame(rolling_spreads).T
In [11]:
directions = -make_sign_frame(rolling_spreads)
predictions = (log_rets * directions)

fig = pplot(
    predictions.cumsum(),
    legend=False,
    title="Asset returns (immediate execution)"
)
fig.set_xlabel('Timesteps (minutes)')
fig.set_ylabel('Cumulative Asset Return')
Out[11]:
<matplotlib.text.Text at 0x7f23a2fc7a50>
In [14]:
directions = -make_sign_frame(rolling_spreads)
predictions = (log_rets * directions).T.sum()
fig = pplot(
    predictions.cumsum(),
    title="Strategy returns (immediate execution)"
)

Stress Testing and Signal Decay

How does a longer holding period affect the results?

In [62]:
def get_turnover_from_weights(weights):
    """
    Returns the per-period strategy turnover, defined as half
    the sum of absolute changes in portfolio target weights.
    """
    return weights.diff().abs().T.sum() / 2

def get_proportional_signal_weights(signals, leverage=1.0):
    """
    Returns a set of portfolio weights proportional to the
    signal values, scaled to the given gross leverage.
    """
    return leverage * (signals.T / signals.T.abs().sum()).T

results = {}
turnover = {}
for i in range(1, 21):
    directions = -make_sign_frame(rolling_spreads, i)
    results[i] = (log_rets * directions).T.sum()
    weights = get_proportional_signal_weights(directions)
    turnover[i] = get_turnover_from_weights(weights)
    

results = pd.DataFrame(results)
turnover = pd.DataFrame(turnover)

fig = pplot(
    results.cumsum(),
    title="Returns Vs. Holding Period"
)
fig.set_xlabel('Timesteps (minutes)')
fig.set_ylabel('Cumulative Returns')
Out[62]:
<matplotlib.text.Text at 0x7f233beed950>
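
As a quick sanity check of the turnover definition above, with hypothetical weights: rotating half the book from one asset to another counts as 50% turnover.

w = pd.DataFrame({'A': [0.5, 0.0], 'B': [0.5, 1.0]})  # hypothetical target weights
print get_turnover_from_weights(w).iloc[1]  # -> 0.5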

Execution Delay

What happens as we wait longer to place orders?

In [18]:
results_delayed = {}
for i in range(0, 21):
    directions = -make_sign_frame(rolling_spreads, 1)
    results_delayed[i] = (log_rets * directions.shift(i)).T.sum()
    weights = get_proportional_signal_weights(directions)
    

results_delayed = pd.DataFrame(results_delayed)

fig = pplot(
    results_delayed.cumsum(),
    title="Returns Vs. Execution Delay"
)
fig.set_xlabel('Timesteps (minutes)')
fig.set_ylabel('Cumulative Returns')
Out[18]:
<matplotlib.text.Text at 0x7f23a0680c90>

Slippage/Turnover Analysis

In [ ]:
def annual_return(returns, style='compound'):
    """Determines the annual returns of a strategy.

    Parameters
    ----------
    returns : pd.Series
        Daily returns of the strategy, noncumulative.
         - See full explanation in tears.create_full_tear_sheet.
    style : str, optional
        - If 'compound', then return will be calculated in geometric
          terms: (1+mean(all_daily_returns))^252 - 1.
        - If 'calendar', then return will be calculated as
          ((last_value - start_value)/start_value)/num_of_years.
        - Otherwise, return is simply mean(all_daily_returns)*252.

    Returns
    -------
    float
        Annual returns.

    """
    if returns.size < 1:
        return np.nan

    if style == 'calendar':
        num_years = len(returns) / 252.0
        # NOTE: cum_returns is not defined in this notebook; this branch assumes
        # a pyfolio-style cum_returns helper and is never hit here (the default
        # 'compound' style is used throughout).
        df_cum_rets = cum_returns(returns, starting_value=100)
        start_value = df_cum_rets[0]
        end_value = df_cum_rets[-1]
        return ((end_value - start_value) / start_value) / num_years
    if style == 'compound':
        return pow((1 + returns.mean()), 252) - 1
    else:
        return returns.mean() * 252


def annual_volatility(returns):
    """
    Determines the annual volatility of a strategy.

    Parameters
    ----------
    returns : pd.Series
        Daily returns of the strategy, noncumulative.
         - See full explanation in tears.create_full_tear_sheet.

    Returns
    -------
    float
        Annual volatility.
    """

    if returns.size < 2:
        return np.nan

    return returns.std() * np.sqrt(252)


def sharpe_ratio(returns, returns_style='compound'):
    """
    Determines the Sharpe ratio of a strategy.

    Parameters
    ----------
    returns : pd.Series
        Daily returns of the strategy, noncumulative.
         - See full explanation in tears.create_full_tear_sheet.
    returns_style : str, optional
        See annual_returns' style

    Returns
    -------
    float
        Sharpe ratio.

    Note
    -----
    See https://en.wikipedia.org/wiki/Sharpe_ratio for more details.
    """

    numer = annual_return(returns, style=returns_style)
    denom = annual_volatility(returns)

    if denom > 0.0:
        return numer / denom
    else:
        return np.nan

Strategy Sharpe Ratios with Different Slippage Assumptions

Minute-level returns are annualized with a daily (252) factor here, so this is not a true Sharpe ratio, but the resulting figures are still very informative.

In [63]:
sharpes = results.apply(sharpe_ratio)

fig = sharpes.plot(title="Sharpe Ratio Vs. Holding Period (No Slippage)")
fig.set_xlabel('Holding period (minutes)')
fig.set_ylabel('Sharpe Ratio')
Out[63]:
<matplotlib.text.Text at 0x7f233faa1990>
In [49]:
def adjust_for_slippage(returns, turnover, basis_points=20):
    """
    Adjusts a return series for slippage based on the
    portfolio turnover rate and an assumed number 
    of basis points of slippage.
    
    Parameters
    ----------
    returns : pd.Series
        backtest daily returns series
    turnover : pd.Series or float
        strategy turnover rate as a series or value.
    basis_points : int/float
        slippage assumption of average market impact.
        
    Returns
    -------
    Series
        slippage adjusted returns
    """
    slippage = 0.0001 * basis_points
    return returns - slippage * turnover

slippages = np.linspace(0, 60)  # 0 to 60 basis points

adjusted_sharpes = pd.DataFrame({
    s: adjust_for_slippage(results, turnover, s).apply(sharpe_ratio)
    for s in slippages
})
In [50]:
fig = adjusted_sharpes.T.plot(title="Sharpe Ratios with Different Slippage Assumptions")
fig.set_xlabel('Basis Points of Slippage')
fig.set_ylabel('Sharpe Ratio')
Out[50]:
<matplotlib.text.Text at 0x7f233dce6190>

Backtest Analysis

This backtest rebalanced on every bar with no slippage or commissions paid. By not including transaction costs in the backtest, you can do the sensitivity analysis in research afterward.

In [22]:
backtest = get_backtest("55c20ef679265f0c6713ddd8")
100% Time: 0:15:15|###########################################################|
In [51]:
def get_backtest_turnover(transactions, portfolio_values):
    """
    Returns the daily portfolio turnover defined 
    as the average value of buys & sells over the
    ending portfolio value.
    
    Parameters
    ----------
    transactions : pd.DataFrame
        The BacktestResult.transactions attribute
    portfolio_values : pd.Series
        end of day portfolio values for the backtest

    Returns
    -------
    Series
        daily portfolio turnover
    """
    prices = transactions.price
    amounts = transactions.amount
    sizes = prices * amounts.abs()
    daily_value = sizes.groupby(sizes.index).sum()
    # Divide by 2 because it's the avg of buys and sells
    turnover = daily_value / portfolio_values / 2.
    return turnover

turnover = get_backtest_turnover(
    backtest.transactions, 
    backtest.daily_performance.ending_portfolio_value
)
returns = backtest.daily_performance.returns
In [54]:
turnover.plot()
print "Average Daily Turnover is {} times the portfolio value!".format(round(turnover.mean()))
Average Daily Turnover is 192.0 times the portfolio value!
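
A back-of-the-envelope calculation with the slippage model from earlier shows why that much turnover is fatal: even a few basis points of average market impact eats a large fraction of the portfolio every day.

# e.g. ~5 bps of slippage on ~192x daily turnover is roughly a 9.6%/day drag
print adjust_for_slippage(0.0, turnover.mean(), basis_points=5)  # ~ -0.096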
In [55]:
with_slippage = pd.DataFrame({
    s: adjust_for_slippage(returns, turnover, s) 
    for s in slippages
})
In [56]:
fig = with_slippage.apply(sharpe_ratio).plot(title='Backtest Sharpe with Slippage')
fig.set_xlabel('Basis Points of Slippage')
fig.set_ylabel('Sharpe Ratio')
Out[56]:
<matplotlib.text.Text at 0x7f2353f8a8d0>

The Sharpe ratio turns back up at the far right only because volatility gets low once you have almost no money left.

Commission Cost Sensitivity

Commission costs can be lumped in with slippage, but it is still useful to see how much you can pay before the costs outweigh your PNL.

In [57]:
def get_transaction_commission(txns, per_share=0.005):
    """Returns the commission paid per timestamp, assuming a flat
    per-share rate applied to the absolute number of shares traded."""
    amounts = txns.amount.abs().groupby(txns.amount.index).sum()
    return amounts * per_share

commissions = pd.DataFrame({
    i: get_transaction_commission(backtest.transactions, i)
    for i in np.linspace(0, 0.001)
})
In [58]:
adj_pnl = backtest.daily_performance.pnl - commissions
In [61]:
fig = adj_pnl.mean().plot(title="Average Daily PNL Vs. Commission/Share Paid")
fig.set_xlabel('Commission Per Share ($)')
fig.set_ylabel('Average Daily PNL')
Out[61]:
<matplotlib.text.Text at 0x7f238e4f9410>

As you can see, HFT algorithms are not feasible for retail traders who have to pay significant per-share commissions to their brokerage.

Conclusion

Quantopian's research environment opens up new possibilities for analyzing trading strategies. We can easily explore and visualize data, as well as run sensitivity analyses to determine a strategy's feasibility.

I hope you've found this informative and that it helps you analyze your own strategies intelligently.

David Edwards
