Labeling Data for Financial Machine Learning

In [2]:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
In [3]:
start_date = pd.Timestamp('2017-06-13')  # pd.datetime is deprecated in newer pandas
end_date   = pd.Timestamp('2018-07-13')
prices = get_pricing('SPY', start_date, end_date, frequency='daily').tz_convert('US/Eastern')
prices['close_price'].plot()
plt.title('Daily Price of SPY')
plt.xlabel('Date')
plt.ylabel('Price');
In [4]:
fig, ax = plt.subplots()

ax.plot(prices['close_price'], 'b-')
plt.title('Representation of Triple Barrier Method')
plt.xlabel('Date')
plt.ylabel('Price')

plt.plot([736660, 736660], [247, 270], 'r-')  # Initiation of trade
plt.plot([736660, 736800], [247, 247], 'r--') # Lower Barrier, here $247
plt.plot([736660, 736800], [270, 270], 'r--') # Upper Barrier, here $270
plt.plot([736800, 736800], [247, 270], 'r-') # Vertical Barrier, here 140 days after initiation
plt.show()

Here, the upper barrier at $270 is hit in early January, so the profit-taking order fills first. A long trade initiated on the November start date would therefore receive a positive label.
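
The labeling rule pictured above can be sketched for a single trade in plain Python. This is a simplified illustration, not the notebook's implementation: the function name `first_barrier_label` and the price paths are hypothetical, and a path that reaches the vertical barrier is labeled 0 here rather than scaled.

```python
def first_barrier_label(path, upper, lower):
    """Label one trade from the closes that follow its entry.

    path  : closes for the bars after entry, ending at the vertical barrier
    upper : profit-taking price
    lower : stop-loss price
    Returns +1 or -1 for whichever horizontal barrier is touched first,
    or 0 if the path ends without touching either.
    """
    for price in path:
        if price >= upper:
            return 1
        if price <= lower:
            return -1
    return 0


# Hypothetical closes, not actual SPY data
rising = [258.0, 262.5, 266.0, 271.2, 268.0]
falling = [258.0, 251.0, 246.5, 272.0]
flat = [258.0, 260.0, 255.0]

print(first_barrier_label(rising, upper=270, lower=247))   # 1
print(first_barrier_label(falling, upper=270, lower=247))  # -1
print(first_barrier_label(flat, upper=270, lower=247))     # 0
```

Only the first touch matters: the `falling` path later closes above $270, but the stop loss has already been hit by then.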

To calculate the width of our horizontal barriers, we use the approximate volatility of the securities we are observing and labeling. The measure we use is an exponentially weighted moving standard deviation of returns (the default is a span of 20 days, passed as `ewm(span=lookback)` in the code further down). This gives a reasonably long-term view of a security's volatility, sufficient for setting limit orders on short-term transactions. The specific values depend on risk tolerance. The equation for the exponentially weighted moving average that underlies this measure is below, as described in the pandas documentation for `ewm`:

$$y_t = \frac{\sum_{i=0}^t w_i x_{t-i}}{\sum_{i=0}^t w_i}$$

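Here $x_t$ is the return on day $t$ and $w_i$ is the weight on the observation $i$ days back. With pandas' default `adjust=True`, $w_i = (1-\alpha)^i$ and $\alpha = 2/(\text{span}+1)$. This is the exponentially weighted moving average of returns. The exponentially weighted moving standard deviation, computed by `.ewm(...).std()`, applies the same weights to squared deviations from that average, and it serves as the estimated volatility of our stock on a given day.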
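
To make the weighting concrete, here is a minimal plain-Python sketch of the weighted average. It assumes the weights $w_i = (1-\alpha)^i$ with $\alpha = 2/(\text{span}+1)$, which is what pandas uses for `ewm(span=..., adjust=True)`. The function name `ewm_mean` and the sample returns are hypothetical, and the sketch covers only the mean, not the standard deviation.

```python
def ewm_mean(values, span):
    """Exponentially weighted mean y_t for every t, using w_i = (1 - alpha)**i."""
    alpha = 2.0 / (span + 1.0)
    out = []
    for t in range(len(values)):
        weights = [(1.0 - alpha) ** i for i in range(t + 1)]
        # w_0 multiplies the newest observation x_t, w_t the oldest x_0
        weighted_sum = sum(w * values[t - i] for i, w in enumerate(weights))
        out.append(weighted_sum / sum(weights))
    return out


returns = [0.01, -0.02, 0.015, 0.005]  # hypothetical daily returns
print(ewm_mean(returns, span=3))
```

Under the assumption above, the result should agree with `pd.Series(returns).ewm(span=3).mean()`. A larger span lowers $\alpha$, so older returns keep more weight and the estimate reacts more slowly.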

In [5]:
def get_daily_vol(securities, start_date, end_date, lookback=20):
    '''
    Returns the daily volatility of all requested securities, indexed by close date
        -Used to calculate daily stop loss/profit taking thresholds
        -Calculated as the exponentially weighted moving standard deviation (EWM) of returns with a span of lookback days
        -Output will begin 3 trading days after start_date

    Parameters
    ----------
    securities : list
        -List of symbol objects for pricing data to be derived from
        -Example: '[Equity(5061 [MSFT]), Equity(24 [AAPL]), Equity(49311 [AQMS])]'
                
    start_date : str or datetime
        -Starting date for pricing data to be collected from
        -First value for volatility will be 3 trading days after start_date
        
    end_date : str or datetime
        -Ending date for pricing data to be collected from
        
    lookback : float, optional, default 20 bars
        -The span, in trading days, passed to the EWM function to calculate the given day's volatility.
        
    Output
    ----------
    daily_vol : pd.DataFrame
        -Daily volatility of a universe of stocks (columns) over some series of trading days (indices)
        -The leading NaN row is dropped before returning
        -Example:
                
                                     MSFT        AAPL    
        2016-11-02 00:00:00+00:00    0.011824    0.012861
        2016-11-03 00:00:00+00:00    0.011807    0.012919
        2016-11-04 00:00:00+00:00    0.011844    0.012953
        2016-11-07 00:00:00+00:00    0.012097    0.013016
    '''
    # Gets pricing data for all securities over timespan
    close_prices = get_pricing(securities, start_date, end_date, frequency='daily', fields='close_price')
    
    # Note: close_prices.pct_change() would give strict one-bar returns instead
    
    # Sets daily_vol as a Series mapping each date to the last bar more than one calendar day earlier
    # (for daily bars stamped at midnight, that is two bars back whenever the previous calendar day also traded)
    daily_vol = close_prices.index.searchsorted(close_prices.index - pd.Timedelta(days=1))
    daily_vol = daily_vol[daily_vol>0]
    daily_vol = pd.Series(data=close_prices.index[daily_vol - 1], 
                          index=close_prices.index[close_prices.shape[0]-daily_vol.shape[0]:])
    # Uses that mapping to calculate returns (requires unique index values)
    try:
        daily_vol = (close_prices.loc[daily_vol.index] / 
            close_prices.loc[pd.to_datetime(daily_vol.values, utc=True)].values - 1) # returns vs. the mapped earlier bar
    except Exception as e:
        print('error: {}: {}\nplease confirm no duplicate indices'.format(type(e).__name__, e))
        raise  # daily_vol does not hold returns at this point, so do not continue
        
    # Calculates the exponentially weighted moving standard deviation of returns
    daily_vol = daily_vol.ewm(span=lookback).std()
    
    # Skips the first value (always NaN) and returns
    return daily_vol.iloc[1:]
In [6]:
plt.plot(get_daily_vol('SPY', start_date, end_date));
In [7]:
plt.plot(prices['close_price']);

Below is the code for the triple-barrier method (docstring including explanations and an example).

In [92]:
def TripleBarrierMethod(securities, start_date, end_date, upper_lower_multipliers, t_final=10):
    '''
    Simulates path-dependent stop loss / profit taking behavior for given securities
    
    Note: Barriers are set from CLOSE PRICES and tested against each bar's high, low and close;
    for low-frequency data this only loosely maintains the intuition behind stop/limit orders
    Constructs up to three barriers (any can be disabled):
        -Upper horizontal barrier: the price of a profit taking order
        -Lower horizontal barrier: the price of a stop loss order
        -Rightward vertical barrier: the time where the instrument is sold regardless of performance
    Treats each security on each bar independently
    Output is in range [-1, 1] for each date/security, assuming the security was bought at that date's close:
        -Stop loss hit within t_final days: -1
        -Profit taking order hit within t_final days: +1
        -Vertical barrier hit first: in range (-1, 1) depending on how close it finished to each horizontal barrier
        -No barrier hit before the data runs out: NaN
    Output starts where the volatility estimate starts (3 trading days after start_date)
    
    Parameters
    ----------
    securities : list
        -List of symbol objects for pricing data to be derived from
        -Example: '[Equity(5061 [MSFT]), Equity(24 [AAPL])]'
                
    start_date : str
        -Starting date for pricing data to be collected from
        -First value for volatility will be 3 trading days after start_date
        
    end_date : str
        -Ending date for pricing data to be collected from
    
    upper_lower_multipliers : list of two non-negative floats
        upper_lower_multipliers[0]: Factor multiplied by daily_vol to set width of upper barrier
        upper_lower_multipliers[1]: Factor multiplied by daily_vol to set width of lower barrier
        0 to disable either barrier
        
    t_final : integer
        The static number of days from each purchase to set the vertical barrier. 0 to disable, 10 by default.
        
    Output
    ----------
        out : DataFrame (index = dates, values = [-1, 1] for each security)
            For the given start-date (index), did each security (column header) tested either
                -Hit the top barrier first? Set value to 1
                -Hit the bottom barrier first? Set value to -1
                -Hit the vertical barrier first? Set value to weighted return over timespan
                    (As end price approaches a barrier, approaches 1 or -1)
                -Hits no barrier before the data runs out? NaN
            -Example:
            
                                         MSFT        AAPL
        2016-11-02 00:00:00+00:00          -1           1
        2016-11-03 00:00:00+00:00           1           1
        2016-11-04 00:00:00+00:00    0.401802           1
        2016-11-07 00:00:00+00:00           1         NaN
                
    '''
    # Daily EWMSTD volatility for date range (not including first 3 trading days)
    # Default is 20 day EWM
    daily_vol = get_daily_vol(securities, start_date, end_date)
    
    # Daily prices for securities on time-span (uses vol's start date, data before vol available useless)
    prices = get_pricing(securities, daily_vol.index[0], pd.to_datetime(end_date, utc=True), frequency='daily', fields = ['high', 'low', 'close_price'])
    close_prices = prices.loc['close_price']
    highs = prices.loc['high']
    lows = prices.loc['low']
    # Output uses vol's index for same reason as above
    out = pd.DataFrame(index = daily_vol.index)
    
    # Iterates over daily_vol for all dates (even those less than t_final days before end_date)
    # That becomes important later
    for day, vol in daily_vol.iterrows():
        # 1-based position of the current day within daily_vol's index
        days_passed = len(daily_vol.loc[daily_vol.index[0] : day])
        
        # Sets vertical barrier to the bar right after the t_final bars scanned below for horizontal barrier hits,
        # only if enough days remain in dataset. Otherwise, set it to NaN
        if (days_passed + t_final < len(daily_vol.index) and t_final != 0):
            vert_barrier = daily_vol.index[days_passed + t_final]
        else:
            vert_barrier = np.nan
            
        # If the top barrier is enabled, set it to the day's close price plus that price times the predicted vol times the multiplier
        # Otherwise, set it to NaN. Applies to all securities on this day (uses each security's vol)
        if upper_lower_multipliers[0] > 0:
            top_barrier = close_prices.loc[day] + close_prices.loc[day] * upper_lower_multipliers[0] * vol
        else:
            top_barrier = pd.Series(np.nan, index=close_prices.columns) # NaNs, one per security
            
        # Same for bottom barrier
        if upper_lower_multipliers[1] > 0:
            bot_barrier = close_prices.loc[day] - close_prices.loc[day] * upper_lower_multipliers[1] * vol
        else:
            bot_barrier = pd.Series(np.nan, index=close_prices.columns) # NaNs, one per security

        # Iterate over all securities
        for security in close_prices.columns:
            # Default break date for security is the vertical barrier (even if NaN)
            breakthrough_date = vert_barrier
            
            # For t_final days after current date (or remaining days in time_frame, whichever ends first)
            for future_date in daily_vol.index[days_passed : min(days_passed + t_final, len(daily_vol.index))]:
                
                # If other barriers were broken on current date, set breakthrough_date to it
                if ((highs.loc[future_date][security] >= top_barrier[security] or 
                     close_prices.loc[future_date][security] >= top_barrier[security] and
                     top_barrier[security] != 0)):
                    out.at[day, security] = 1
                    breakthrough_date = future_date
                    break
                elif (lows.loc[future_date][security] <= bot_barrier[security] or
                      close_prices.loc[future_date][security] <= bot_barrier[security] and 
                      bot_barrier[security] != 0):
                    out.at[day, security] = -1
                    breakthrough_date = future_date
                    break
            
            if (breakthrough_date == vert_barrier):
                # Initial and final prices for security on timeframe (purchase, breakthrough)
                price_initial = close_prices.loc[day][security]
                price_final   = close_prices.loc[breakthrough_date][security]
                
                if price_final > top_barrier[security]:
                    out.at[day, security] = 1
                elif price_final < bot_barrier[security]:
                    out.at[day, security] = -1
                else:
                    out.at[day, security] = max([(price_final - price_initial) / (top_barrier[security] - price_initial),
                                             (price_final - price_initial) / (price_initial - bot_barrier[security])], key=abs)
    
    # Purge last value; a trade can never be initiated here (trades made at close)
    return out[:-1]

Sample output with both horizontal barriers set two daily volatilities away from the entry price (be sure to pass the input securities as a list):

In [93]:
out = TripleBarrierMethod(['SPY'], start_date, end_date, [2, 2], t_final=10)
plt.plot(out, 'bo');

Output showing label values for trades initiated on the given date. We see lots of positive labels early in our dataset, which makes sense given the bullish conditions of the time. In 2018 the labels are more mixed.
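
Labels strictly between -1 and 1 come from trades that reach the vertical barrier. The sketch below mirrors the scaling expression used in `TripleBarrierMethod` for that case, assuming both horizontal barriers are enabled. The function name `vertical_barrier_label` and the prices are hypothetical.

```python
def vertical_barrier_label(price_initial, price_final, top, bottom):
    """Scaled label for a trade that is closed at the vertical barrier."""
    if price_final > top:
        return 1.0
    if price_final < bottom:
        return -1.0
    move = price_final - price_initial
    # Express the move as a fraction of the distance to each barrier and
    # keep whichever fraction is larger in absolute value
    return max(move / (top - price_initial),
               move / (price_initial - bottom),
               key=abs)


# Hypothetical entry at 100 with symmetric barriers at 104 and 96
print(vertical_barrier_label(100.0, 102.0, top=104.0, bottom=96.0))  # 0.5
print(vertical_barrier_label(100.0, 97.0, top=104.0, bottom=96.0))   # -0.75
```

With symmetric barriers the two fractions coincide, so the label is the signed share of the barrier distance that the price covered by the exit date.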

In [13]:
close = prices['close_price']
start = close.index[143].to_pydatetime()
end   = close.index[143 + 10].to_pydatetime()
vol   = get_daily_vol('SPY', start_date, end_date).loc[start]
upper_barrier = close.loc[start] + close.loc[start] * 2 * vol
lower_barrier = close.loc[start] - close.loc[start] * 2 * vol
plt.plot(prices['close_price'][125:175]);

plt.plot([start, end], [upper_barrier, upper_barrier], 'r--');
plt.plot([start, end], [lower_barrier, lower_barrier], 'r--');
plt.plot([start, start], [lower_barrier, upper_barrier], 'r-');
plt.plot([end, end], [lower_barrier, upper_barrier], 'r-');

Example output of the Triple-Barrier Method visualized for a specific day. This should give +1 as the label for a trade initiated on SPY on that date.

In [11]:
print(out.loc[start][0])
1.0

And it does!

So the method works, though more rigorous checking would be necessary before using it in a trading system. What are some problems with its current implementation?

-Frequency: This notebook uses a daily frequency, though the code will work just as well for minute bars. Even minute bars will not provide enough data to learn from, at least in comparison to traders who make use of higher frequencies.

-Data type: We would also prefer to use a data type that exhibits better statistical properties, as discussed in previous posts. This would still make use of daily volatility, but would require the Triple-Barrier Method function to accept pricing data of other bar types, rather than calling the get_pricing function.

-Non-independent labels: Yep, it's IID/stationarity again. Our labels exhibit high degrees of autocorrelation for fairly obvious reasons, primarily because they depend on overlapping price paths. An upcoming notebook will show how to manage labels to ensure that they allow our algorithm to draw meaningful conclusions.
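
To see how much the price paths overlap, the sketch below counts, for each bar, how many labels depend on it. It is an illustration only, not the method of the upcoming notebook. The function name `concurrent_labels` is hypothetical, and it assumes every label runs its full horizon, ignoring early horizontal-barrier hits.

```python
def concurrent_labels(n_bars, horizon):
    """For each bar, count how many label windows include it.

    A label started on bar s is assumed to depend on bars s+1 .. s+horizon
    (clipped to the end of the data), as with a fixed vertical barrier.
    """
    counts = [0] * n_bars
    for start in range(n_bars):
        last = min(start + horizon, n_bars - 1)
        for bar in range(start + 1, last + 1):
            counts[bar] += 1
    return counts


print(concurrent_labels(n_bars=8, horizon=3))  # [0, 1, 2, 3, 3, 3, 3, 3]
```

Once the series is under way, every bar feeds `horizon` different labels, so neighboring labels share most of their inputs and cannot be treated as independent samples.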

In the meantime, be wary of using the labels directly output from this method to train an algorithm!

Happy coding, and thanks again to Dr. Lopez de Prado for his contribution to the fields of machine learning and asset management.