Notebook
In [1]:
from scipy.stats import norm, t, jarque_bera
from statsmodels.tsa.stattools import adfuller as adf
from statsmodels.stats.diagnostic import het_breushpagan as bp # Breusch-Pagan test (spelled het_breuschpagan in newer statsmodels)
from statsmodels.tsa.stattools import acf
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

1-1. Analysis of Normal Data

Below is an introduction to the language and tools for analysis of normal data that will be used in the rest of this notebook. Our normality test is the Jarque-Bera (JB) test, and we will be using the size of its test statistic, at a fixed sample size, as a rough measure of how close to normal the distribution of our data is. The null hypothesis of the JB test is that the data come from a normal distribution. In frequentist statistics, we can never accept the null. If the p-value is zero, a smaller JB test statistic indicates a greater degree of normality, even though the data remain decidedly non-normal. If the p-value is non-zero, and fluctuates along with the test statistic as the test is run repeatedly, then the test has failed to reject the null hypothesis. This is pretty good evidence, though by definition not proof, that our data are at least quite close to normal.

In [2]:
y = norm(loc=0., scale=1.0) 
x = y.rvs(size=200000)
ax = sns.distplot(x, fit=norm, kde=False) # Compare to 200,000 samples from that distribution
plt.title('Normal Distribution vs. Normal Distribution Samples')
plt.xlabel('Value'); plt.ylabel('Frequency');

print('Jarque-Bera Test Results', jarque_bera(x)) # Should be random (null hypothesis true)
('Jarque-Bera Test Results', (0.63782690716009427, 0.72693846050521216))

When truly normal data are tested, the test statistic (first return value) follows a chi-squared distribution with two degrees of freedom, and the p-value (second return value) is roughly uniform on [0, 1]. (They simply look random here; running that cell dozens of times would trace out those distributions.) This is to be expected, as the null hypothesis can never be confirmed via a frequentist test.
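
To make this concrete, here is a minimal sketch (not part of the original analysis) that reruns the test on fresh normal samples; since the null hypothesis is true, the statistics and p-values should bounce around rather than settle on fixed values.

# Sketch: repeated JB tests on fresh normal samples. The statistics stay small
# and the p-values scatter across [0, 1] because the null hypothesis is true.
for _ in range(5):
    print('Jarque-Bera on a fresh normal sample', jarque_bera(norm.rvs(size=10000)))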

In [3]:
x = t.rvs(size=200000, df=10) # Sample from a t-distribution with 10 degrees of freedom
ax = sns.distplot(x, fit=norm, kde=False)
plt.title('Normal Distribution vs. T Distribution Samples'); plt.xlabel('Value'); plt.ylabel('Frequency');

print('Jarque-Bera Test Results', jarque_bera(x)) 
# Should have high test statistic and low p (null hypothesis very much false)
# t distributions look nearly normal, but have fatter tails. This test picks up on it.
('Jarque-Bera Test Results', (6545.4903472896785, 0.0))

Here, the test statistic is massive and the p-value is 0. This is indicative of highly non-normal data combined with a high sample size. Greater sample sizes -> greater test statistics. When comparing test statistics across datasets, it is vital to keep sample sizes consistent. This also shows that our test is robust where the human eye may not be; at a first glance that t-distribution appears pretty normal, but the test shows otherwise. Similar to the Jarque-Bera test for normality, we will be using the Augmented Dickey-Fuller test for stationarity, and the Breusch-Pagan test for heteroskedasticity.

Stationarity is the property described in the introduction. A time series is stationary if its probability distribution does not change over time, which in particular means that its mean and variance are constant. This is an assumption in much of statistical analysis, including the vast majority of machine learning models.
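
As a minimal sketch (not part of the original analysis) of how the ADF test behaves, compare white noise, which is stationary, with a random walk, which has a unit root:

# Sketch: the ADF test strongly rejects a unit root for white noise (stationary),
# but fails to reject it for a random walk (non-stationary).
noise = np.random.normal(size=5000)
random_walk = np.cumsum(noise)
print('White noise ADF (statistic, p-value)', adf(noise, maxlag=1)[0:2])
print('Random walk ADF (statistic, p-value)', adf(random_walk, maxlag=1)[0:2])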

Heteroskedasticity is a harder property to pin down. It describes a situation where the variance of one variable changes with respect to another variable. A concrete illustration is the distribution of annual incomes across ages: most 17 year olds are minimum wage workers, with perhaps a few outliers, so their incomes cluster around one value, while 50 year olds range from minimum wage workers to the highest paid CEOs in the world. The former group has low variance, the latter much greater, so income is heteroskedastic with respect to age. In a financial context we would aspire to work with homoskedastic data, as constant variance would provide further evidence for stationarity.
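
The Breusch-Pagan test imported above formalizes this. Here is a minimal sketch (synthetic data, not part of the original analysis) in the spirit of the income-versus-age example; note that statsmodels' Breusch-Pagan expects the explanatory matrix to include a constant column.

# Sketch: residuals whose spread grows with an age-like variable are flagged by Breusch-Pagan.
age = np.random.uniform(17, 50, size=5000)
income_resid = np.random.normal(scale=age)           # spread grows with age -> heteroskedastic
exog = np.column_stack([np.ones_like(age), age])     # constant column plus the explanatory variable
print('Heteroskedastic case (LM statistic, p-value)', bp(income_resid, exog)[0:2])

homo_resid = np.random.normal(scale=1.0, size=5000)  # constant spread -> homoskedastic
print('Homoskedastic case (LM statistic, p-value)', bp(homo_resid, exog)[0:2])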

Finally, autocorrelation (or serial correlation) is the correlation of a signal with a lagged version of itself. Pricing data are highly autocorrelated (today's price is closely related to yesterday's), whereas returns are typically much less so. Strong autocorrelation violates the independence assumptions behind most machine learning models, making learning on raw time series very difficult, so minimizing it will improve the power of our resulting machine learning algorithm.
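
The acf function imported above measures this directly. A minimal sketch (synthetic data, not part of the original analysis):

# Sketch: a random-walk 'price' series is almost perfectly autocorrelated at lag 1,
# while its first differences ('returns') are not.
fake_prices = 100 + np.cumsum(np.random.normal(size=5000))
fake_returns = np.diff(fake_prices)
print('Lag-1 autocorrelation of prices ', acf(fake_prices, nlags=1)[1])
print('Lag-1 autocorrelation of returns', acf(fake_returns, nlags=1)[1])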

2: Financial Data Structures

The chief theory of this notebook is that different representations of pricing data exhibit different statistical properties. This is non-obvious, as there is only one true historical path the price has taken, but consider that each bar (observation of prices) can vary wildly depending on how you define it. Minutely or daily bars can vary quite a bit in information contained; some minutes/days have very little price movement or shares traded, while some have wild swings in price and high volume. Different amounts of information are contained in each bar, so we would expect that statistical properties would vary quite a bit between bars. One example of this problem is that the majority of a day's trading activity occurs within a few minutes of the open and close. Minute bars are too low frequency here, and fail to show all of the detail, while being too high frequency around noon when far fewer trades occur.

An alternative is volume bars, indexed instead by the number of shares traded. This solves our problem of varying frequency (more bars are generated when there is more trading activity), but it comes with another problem. If 1,000,000 shares of a stock worth \$0.33 are bought, not much money is actually changing hands; if that same stock rises or falls sharply in price, the same 1,000,000 shares can represent far more interest in the stock. The volume bar model fails to take this into account, leading to inconsistent statistical properties as the stock's price changes. A solution to this problem? Dollar bars, indexed by the amount of value traded (number of shares traded * price per share). This should, theoretically, capture all pertinent information about price and trading activity in a consistent manner across time.

There are other interesting bar types, particularly those described in Advances in Financial Machine Learning. Information driven bars are designed to "sample more frequently when new information arrives to the market." These are outside of the scope of this notebook, but largely focus on creating a bar when price/volume/value changes diverge from the expectation in a bar. This helps solve the problem that all of our bar types are susceptible to creating many observations for slow, sideways trading (though volume/dollar bars at least attenuate this a bit).

We expect the more complex bar types to better exhibit the properties described above (normality, stationarity, homoskedasticity, low autocorrelation). Perfect normality, stationarity, etc. are not expected; however, the closer our dataset comes to these ideals, the stronger our resulting machine learning algorithm will be.

3: Implementation and Analysis

Now let us use our analysis of normal data (with added stationarity testing) on the aforementioned data structures. Before we can begin, however, we need to construct our bars. Ideally we would use tick data (the base data, with one observation per transaction), but Quantopian's minutely pricing information should provide a good base to build on. Let's get into it.

3-1. Time Bars

Bars indexed by time intervals, minutely, daily, etc. OHLCV (Open, High, Low, Close, Volume) is standard. Taken straight from Quantopian.

In [4]:
securities = symbols('SPY') #Select the S&P500 index for our analysis
start_date = '2017-06-13'   #13 months of minutely pricing data
end_date   = '2018-07-13'
frequency  = 'minute'
In [5]:
prices = get_pricing(securities, start_date, end_date, frequency=frequency, fields=['open_price', 'high', 'low', 'close_price', 'volume']).tz_convert('US/Eastern')
prices['close_price'].plot()
plt.title('Minutely Price of SPY'); plt.xlabel('Date'); plt.ylabel('Price');
In [6]:
returns = prices['close_price'].pct_change().dropna() # Minutely returns
returns.plot()
plt.ylim(-0.021, 0.021) # Standardized so we can compare to later graphs
plt.title('Minutely Returns of SPY'); plt.xlabel('Date'); plt.ylabel('Returns');

Some notable behaviour here, particularly that the returns appear more or less centered at 0 with fairly consistent variance. At the start of 2018 this changed a bit, with far greater volatility exhibited in returns. Stationarity implies that our observations have the same statistical properties (mean and variance) over time, so we have some problems there.

In [7]:
# Histogram + distribution of returns
ax = sns.distplot(returns, fit=norm, kde=False)
plt.title('Distribution of Minutely SPY Returns')
plt.xlabel('Minutely Return')
plt.ylabel('Instances');

print('Sample size', len(returns))
print('Mean: ', returns.mean())
print('Standard Deviation: ', returns.std())
print('Jarque-Bera Test Results', jarque_bera(returns))
print('Augmented Dickey-Fuller Test Results', adf(returns, maxlag=1)[0:2])
('Sample size', 106319)
('Mean: ', 1.570338787545689e-06)
('Standard Deviation: ', 0.00037937722708803845)
('Jarque-Bera Test Results', (230441038.96251607, 0.0))
('Augmented Dickey-Fuller Test Results', (-237.09499091972043, 0.0))

The appearance of outliers makes the results visually confusing, but the extreme JB statistic implies our data are highly non-normal (and the ADF statistic firmly rejects a unit root). Note that the very high sample size here inflates the JB test statistic; when using test statistics to compare relative closeness to the null hypothesis, the sample sizes must be the same. This will become important later, as we compare these results to those for other bars' returns.
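
As a quick illustration of that sample-size effect (a sketch, not part of the original analysis), the JB statistic for samples from the same non-normal distribution grows roughly in proportion to the number of samples:

# Sketch: the same t(10) distribution yields a much larger JB statistic as the sample grows.
for n in (2000, 20000, 200000):
    print('n =', n, 'Jarque-Bera (statistic, p-value)', jarque_bera(t.rvs(size=n, df=10)))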

3-2. Tick Bars

Ticks refer to individual transactions on the market. Each tick involves some number of shares changing hands at a certain time and price. Bid and ask prices of the paired orders are also included, as well as the exchange the trade occurred on. This is the highest frequency dataset possible, and is useful for constructing other bar types as well as for high-frequency analysis. Unfortunately, this dataset is also very rich, often expensive, and frequently rife with errors, so finding a good source is vital. We do not provide this data on the platform; however, there is a strong analysis of different bar types from Lopez de Prado's book, including a link to a free short-term sample of tick data. We can use Quantopian's minutely data to closely approximate other bar types over longer periods of time, with some tricks.

3-3. Volume Bars

Bars indexed by total volume, with each set # of shares traded creating a distinct bar. We can transform minute bars into an approximation for volume bars, but ideally we would use tick bars to maintain information for all parameters across bars. Let's set a target bar volume equal to the maximum minutely volume in our dataset (otherwise we'd have to split minute bars, which we can't). Then, we construct our bars in such a way as to minimize the distance from this target:

  1. Combine minute bars while maintaining the time of the first and final minutes, adding volume of bars together, and keeping track of the high and low over the interval.
  2. At each minute, check if the difference from the target volume would be minimized by adding or excluding the next minute. Select the optimal option.

This is meant to prevent the scenario where your target volume is 100K shares, and 2 bars of 99.9K shares follow each other. Adding the bars would result in essentially twice our target, while this strategy would result in 2 volume bars much closer to the target. This regularizes the volume of our bars.

In [8]:
# Since our base data are not ticks, we cannot infer our variables [OHLCV] if a bar is split
# So the target volume to set a bar is set to the maximum value on our interval,
# Each volume bar consists of at least 1 full time bar. This causes some volume inequity between bars, best we can do
tgt_volume  = prices['volume'].max()

# Indices in time bars where volume bars should end
bar_ends  = []
bar_volume = 0
skip = False # Ugly variable, used to skip an index in the loop if the 2nd volume is used

# Iterates over each pair of neighboring minute bars' volumes, selects the option that minimizes difference from target
for index, (volume1, volume2) in enumerate(zip(prices['volume'], prices['volume'][1:])):
    # Skips this iteration if the index was used already
    if (skip == True):
        skip = False
        continue
    
    # If the target is surpassed in our pair, selects the difference minimizing index
    if (bar_volume + volume1 + volume2 > tgt_volume):
        d1 = abs(tgt_volume - (bar_volume + volume1))
        d2 = abs(tgt_volume - (bar_volume + volume1 + volume2))
        
        # Includes 2nd index, sets the loop to skip the next iteration
        if (d1 > d2):
            bar_ends.append(index + 1)
            bar_volume = 0
            skip = True
        
        # Excludes the 2nd index, business as usual
        else:
            bar_ends.append(index)
            bar_volume = 0
            
    # If the pair wouldn't result in a cross of the target, just add the volume of the first index
    else:
        bar_volume += volume1

# Transfers and combines information from time bars into volume bars
volume_bars = prices.iloc[bar_ends].copy() # Copy so the assignments below modify the new frame, not a view of prices
bar_count  = 0
bar_volume = 0
min_price  = prices.iloc[0]['low']
max_price  = prices.iloc[0]['high']
open_price = prices.iloc[0]['open_price']
close_price= prices.iloc[0]['close_price']
for timestamp in prices.index:
    bar_volume += prices.loc[timestamp]['volume']
    min_price = min(min_price, prices.loc[timestamp]['low'])
    max_price = max(max_price, prices.loc[timestamp]['high'])
    
    if timestamp in volume_bars.index:
        volume_bars.loc[timestamp, 'open_price'] = open_price
        volume_bars.loc[timestamp, 'high']       = max_price
        volume_bars.loc[timestamp, 'low']        = min_price
        close_price = prices.loc[timestamp]['close_price']
        volume_bars.loc[timestamp, 'close_price'] = close_price
        volume_bars.loc[timestamp, 'volume']      = bar_volume
        
        bar_volume  = 0
        bar_count  += 1
        # The next bar opens at the first minute after this one closes,
        # so reset the open price and the running high/low from that minute
        next_idx = prices.index.get_loc(timestamp) + 1
        if bar_count < len(volume_bars) and next_idx < len(prices):
            open_price = prices.iloc[next_idx]['open_price']
            max_price  = prices.iloc[next_idx]['high']
            min_price  = prices.iloc[next_idx]['low']

Below is an implementation from Michael Matthews in this forum post, which runs much more quickly (and more elegantly!) than the above, at the cost of how tightly each bar's volume tracks the target.

In [9]:
def volume_bars_vectorized(ohlcv, volume_threshold):
    """Create 'volume-base' activity ohlc bars using pandas and numpy to 
    make computations more efficient.
    
    Parameters
    ----------
    ohlcv: pd.DataFrame
        columns = open_price, high, low, close_price, volume,  index = datetime
    volume_threshold: int
        Number of shares traded per bar
        
    Returns
    -------
    pd.DataFrame
        DataFrame containing OHLCV data. Indexed by datetime at end of bar.
    """
    cum_vol = ohlcv['volume'].cumsum()
    grouper = cum_vol // volume_threshold
    
    # This makes sure last minute bar is included in aggregation
    mask = grouper != grouper.shift(1)
    mask.iloc[0] = False
    grouper = (grouper - mask.astype(int) ).values
  
    volume_ohlcv = (ohlcv.reset_index().groupby(grouper)
                    .agg({'open_price': 'first', 'high': 'max', 
                          'low': 'min', 'close_price': 'last', 
                          'volume': 'sum', 'index': 'last'})).set_index('index')
    volume_ohlcv = volume_ohlcv[['open_price', 'high', 'low', 'close_price', 'volume']]
    volume_ohlcv.index.name=None
    return volume_ohlcv

# Mean daily volume vs. the per-bar volume target, for reference
prices.groupby(prices.index.date)['volume'].sum().mean(), tgt_volume
volume_bars_ = volume_bars_vectorized(prices, tgt_volume)
In [10]:
ax1 = plt.subplot2grid((2, 1), (0, 0), colspan=1, rowspan=1)
ax1.set_ylim(0.2e7, 1.2e7)
ax2 = plt.subplot2grid((2, 1), (1, 0), colspan=1, rowspan=1)
ax2.set_ylim(0.2e7, 1.2e7)

ax1.plot(volume_bars['volume'])
ax2.plot(volume_bars_['volume'])
Out[10]:
[<matplotlib.lines.Line2D at 0x7f3247a2bdd0>]

We can see that the volatility of volume around our target is much greater in the vectorized solution. Both implementations suffer from look-ahead bias (one fix, sketched below, would be to set the volume target as some multiple of past volume, with an exception case if a single minute's volume exceeds that target), and both accurately model price changes. One runs extremely quickly, the other more closely approaches our volume target. If you are incorporating volume bars into your research, pick whichever you prefer. We will continue with the slower solution for this analysis.
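
A minimal sketch of such a point-in-time target (an assumption of this illustration, not used in the rest of the notebook): take the prior trading day's maximum minutely volume, which is known before the current day begins.

# Sketch: a trailing, look-ahead-free volume target per day.
# The bar-building loop would still need an exception for any single minute
# whose volume exceeds that day's target.
daily_max_volume = prices['volume'].groupby(prices.index.date).max()
trailing_target  = daily_max_volume.shift(1).dropna()  # yesterday's max, known at today's open
print(trailing_target.head())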

In [11]:
plt.plot(volume_bars['close_price'])
plt.title('SPY Volume Bars')
plt.xlabel('Date')
plt.ylabel('Price');

Little information was lost in our sampling method (on long timescales at least); the price behaviour remains clear.

In [12]:
plt.plot(volume_bars['volume'])
plt.title('Volume per SPY Volume Bar')
plt.xlabel('Date')
plt.ylabel('Volume (Shares Traded)');

That's pretty spiky, but decidedly centered at our target. Theoretical worst-case volume bars should fall between 0.5x and 1.5x of our target, and deviate less if we use a higher multiple of the max minutely volume as the target. Let's see if we constructed this properly.

In [13]:
print('Target Volume: ', tgt_volume)
print('Mean Volume: ', volume_bars['volume'].mean())
print('Max Volume Multiple: ', volume_bars['volume'].max()/tgt_volume)
print('Min Volume Multiple: ', volume_bars['volume'].min()/tgt_volume)
print('# Under: ', len(volume_bars[volume_bars['volume'] < tgt_volume]))
print('# Over : ', len(volume_bars[volume_bars['volume'] > tgt_volume]))
('Target Volume: ', 6835676.0)
('Mean Volume: ', 6831270.834192038)
('Max Volume Multiple: ', 1.307864650109221)
('Min Volume Multiple: ', 0.71326771485365892)
('# Under: ', 1093)
('# Over : ', 1041)

That's pretty good! Our outliers are about 30% from our target, but the mean is spot on and almost exactly the same number of bars surpass our target as fail to meet it. Using tick data would largely eliminate this inefficiency, but this approximation seems pretty good. Let's look at returns, as this is where we want to analyze the statistical properties.

In [14]:
returns = volume_bars['close_price'].pct_change().dropna() #Returns per bar
returns.plot()
plt.ylim(-0.021, 0.021)
plt.title('Returns of SPY per Volume Bar');  plt.xlabel('Date'); plt.ylabel('Returns');

We still see the variance changing over time, but the elevated-variance stretch is spread across more bars. This may actually prove problematic in our stationarity tests, as a greater proportion of our data exhibits 'abnormal' variance. The tradeoff is that it may better describe the trends in that generally volatile period. Ah well, let's see.

In [15]:
# Histogram + distribution of returns
ax = sns.distplot(returns, fit=norm, kde=False)
plt.title('Distribution of SPY Returns per Volume Bar')
plt.xlabel('Return')
plt.ylabel('Instances');

print('Sample size: ', len(returns))
print('Mean: ', returns.mean())
print('Standard Deviation: ', returns.std())
print('Jarque-Bera Test Results', jarque_bera(returns))
print('Augmented Dickey-Fuller Test Results', adf(returns, maxlag=1)[0:2])
('Sample size: ', 2134)
('Mean: ', 7.790707640117826e-05)
('Standard Deviation: ', 0.0025277665785043892)
('Jarque-Bera Test Results', (1213.706690591207, 0.0))
('Augmented Dickey-Fuller Test Results', (-31.385993912662759, 0.0))

This is a bit more convincing, but has a significantly smaller sample size than our minute bars so we can't directly compare test statistics. Let's see how this compares to dollar bars.

3-4. Dollar Bars

Bars indexed by value exchanged (number of shares traded times the price they were traded at). Again, time bars are a poor base for this, and we must approximate the price of each trade as well. Let's use the maximum minutely dollar value traded as the target value to create a new bar.

A similar vectorized solution could also be implemented for dollar bars (a sketch follows below), but let's continue with the more constrained implementation.
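
A minimal sketch of what that might look like, mirroring Michael's volume_bars_vectorized above; dollar_bars_vectorized is a hypothetical helper, is not used in the rest of the analysis, and assumes the same (high + low) / 2 mean-price estimate used below.

def dollar_bars_vectorized(ohlcv, value_threshold):
    """Sketch: group minutes by cumulative traded value (volume * estimated
    mean price) instead of cumulative volume, then aggregate to OHLC + value."""
    est_value = ohlcv['volume'] * (ohlcv['high'] + ohlcv['low']) / 2
    grouper = est_value.cumsum() // value_threshold

    # As above, shift group boundaries so the closing minute of each bar is included
    mask = grouper != grouper.shift(1)
    mask.iloc[0] = False
    grouper = (grouper - mask.astype(int)).values

    value_ohlcv = (ohlcv.assign(value=est_value).reset_index().groupby(grouper)
                   .agg({'open_price': 'first', 'high': 'max',
                         'low': 'min', 'close_price': 'last',
                         'value': 'sum', 'index': 'last'})).set_index('index')
    value_ohlcv = value_ohlcv[['open_price', 'high', 'low', 'close_price', 'value']]
    value_ohlcv.index.name = None
    return value_ohlcv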

In [16]:
# Adds a value column (volume times an estimated mean trade price) to the pricing data
dollar_bars = prices # Note: a reference to prices, not a copy, so the value column is added to both
mean_price_est = (dollar_bars['high'] + dollar_bars['low']) / 2
dollar_bars['value'] = dollar_bars['volume'] * mean_price_est
tgt_value   = dollar_bars['value'].dropna().max()


# Same optimization method as in volume bars
bar_ends  = []
bar_value = 0
skip = False
for index, (value1, value2) in enumerate(zip(dollar_bars['value'], dollar_bars['value'][1:])):
    if (skip == True):
        bar_value = 0
        skip = False
        continue
        
    if (bar_value + value1 + value2 > tgt_value):
        d1 = abs(tgt_value - (bar_value + value1))
        d2 = abs(tgt_value - (bar_value + value1 + value2))
        if (d1 > d2):
            bar_ends.append(index + 1)
            bar_value = 0
            skip = True
        else:
            bar_ends.append(index)
            bar_value = 0    
    else:
        bar_value += value1
        
dollar_bars = prices.iloc[bar_ends].copy() # Copy so the assignments below modify the new frame, not a view of prices
bar_count  = 0
bar_value  = 0
min_price  = prices.iloc[0]['low']
max_price  = prices.iloc[0]['high']
open_price = prices.iloc[0]['open_price']
close_price= prices.iloc[0]['close_price']


for timestamp in prices.index:
    volume = prices.loc[timestamp]['volume']
    mean_price_est = (prices.loc[timestamp]['high'] + prices.loc[timestamp]['low']) / 2
    bar_value += volume * mean_price_est
    min_price = min(min_price, prices.loc[timestamp]['low'])
    max_price = max(max_price, prices.loc[timestamp]['high'])
    
    if timestamp in dollar_bars.index:
        dollar_bars.loc[timestamp, 'open_price']  = open_price
        dollar_bars.loc[timestamp, 'high']        = max_price
        dollar_bars.loc[timestamp, 'low']         = min_price
        close_price = prices.loc[timestamp]['close_price']
        dollar_bars.loc[timestamp, 'close_price'] = close_price
        dollar_bars.loc[timestamp, 'value']       = bar_value
        
        bar_value   = 0
        bar_count  += 1
        # The next bar opens at the first minute after this one closes,
        # so reset the open price and the running high/low from that minute
        next_idx = prices.index.get_loc(timestamp) + 1
        if bar_count < len(dollar_bars) and next_idx < len(prices):
            open_price = prices.iloc[next_idx]['open_price']
            max_price  = prices.iloc[next_idx]['high']
            min_price  = prices.iloc[next_idx]['low']
In [17]:
plt.plot(dollar_bars['close_price'])
plt.title('SPY Dollar Bars'); plt.xlabel('Date'); plt.ylabel('Price');

Looks good here, little long-run pricing information lost.

In [18]:
plt.plot(dollar_bars['value'])
plt.title('Dollars per SPY Value Bar'); plt.xlabel('Date'); plt.ylabel('Value (Dollars Traded)');

Pretty similar to volume bars, perhaps a bit noisier outside of that early-2018 stretch. Let's quantify that.

In [19]:
print('Target Value: ', tgt_value)
print('Mean Value: ', dollar_bars['value'].mean())
print('Max Value Multiple: ', dollar_bars['value'].max()/tgt_value)
print('Min Value Multiple: ', dollar_bars['value'].min()/tgt_value)
print('# Under: ', len(dollar_bars[dollar_bars['value'] < tgt_value]))
print('# Over : ', len(dollar_bars[dollar_bars['value'] > tgt_value]))
('Target Value: ', 1827711504.8804998)
('Mean Value: ', 1829227795.1596794)
('Max Value Multiple: ', 1.3768368614506983)
('Min Value Multiple: ', 0.62011611788212939)
('# Under: ', 1030)
('# Over : ', 1047)

Our outliers are a bit further from our target (~38%) but the mean value and ratio of under:over seem excellent, so this is probably about as good as we can get.

In [20]:
returns = dollar_bars['close_price'].pct_change().dropna() #Returns per bar
returns.plot()
plt.ylim(-0.021, 0.021)
plt.title('Returns of SPY per Dollar Bar')
plt.xlabel('Date')
plt.ylabel('Returns');

This may look choppier, but look carefully at the y-axis: within the same scale as before, the returns are decidedly more constrained, though they still exhibit strong changes in variance.

In [21]:
# Histogram + distribution of returns
ax = sns.distplot(returns, fit=norm, kde=False)
plt.title('Distribution of SPY Returns per Dollar Bar')
plt.xlabel('Return')
plt.ylabel('Instances');

print('Sample size: ', len(returns))
print('Mean: ', returns.mean())
print('Standard Deviation: ', returns.std())
print('Jarque-Bera Test Results', jarque_bera(returns))
print('Augmented Dickey-Fuller Test Results', adf(returns, maxlag=1)[0:2])
('Sample size: ', 2077)
('Mean: ', 8.012907494825048e-05)
('Standard Deviation: ', 0.002522539779033094)
('Jarque-Bera Test Results', (679.76470987012738, 0.0))
('Augmented Dickey-Fuller Test Results', (-44.605771423098481, 0.0))

The JB test statistic here is smaller than that of our volume bars, indicating more normal data. However, the sample size is also smaller so that might explain it. Let's normalize our sample sizes and compare again.

4: Comparing Results

In [22]:
tgt_size = min(min(len(volume_bars), len(dollar_bars)), len(prices)) # Smallest sample size from our above tests
scale = len(prices) // tgt_size # Integer slice step (floor division)
prices_ = prices[::scale] # Resamples from our original pricing data
prices_ = prices_[:tgt_size] # Cuts out any end values greater than our target size, accounts for the remainder
price_returns = prices_['close_price'].pct_change().dropna() # Resampled minutely returns

# Same resampling behaviour as above
volume_bars_ = volume_bars[::len(volume_bars) // tgt_size]
volume_bars_ = volume_bars_[:tgt_size]
volume_returns = volume_bars_['close_price'].pct_change().dropna()

# And again
dollar_bars_ = dollar_bars[::len(dollar_bars) // tgt_size]
dollar_bars_ = dollar_bars_[:tgt_size]
dollar_returns = dollar_bars_['close_price'].pct_change().dropna()

# Heteroskedasticity regressor: a simple time trend
# (statsmodels' Breusch-Pagan expects a constant column in this matrix;
#  without one the reported p-value comes back as nan)
time_vars = np.arange(price_returns.shape[0])[:, None]
In [23]:
print('Minutely Bars')
print('---------------------------------')
print('Sample size: ', len(price_returns))
print('Jarque-Bera (Normality) Test Results', jarque_bera(price_returns))
print('Augmented Dickey-Fuller (Stationarity) Test Results', adf(price_returns)[0:2])
print('Breusch-Pagan (Heteroskedasticity) Test Results', bp(price_returns, time_vars)[0:2])
print
Minutely Bars
---------------------------------
('Sample size: ', 2077)
('Jarque-Bera (Normality) Test Results', (6475.139915101382, 0.0))
('Augmented Dickey-Fuller (Stationarity) Test Results', (-10.057593161440458, 1.3600610409127891e-17))
('Breusch-Pagan (Heteroskedasticity) Test Results', (217.09681021284572, nan))

In [24]:
print('Volume Bars')
print('---------------------------------')
print('Sample size: ', len(volume_returns))
print('Jarque-Bera (Normality) Test Results', jarque_bera(volume_returns))
print('Augmented Dickey-Fuller (Stationarity) Test Results', adf(volume_returns)[0:2])
print('Breusch-Pagan (Heteroskedasticity) Test Results', bp(volume_returns, time_vars)[0:2])
print
Volume Bars
---------------------------------
('Sample size: ', 2077)
('Jarque-Bera (Normality) Test Results', (1222.6831177343588, 0.0))
('Augmented Dickey-Fuller (Stationarity) Test Results', (-31.017293074918889, 0.0))
('Breusch-Pagan (Heteroskedasticity) Test Results', (343.57090659313707, nan))

In [25]:
print('Dollar Bars')
print('---------------------------------')
print('Sample size: ', len(dollar_returns))
print('Jarque-Bera (Normality) Test Results', jarque_bera(dollar_returns))
print('Augmented Dickey-Fuller (Stationarity) Test Results', adf(dollar_returns)[0:2])
print('Breusch-Pagan (Heteroskedasticity) Test Results', bp(dollar_returns, time_vars)[0:2])
Dollar Bars
---------------------------------
('Sample size: ', 2077)
('Jarque-Bera (Normality) Test Results', (679.76470987012738, 0.0))
('Augmented Dickey-Fuller (Stationarity) Test Results', (-44.605771423098481, 0.0))
('Breusch-Pagan (Heteroskedasticity) Test Results', (376.25359902049365, nan))

In these frequentist tests, the p-value (0.0 or nan in the examples above) is the probability, under the test's reference distribution and assuming the null hypothesis is true, of generating a test statistic at least as extreme as the one derived from our data. We will be looking at relative test statistics to determine how extreme each result is, as opposed to the typical approach where the p-value is used as a confidence measure. In all of these the null is rejected, but the test statistic tells us approximately how close to the null hypothesis we are. Frequentist tests are not typically used in this way, but it is a common strategy in Lopez de Prado's work.

Jarque-Bera Test: A goodness-of-fit test of whether sample data have the skewness and kurtosis of a normal distribution. The null hypothesis is that our distribution is normal. At a fixed sample size, lower values of the test statistic represent weaker evidence against the null, i.e. a greater degree of normality.

Augmented Dickey-Fuller Test: A frequentist test with the null hypothesis that a unit root is present in our data, indicating a lack of stationarity. The test statistic is negative, with a more negative result indicating a stronger rejection of the null hypothesis, i.e. stronger evidence of stationarity.

Breusch-Pagan Test: A chi-squared (Lagrange multiplier) test with the null hypothesis that our data are homoskedastic with respect to the given variable. Larger values of the test statistic indicate stronger evidence against that null, i.e. more evidence of heteroskedasticity; since constant variance of returns over time is what we want in financial data, smaller values are better here.

To summarize, the JB and ADF null hypotheses are all firmly rejected (p-values of 0). The BP p-values come back as nan, most likely because the regressor matrix passed to the test contains only a time trend and no constant column, so those results should be read with some caution. The JB test statistic decreases for each of our constructions, with lower values indicating a greater degree of normality. The ADF test statistic grows in the negative direction, indicating a firmer rejection of the unit-root null and a better argument for stationarity in our data. The BP test statistic grows, which under the standard reading of the test points toward more, not less, evidence of heteroskedasticity, so on this property the more complex bars do not obviously improve and the test setup deserves revisiting.

In [26]:
ax1 = plt.subplot2grid((12, 2), (0, 0), colspan=1, rowspan=3)
ax2 = plt.subplot2grid((12, 2), (0, 1), colspan=1, rowspan=3)
ax3 = plt.subplot2grid((12, 2), (4, 0), colspan=1, rowspan=3)
ax4 = plt.subplot2grid((12, 2), (4, 1), colspan=1, rowspan=3)
ax5 = plt.subplot2grid((12, 2), (8, 0), colspan=1, rowspan=3)
ax6 = plt.subplot2grid((12, 2), (8, 1), colspan=1, rowspan=3)

ax1.plot(price_returns)
ax1.set_ylim(-0.021, 0.021)
ax1.set_title('Minutely Returns of SPY')
ax1.set_xlabel('Date')
ax1.set_ylabel('Returns')

ax2 = sns.distplot(price_returns, fit=norm, ax=ax2, kde=False)
ax2.set_ylim(0, 350)
ax2.set_xlim(-0.01, 0.01)
ax2.set_title('Distribution of Minutely SPY Returns')
ax2.set_xlabel('Return')
ax2.set_ylabel('Instances')

ax3.plot(volume_returns)
ax3.set_ylim(-0.021, 0.021)
ax3.set_title('Returns of SPY per Volume Bar')
ax3.set_xlabel('Date')
ax3.set_ylabel('Returns')

ax4 = sns.distplot(volume_returns, fit=norm, ax=ax4, kde=False)
ax4.set_ylim(0, 350)
ax4.set_xlim(-0.01, 0.01)
ax4.set_title('Distribution of SPY Returns per Volume Bar')
ax4.set_xlabel('Return')
ax4.set_ylabel('Instances')

ax5.plot(dollar_returns)
ax5.set_ylim(-0.021, 0.021)
ax5.set_title('Returns of SPY per Dollar Bar')
ax5.set_xlabel('Date')
ax5.set_ylabel('Returns')

ax6 = sns.distplot(dollar_returns, fit=norm, ax=ax6, kde=False)
ax6.set_ylim(0, 350)
ax6.set_xlim(-0.01, 0.01)
ax6.set_title('Distribution of SPY Returns per Dollar Bar')
ax6.set_xlabel('Return')
ax6.set_ylabel('Instances')

plt.tight_layout();

In the normality and stationarity tests above, and in this quick graphical comparison, we see that our increasingly complex bar constructions are progressively better candidates for a machine learning model. This may seem subtle, but a strong foundation is vital to producing models that pick up on real patterns. Thinking back to the strange analogy of facial recognition across sci-fi human evolution, we need to pick up on what is true for humans now and for humans millions or billions of years from now. Just looking at a bunch of faces at one point in time will fail at this, but isolating key factors in millions of faces over long spans of time will allow a clever model to pick up on consistencies. This is our task, and without data transformations like these it is likely an impossible one.

5: Next Steps

Perhaps the most important change that could be made is using tick data as our base structure. This would allow for smaller volume thresholds while maintaining the information in our bars, increasing our sample size and reducing the variability of volume/values between bars. Incorporating information-driven bars as described in Advances in Financial Machine Learning would be interesting as well, but again relies on tick data. Additional bar types might be useful, so posing novel ideas is encouraged!

Additionally, testing for serial correlation as an additional statistic would be useful; however, this notebook is already quite cluttered. Implementing the statistical tests locally, specialized for financial data where possible, would also improve the consistency and power of our conclusions. Finally, the very idea of using statistical testing in this way is a bit non-standard, so additional research must be done on the robustness of this strategy in general.
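
As a starting point, here is a minimal sketch (not part of the original analysis) of that serial-correlation check, using the lag-1 autocorrelation of each of the equal-length return series constructed above.

# Sketch: lag-1 serial correlation of each (equal-length) return series
print('Minutely lag-1 autocorrelation  ', acf(price_returns, nlags=1)[1])
print('Volume bar lag-1 autocorrelation', acf(volume_returns, nlags=1)[1])
print('Dollar bar lag-1 autocorrelation', acf(dollar_returns, nlags=1)[1])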

Furthermore, there are still several necessary components of a thorough machine learning platform. Fractional differentiation, as opposed to simple 1-differentiation (prices -> returns), will further allow us to balance the ratio of information to stationarity. The triple-barrier method will allow us to label observations for both trade direction and sizing. Finally, the methods in scikit-learn must be adapted to work with our other methods and data structures.
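
For a flavor of the first of those, here is a minimal sketch of fixed-window fractional differencing using the binomial-series weights of (1 - B)^d; frac_diff_weights is a hypothetical helper, and the window length and d = 0.5 are arbitrary choices for illustration.

def frac_diff_weights(d, window):
    # Binomial-series weights for (1 - B)^d; returned oldest-first so they
    # align with a rolling window (newest observation last)
    w = [1.0]
    for k in range(1, window):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w[::-1])

window  = 20
weights = frac_diff_weights(0.5, window)
frac_diff_prices = prices['close_price'].rolling(window).apply(lambda x: np.dot(weights, x))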

This will result in a model that can make trades, though more work must be done to lay the groundwork to effectively backtest it, never mind researching an actual signal to trade on. Unsurprisingly, it is fairly complicated to create a machine learning trading algorithm, but we can at least scratch the surface and lay the groundwork/software for the research process.

We will be continuing this internally; however, any input from the community is highly welcome. (Michael's vectorized volume bar solution in particular was excellent. Sorry we didn't end up using it, but from a programming and runtime perspective it certainly put the one presented here to shame!)

Good luck researching!