Notebook

Naive Bayes High Low Return Prediction Analysis based on Thomas Wiecki 's Post

German Hernandez

Based on the post Machine Learning on Quantopian - Thomas Wiecki in Quantopian

This Notebook uses a Gaussian Naive Bayes model to predict if a stock will have a return n_fwd_days after that will be in the top percentile% of returns (class 1) of the lower percentile% (class -1) using as input variables the returns of 1,2,3,,4,5,6,7,8,9 and 10 days before .

In [1]:
n_fwd_days = 5 # number of days to compute returns over

percentile = 25 # target percetile of the prediction

We use dayly returns of the good quality Quantopian tradable stocks QTradableStocksUS() in a period of between the start and end dates.

In [2]:
from quantopian.pipeline.filters import QTradableStocksUS
universe = QTradableStocksUS()

import pandas as pd
start = pd.Timestamp("2018-05-26")
end = pd.Timestamp("2018-09-26")

We use the Quantopian Pipeline API that allows to build preprocesing filters form multiple stokcs to calcute the decision variables that we want use in trading algorithm.

We import the Returns function form pipeline because our input variables are past returns and our predicted class depends on the n_fwd_days ahead return.

Building the training and testing data set

In [3]:
from quantopian.pipeline.factors import Returns

We define the function make_factors() that define the fucntions that will calculate the input variables for the classfication, in Quantopian the input variables used to make decision in trading algorithsm are called factors

We define a function inside make_factors() for ecah one of the 1,2,3,,4,5,6,7,8,9 and 10 previous returns that we are using as input variables, in order to to this we call Returns(), one of the Built-in Factors in the the [Quantopian Pipeline API] Returns is only called with window_lengt parameter(number of days of caculate te return) so is using the default inputs= [USEquityPricing.close] but returns can be used other inputs like inputs=[USEquityPricing.open]

The function make_factors() returns a list of names and pinter to the fucntions that will be used to buld the pipeline that calculates the input variables.

In [4]:
def make_factors():
    def Asset_Growth_1d():
        return Returns(window_length=2)
    def Asset_Growth_2d():
        return Returns(window_length=3)
    def Asset_Growth_3d():
        return Returns(window_length=4)
    def Asset_Growth_4d():
        return Returns(window_length=5)
    def Asset_Growth_5d():
        return Returns(window_length=6)
    def Asset_Growth_6d():
        return Returns(window_length=7)
    def Asset_Growth_7d():
        return Returns(window_length=8)
    def Asset_Growth_8d():
        return Returns(window_length=9)
    def Asset_Growth_9d():
        return Returns(window_length=10)
    def Asset_Growth_10d():
        return Returns(window_length=11) 
    
    
    all_factors = {
        'Asset Growth 1d': Asset_Growth_1d,
        'Asset Growth 2d': Asset_Growth_2d,
        'Asset Growth 3d': Asset_Growth_3d,
        'Asset Growth 4d': Asset_Growth_4d,
        'Asset Growth 5d': Asset_Growth_5d,
        'Asset Growth 6d': Asset_Growth_6d,
        'Asset Growth 7d': Asset_Growth_7d,
        'Asset Growth 8d': Asset_Growth_8d,
        'Asset Growth 9d': Asset_Growth_9d,
        'Asset Growth 10d': Asset_Growth_10d
    }     

    return all_factors

factors = make_factors()

factors
Out[4]:
{'Asset Growth 10d': <function __main__.Asset_Growth_10d>,
 'Asset Growth 1d': <function __main__.Asset_Growth_1d>,
 'Asset Growth 2d': <function __main__.Asset_Growth_2d>,
 'Asset Growth 3d': <function __main__.Asset_Growth_3d>,
 'Asset Growth 4d': <function __main__.Asset_Growth_4d>,
 'Asset Growth 5d': <function __main__.Asset_Growth_5d>,
 'Asset Growth 6d': <function __main__.Asset_Growth_6d>,
 'Asset Growth 7d': <function __main__.Asset_Growth_7d>,
 'Asset Growth 8d': <function __main__.Asset_Growth_8d>,
 'Asset Growth 9d': <function __main__.Asset_Growth_9d>}

We import the Pipeline function from the Quantopian Pipeline API that build a preprocesing filters from a dictionary of factors names and pointers.

In [5]:
from quantopian.pipeline import Pipeline

We use the Pipeline to define the make_history_pipeline() that will produce the filter that will be applied to obtain build datafarem with the information of the input and target variables.

In [6]:
from quantopian.pipeline.data.builtin import USEquityPricing


def make_history_pipeline(factors, universe, n_fwd_days=5):
    
    # Build dictionary of factors names and definitions used to calculate the information of the input variables 
    factor_ranks = {name: f() for name, f in factors.iteritems()}
    
    # Add to the dictionary the factor name and definitios used to calculate the information of the target variable
    factor_ranks['Returns'] = Returns(inputs=[USEquityPricing.open],window_length=n_fwd_days)
    
    print factor_ranks
    
    pipe = Pipeline(screen=universe, columns=factor_ranks)
    
    return pipe

history_pipe = make_history_pipeline(factors, universe, n_fwd_days=n_fwd_days)

history_pipe
{'Asset Growth 4d': Returns([USEquityPricing.close], 5), 'Asset Growth 5d': Returns([USEquityPricing.close], 6), 'Asset Growth 7d': Returns([USEquityPricing.close], 8), 'Asset Growth 3d': Returns([USEquityPricing.close], 4), 'Returns': Returns([USEquityPricing.open], 5), 'Asset Growth 8d': Returns([USEquityPricing.close], 9), 'Asset Growth 2d': Returns([USEquityPricing.close], 3), 'Asset Growth 9d': Returns([USEquityPricing.close], 10), 'Asset Growth 10d': Returns([USEquityPricing.close], 11), 'Asset Growth 6d': Returns([USEquityPricing.close], 7), 'Asset Growth 1d': Returns([USEquityPricing.close], 2)}
Out[6]:
<zipline.pipeline.pipeline.Pipeline at 0x7fb278707f38>

We import the run_pipeline function from the Quantopian Pipeline API that receives a pipe, a star_date and end_date, and builds data frame with the the information of the input and target variables in that period.

In [7]:
from quantopian.research import run_pipeline

We call run_pipeline with the history_pipe between to between the start and end dates.

In [8]:
from time import time
start_timer = time()
results = run_pipeline(history_pipe, start_date=start, end_date=end)
results.index.names = ['date', 'security']
end_timer = time()
print "Time to run pipeline %.2f secs" % (end_timer - start_timer)
Time to run pipeline 6.33 secs
In [9]:
results.head()
Out[9]:
Asset Growth 10d Asset Growth 1d Asset Growth 2d Asset Growth 3d Asset Growth 4d Asset Growth 5d Asset Growth 6d Asset Growth 7d Asset Growth 8d Asset Growth 9d Returns
date security
2018-05-29 00:00:00+00:00 Equity(2 [ARNC]) 0.012630 -0.002704 -0.010730 -0.005930 0.000000 0.019348 0.015419 0.023591 0.025014 0.005453 0.015969
Equity(24 [AAPL]) -0.000053 0.001966 0.000690 0.006780 0.005117 0.012293 0.008449 0.002072 0.011370 0.002179 0.001223
Equity(41 [ARCB]) -0.002544 -0.022845 -0.003178 -0.020812 -0.045639 -0.013627 -0.026887 -0.020812 0.001597 -0.011555 -0.007254
Equity(52 [ABM]) -0.038961 0.000000 -0.005106 -0.020771 -0.026316 -0.016818 -0.029065 -0.034676 -0.033719 -0.034995 -0.025109
Equity(53 [ABMD]) 0.057736 -0.001141 -0.004295 0.020061 0.039680 0.025263 0.037600 0.031597 0.036072 0.048895 0.024788
In [10]:
results.tail()
Out[10]:
Asset Growth 10d Asset Growth 1d Asset Growth 2d Asset Growth 3d Asset Growth 4d Asset Growth 5d Asset Growth 6d Asset Growth 7d Asset Growth 8d Asset Growth 9d Returns
date security
2018-09-26 00:00:00+00:00 Equity(51542 [DNLI]) 0.150759 0.020192 0.011439 0.016284 0.038669 0.036639 0.133547 0.088205 0.168502 0.155144 0.027330
Equity(51576 [CASA]) -0.001304 -0.009056 -0.054904 -0.061275 -0.060123 -0.037688 -0.022336 -0.030993 -0.046671 -0.019834 -0.028894
Equity(51580 [NMRK]) -0.078969 -0.036522 -0.022065 0.011872 0.023084 0.004533 -0.008057 -0.038194 -0.051370 -0.074353 0.051048
Equity(51615 [LILA_K]) 0.071142 -0.004655 -0.008349 -0.008349 -0.005119 0.025912 0.045477 0.054761 0.029369 0.045477 0.034816
Equity(51618 [NTR]) 0.031267 0.014572 0.008210 -0.001730 -0.015185 -0.011813 -0.002247 0.031083 0.022498 0.025586 -0.024098

We extract, shift,mask,recode and split the information for the X_train and X_test (input variables) and the Y_train and Y_test(target) variable, using the information in the results dataframe.

We split our data into training (80%) and testing (20%).

In [11]:
import numpy as np

training = 0.8
In [12]:
results_wo_returns = results.copy()
returns = results_wo_returns.pop('Returns')
Y = returns.unstack().values
X = results_wo_returns.to_panel() 
X = X.swapaxes(2, 0).swapaxes(0, 1).values # (factors, time, stocks) -> (time, stocks, factors)
In [13]:
n_time, n_stocks, n_factors = X.shape
train_size = np.int16(np.round(training * n_time))
X_train_aux, Y_train_aux = X[:train_size, ...], Y[:train_size]
X_test_aux, Y_test_aux = X[(train_size+n_fwd_days):, ...], Y[(train_size+n_fwd_days):]

We check how many (days, stocks, varaibles) we have in the training set before fitering nans

In [14]:
n_time, n_stocks, n_factors = X_train_aux.shape
print X_train_aux.shape, n_time* n_stocks
(68, 2293, 10) 155924

We check how many (days, stocks, varaibles) we have in the testing set before fitering nans

In [15]:
n_time, n_stocks, n_factors = X_test_aux.shape
print X_test_aux.shape, n_time* n_stocks
(12, 2293, 10) 27516

We crate a helper function shift_recode_mask_data () that

  • Shift factors (input variables - returns) to align these with the future target (return n_fwd_days days ahead).
  • Recode the target Y as 1 and -1
  • Eliminates examples thathave nan values or are not in the classes 1 and -1
In [16]:
def shift_recode_mask_data(X, Y, upper_percentile=100-percentile, lower_percentile=percentile, n_fwd_days=1):
    # Shift X to match factors at t to returns at t+n_fwd_days (we want to predict future returns after all)
    shifted_X = np.roll(X, n_fwd_days+1, axis=0)
    
    # Slice off rolled elements
    X = shifted_X[n_fwd_days+1:]
    Y = Y[n_fwd_days+1:]
    
    n_time, n_stocks, n_factors = X.shape
    
    # Look for biggest up and down movers
    upper = np.nanpercentile(Y, upper_percentile, axis=1)[:, np.newaxis]
    lower = np.nanpercentile(Y, lower_percentile, axis=1)[:, np.newaxis]
  
    upper_mask = (Y >= upper)
    lower_mask = (Y <= lower)
    
    mask = upper_mask | lower_mask # This also drops nans
    mask = mask.flatten()
    
    # Only try to predict whether a stock moved up/down relative to other stocks
    Y_binary = np.zeros(n_time * n_stocks)
    Y_binary[upper_mask.flatten()] = 1
    Y_binary[lower_mask.flatten()] = -1
    
    # Flatten X
    X = X.reshape((n_time * n_stocks, n_factors))

    # Drop stocks that did not move much (i.e. are not in the upper_percentile or the lower_percentile )
    X = X[mask]
    Y_binary = Y_binary[mask]
    
    # Drop stocks with nan returns
    masknan=  ~np.isnan(X).any(axis=1)
    X = X[masknan]
    Y_binary = Y_binary[masknan]
    
    return X, Y_binary
In [17]:
X_train, Y_train = shift_recode_mask_data(X_train_aux, Y_train_aux, n_fwd_days=n_fwd_days)
X_test, Y_test = shift_recode_mask_data(X_test_aux, Y_test_aux, n_fwd_days=n_fwd_days, 
                                             lower_percentile=50, 
                                             upper_percentile=50)

We check how many examples we have in the traning and testing set after applying shift_recode_mask_data ()

In [18]:
X_train.shape, X_test.shape
Out[18]:
((67493, 10), (13133, 10))

Explore the data visualy

In [19]:
import matplotlib.pyplot as plt

X = X_train
Y = Y_train

color = [] 
for i in range(len(Y)):
    if Y[i] == 1:
        color.append('green')
    else:
        color.append('red')
        
In [20]:
plt.subplot(3, 3, 1)
plt.scatter(X[:, 0], X[:, 1], c=color, alpha= 0.6, s=10, edgecolor='k') 
plt.xlabel('Asset Growth 10d')
plt.ylabel('Asset Growth 1d')

plt.subplot(3, 3, 2)
plt.scatter(X[:, 0], X[:, 2], c=color, alpha= 0.6, s=10, edgecolor='k')  
plt.xlabel('Asset Growth 10d')
plt.ylabel('Asset Growth 2d')

plt.subplot(3, 3, 3)
plt.scatter(X[:, 0], X[:, 3], c=color, alpha= 0.6, s=10, edgecolor='k') 
plt.xlabel('Asset Growth 10d')
plt.ylabel('Asset Growth 3d')

plt.subplot(3, 3, 4)
plt.scatter(X[:, 0], X[:, 4], c=color, alpha= 0.6, s=10, edgecolor='k') 
plt.xlabel('Asset Growth 10d')
plt.ylabel('Asset Growth 4d')

plt.subplot(3, 3, 5)
plt.scatter(X[:, 0], X[:, 5], c=color, alpha= 0.6, s=10, edgecolor='k') 
plt.xlabel('Asset Growth 10d')
plt.ylabel('Asset Growth 5d')

plt.subplot(3, 3, 6)
plt.scatter(X[:, 0], X[:, 6], c=color, alpha= 0.6, s=10, edgecolor='k') 
plt.xlabel('Asset Growth 10d')
plt.ylabel('Asset Growth 6d')

plt.subplot(3, 3, 7)
plt.scatter(X[:, 0], X[:, 7], c=color, alpha= 0.6, s=10, edgecolor='k')  
plt.xlabel('Asset Growth 10d')
plt.ylabel('Asset Growth 7d')

plt.subplot(3, 3, 8)
plt.scatter(X[:, 0], X[:, 8], c=color, alpha= 0.6, s=10, edgecolor='k') 
plt.xlabel('Asset Growth 10d')
plt.ylabel('Asset Growth 8d')

plt.subplot(3, 3, 9)
plt.scatter(X[:, 0], X[:, 9], c=color, alpha= 0.6, s=10, edgecolor='k') 
plt.xlabel('Asset Growth 10d')
plt.ylabel('Asset Growth 9d')
Out[20]:
<matplotlib.text.Text at 0x7fb26a7da390>
In [21]:
x = X[:,0]

mask = (Y == 1)
xg = x[mask]

mask = (Y == -1)
xr = x[mask]

xlim = (-2, 2)
bins = np.linspace(xlim[0], xlim[1], 200)

plt.hist(xr, bins, alpha=0.6,  histtype='stepfilled', label='red',  color='red')
plt.hist(xg, bins, alpha=0.6,  histtype='stepfilled', label='green', color='green' )
plt.legend(loc='upper right')
plt.xlabel('Asset Growth 10d')
plt.ylabel('Probability by class ')
plt.show()
In [22]:
import pandas as pd
labels = ['Asset Growth 10d',
          'Asset Growth 1d',
          'Asset Growth 2d',
          'Asset Growth 3d',
          'Asset Growth 4d',
          'Asset Growth 5d',
          'Asset Growth 6d',
          'Asset Growth 7d',
          'Asset Growth 8d',
          'Asset Growth 9d']      
df = pd.DataFrame(X_train, columns=labels)
df['target'] = Y_train
df.sample(20)
Out[22]:
Asset Growth 10d Asset Growth 1d Asset Growth 2d Asset Growth 3d Asset Growth 4d Asset Growth 5d Asset Growth 6d Asset Growth 7d Asset Growth 8d Asset Growth 9d target
59810 0.065367 -0.021075 -0.031283 -0.048156 0.054484 0.061714 0.094229 0.099408 0.091657 0.026519 -1.0
61216 -0.045216 -0.021024 -0.028097 -0.039915 -0.048218 -0.047469 -0.048717 -0.051202 -0.049215 -0.049463 -1.0
2681 0.002132 0.002132 0.000532 0.001065 0.002132 0.005348 0.007503 0.010753 0.009667 0.006424 -1.0
22392 -0.148418 -0.050204 -0.062918 -0.090909 -0.083170 -0.100257 -0.062918 -0.108280 -0.119497 -0.126092 -1.0
43004 -0.042237 -0.003744 -0.063365 -0.099707 -0.084423 -0.082740 -0.082107 -0.081896 -0.065996 -0.054265 1.0
22941 -0.063342 0.002836 -0.008171 -0.001963 0.020068 -0.015494 -0.054742 -0.060203 -0.052650 -0.050549 1.0
37685 0.030153 -0.000577 -0.003741 -0.002114 -0.004219 -0.015732 0.005811 0.020236 0.028113 0.024462 1.0
46896 -0.010240 -0.019136 -0.010529 -0.019988 -0.032599 -0.024510 -0.042728 -0.036457 -0.032461 -0.014277 1.0
37653 0.002799 -0.001045 -0.021281 -0.018153 -0.033273 -0.043062 -0.030767 -0.016581 -0.004630 -0.000116 -1.0
57326 0.050147 0.029894 0.075529 0.186667 0.149623 0.109034 0.153348 0.057426 0.046033 0.039922 -1.0
13621 0.098563 -0.011091 0.009434 0.020992 0.044922 0.036822 0.063618 0.087398 0.097436 0.112266 -1.0
56168 -0.002304 -0.024775 -0.022573 0.040865 0.021226 0.082500 0.021226 0.014052 0.043373 -0.103520 -1.0
2586 0.053876 0.008593 0.033052 0.022400 0.042792 0.047732 0.044876 0.052333 0.059116 0.058142 -1.0
55587 0.038585 -0.027108 -0.039915 -0.042558 -0.011369 0.002883 -0.018024 -0.015458 0.005112 0.023540 1.0
51871 0.445031 0.415159 0.421280 0.469589 0.457853 0.441228 0.468275 0.479514 0.497721 0.497721 1.0
25372 0.093347 0.008563 0.005216 0.084954 0.057357 0.041257 0.027132 0.069087 0.089414 0.079430 -1.0
54689 0.043006 0.001064 0.056117 0.068104 0.058493 -0.001062 0.008574 0.064480 0.068104 0.065685 -1.0
12301 0.072311 0.002398 0.008760 0.017951 0.019606 0.039900 0.026469 0.054673 0.058730 0.041436 1.0
14335 0.033461 0.034450 0.029524 0.015023 -0.006434 -0.007346 -0.010979 0.013121 0.010280 0.024645 1.0
21328 0.072544 -0.032053 -0.030620 -0.003044 0.009005 0.018173 0.034248 0.035758 0.028751 0.046963 -1.0
In [23]:
import seaborn as sns
sns.set()
palette = ['#FF0000','#00FF00']
sns.set_palette(palette)
sns.pairplot(df, vars=labels,  hue='target', diag_kind = 'kde', plot_kws = {'alpha': 0.6, 's': 10, 'edgecolor': 'k'});

Train the Gaussian Naive Bayes model

In [24]:
start_timer = time()

# Train classifier
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train, Y_train);

end_timer = time()
In [25]:
print "Time to train : %0.2f secs" % (end_timer - start_timer)
Time to train : 0.01 secs
In [26]:
from sklearn import metrics

Y_pred = clf.predict(X_train)
print('Accuracy on train set = {:.2f}%'.format(metrics.accuracy_score(Y_train, Y_pred) * 100))
Accuracy on train set = 51.63%
In [27]:
# Predict!
Y_pred = clf.predict(X_test)

Y_pred_prob = clf.predict_proba(X_test)
In [28]:
print 'Predictions:', Y_pred
print 'Probabilities of class == 1:', Y_pred_prob[:, 1] * 100
Predictions: [-1.  1. -1. ..., -1. -1.  1.]
Probabilities of class == 1: [ 45.69700146  51.26965457  36.74938274 ...,  48.82076307  41.88956584
  78.2584203 ]
In [29]:
print('Accuracy on test set = {:.2f}%'.format(metrics.accuracy_score(Y_test, Y_pred) * 100))
print('Log-loss = {:.5f}'.format(metrics.log_loss(Y_test, Y_pred_prob)))
Accuracy on test set = 53.12%
Log-loss = 0.71714