German Hernandez
Based on the post "Machine Learning on Quantopian" by Thomas Wiecki on Quantopian.
This notebook uses a Gaussian Naive Bayes model to predict whether a stock's return n_fwd_days ahead will be in the top percentile% of returns (class 1) or in the bottom percentile% of returns (class -1), using as input variables the returns over the previous 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10 days.
n_fwd_days = 5 # number of days to compute returns over
percentile = 25 # target percentile of the prediction
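As a small sketch of this labeling scheme (hypothetical returns, not Quantopian data): on each day, returns at or above the 75th cross-sectional percentile become class 1, returns at or below the 25th percentile become class -1, and everything in between is discarded before training.

```python
import numpy as np

# Hypothetical cross-section of 8 stock returns on one day
returns = np.array([0.05, -0.03, 0.01, 0.10, -0.08, 0.00, 0.02, -0.01])

percentile = 25
upper = np.nanpercentile(returns, 100 - percentile)  # top-quartile cutoff
lower = np.nanpercentile(returns, percentile)        # bottom-quartile cutoff

labels = np.zeros_like(returns)
labels[returns >= upper] = 1    # big winners -> class 1
labels[returns <= lower] = -1   # big losers  -> class -1
# Stocks in the middle keep label 0 and are dropped before training
print(list(labels))  # [1.0, -1.0, 0.0, 1.0, -1.0, 0.0, 0.0, 0.0]
```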
We use daily returns of the high-quality tradable stocks in Quantopian's QTradableStocksUS() universe over the period between the start and end dates.
from quantopian.pipeline.filters import QTradableStocksUS
universe = QTradableStocksUS()
import pandas as pd
start = pd.Timestamp("2018-05-26")
end = pd.Timestamp("2018-09-26")
We use the Quantopian Pipeline API, which allows us to build preprocessing computations over multiple stocks to calculate the decision variables that we want to use in a trading algorithm.
We import the Returns factor from the pipeline because our input variables are past returns and our predicted class depends on the return n_fwd_days ahead.
from quantopian.pipeline.factors import Returns
We define the function make_factors(), which defines the functions that calculate the input variables for classification. In Quantopian, the input variables used to make decisions in trading algorithms are called factors.
We define a function inside make_factors() for each of the 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10 previous-day returns that we are using as input variables. To do this we call Returns(), one of the built-in factors in the Quantopian Pipeline API. Returns is called here with only the window_length parameter (the number of days over which to calculate the return), so it uses the default inputs=[USEquityPricing.close]; Returns can also be used with other inputs, such as inputs=[USEquityPricing.open].
The function make_factors() returns a dictionary of names and pointers to the functions that will be used to build the pipeline that calculates the input variables.
def make_factors():
    def Asset_Growth_1d():
        return Returns(window_length=2)
    def Asset_Growth_2d():
        return Returns(window_length=3)
    def Asset_Growth_3d():
        return Returns(window_length=4)
    def Asset_Growth_4d():
        return Returns(window_length=5)
    def Asset_Growth_5d():
        return Returns(window_length=6)
    def Asset_Growth_6d():
        return Returns(window_length=7)
    def Asset_Growth_7d():
        return Returns(window_length=8)
    def Asset_Growth_8d():
        return Returns(window_length=9)
    def Asset_Growth_9d():
        return Returns(window_length=10)
    def Asset_Growth_10d():
        return Returns(window_length=11)
    all_factors = {
        'Asset Growth 1d': Asset_Growth_1d,
        'Asset Growth 2d': Asset_Growth_2d,
        'Asset Growth 3d': Asset_Growth_3d,
        'Asset Growth 4d': Asset_Growth_4d,
        'Asset Growth 5d': Asset_Growth_5d,
        'Asset Growth 6d': Asset_Growth_6d,
        'Asset Growth 7d': Asset_Growth_7d,
        'Asset Growth 8d': Asset_Growth_8d,
        'Asset Growth 9d': Asset_Growth_9d,
        'Asset Growth 10d': Asset_Growth_10d
    }
    return all_factors
factors = make_factors()
factors
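To see why Asset_Growth_1d uses window_length=2, note that Returns(window_length=n) computes the percent change from the first to the last close in an n-day window, i.e. an (n-1)-day return. A minimal sketch of the same arithmetic with pandas (hypothetical prices, not the Quantopian factor itself):

```python
import pandas as pd

# Hypothetical closing prices for one stock on consecutive days
close = pd.Series([100.0, 102.0, 99.0, 105.0])

window_length = 2
# Percent change from the first to the last price in each window of
# `window_length` observations: a 1-day return, matching the intent
# of Asset_Growth_1d above
returns_1d = close.pct_change(periods=window_length - 1)
print(returns_1d.tolist())
```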
We import the Pipeline class from the Quantopian Pipeline API, which builds a preprocessing computation from a dictionary of factor names and definitions.
from quantopian.pipeline import Pipeline
We use Pipeline to define make_history_pipeline(), which produces the pipeline that will be run to build a dataframe with the information for the input and target variables.
from quantopian.pipeline.data.builtin import USEquityPricing
def make_history_pipeline(factors, universe, n_fwd_days=5):
    # Build a dictionary of factor names and definitions used to calculate the input variables
    factor_ranks = {name: f() for name, f in factors.iteritems()}
    # Add to the dictionary the factor name and definition used to calculate the target variable
    factor_ranks['Returns'] = Returns(inputs=[USEquityPricing.open],
                                      window_length=n_fwd_days)
    print factor_ranks
    pipe = Pipeline(screen=universe, columns=factor_ranks)
    return pipe
history_pipe = make_history_pipeline(factors, universe, n_fwd_days=n_fwd_days)
history_pipe
We import the run_pipeline function from the Quantopian Research API, which receives a pipeline, a start_date, and an end_date, and builds a dataframe with the information for the input and target variables over that period.
from quantopian.research import run_pipeline
We call run_pipeline with history_pipe between the start and end dates.
from time import time
start_timer = time()
results = run_pipeline(history_pipe, start_date=start, end_date=end)
results.index.names = ['date', 'security']
end_timer = time()
print "Time to run pipeline %.2f secs" % (end_timer - start_timer)
results.head()
results.tail()
We extract, shift, mask, recode, and split the information for X_train and X_test (input variables) and Y_train and Y_test (target variable), using the information in the results dataframe.
We split our data into training (80%) and testing (20%).
import numpy as np
training = 0.8
results_wo_returns = results.copy()
returns = results_wo_returns.pop('Returns')
Y = returns.unstack().values
X = results_wo_returns.to_panel()
X = X.swapaxes(2, 0).swapaxes(0, 1).values # (factors, time, stocks) -> (time, stocks, factors)
n_time, n_stocks, n_factors = X.shape
train_size = np.int16(np.round(training * n_time))
X_train_aux, Y_train_aux = X[:train_size, ...], Y[:train_size]
X_test_aux, Y_test_aux = X[(train_size+n_fwd_days):, ...], Y[(train_size+n_fwd_days):]
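The swapaxes calls above reorder the panel's axes. A small sketch with a toy array (toy shapes, not the real data) shows the (factors, time, stocks) -> (time, stocks, factors) reordering:

```python
import numpy as np

n_factors, n_time, n_stocks = 3, 4, 5
panel_values = np.arange(n_factors * n_time * n_stocks).reshape(
    (n_factors, n_time, n_stocks))  # axes: (factors, time, stocks)

# Same reordering as above: (factors, time, stocks) -> (time, stocks, factors)
reordered = panel_values.swapaxes(2, 0).swapaxes(0, 1)
print(reordered.shape)  # (4, 5, 3)

# Element check: factor f, day t, stock s ends up at [t, s, f]
assert reordered[2, 1, 0] == panel_values[0, 2, 1]
```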
We check how many (days, stocks, variables) we have in the training set before filtering NaNs.
n_time, n_stocks, n_factors = X_train_aux.shape
print X_train_aux.shape, n_time* n_stocks
We check how many (days, stocks, variables) we have in the testing set before filtering NaNs.
n_time, n_stocks, n_factors = X_test_aux.shape
print X_test_aux.shape, n_time* n_stocks
We create a helper function shift_recode_mask_data() that shifts X so that the factors at day t are aligned with the return n_fwd_days ahead, recodes the returns into class 1 (top percentile) and class -1 (bottom percentile), and masks out the stocks in between as well as rows with NaN factor values.
def shift_recode_mask_data(X, Y, upper_percentile=100-percentile,
                           lower_percentile=percentile, n_fwd_days=1):
    # Shift X to match factors at t to returns at t+n_fwd_days (we want to predict future returns, after all)
    shifted_X = np.roll(X, n_fwd_days+1, axis=0)
    # Slice off rolled elements
    X = shifted_X[n_fwd_days+1:]
    Y = Y[n_fwd_days+1:]
    n_time, n_stocks, n_factors = X.shape
    # Look for the biggest up and down movers
    upper = np.nanpercentile(Y, upper_percentile, axis=1)[:, np.newaxis]
    lower = np.nanpercentile(Y, lower_percentile, axis=1)[:, np.newaxis]
    upper_mask = (Y >= upper)
    lower_mask = (Y <= lower)
    mask = upper_mask | lower_mask  # This also drops NaNs
    mask = mask.flatten()
    # Only try to predict whether a stock moved up/down relative to other stocks
    Y_binary = np.zeros(n_time * n_stocks)
    Y_binary[upper_mask.flatten()] = 1
    Y_binary[lower_mask.flatten()] = -1
    # Flatten X
    X = X.reshape((n_time * n_stocks, n_factors))
    # Drop stocks that did not move much (i.e. are not in the upper or lower percentile)
    X = X[mask]
    Y_binary = Y_binary[mask]
    # Drop rows with NaN factor values
    masknan = ~np.isnan(X).any(axis=1)
    X = X[masknan]
    Y_binary = Y_binary[masknan]
    return X, Y_binary
X_train, Y_train = shift_recode_mask_data(X_train_aux, Y_train_aux, n_fwd_days=n_fwd_days)
X_test, Y_test = shift_recode_mask_data(X_test_aux, Y_test_aux, n_fwd_days=n_fwd_days,
lower_percentile=50,
upper_percentile=50)
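The np.roll step inside shift_recode_mask_data() can be illustrated on a toy array (toy values, not the real factors): rolling X forward by n_fwd_days+1 rows and then slicing off the wrapped-around rows pairs each day's factors with the return reported n_fwd_days+1 rows later.

```python
import numpy as np

n_fwd_days = 2
days = np.arange(8)
X = days.reshape(-1, 1).astype(float)  # toy "factors": just the day index
Y = days.astype(float) * 10            # toy "returns": day index times 10

# Same shift-and-slice as in shift_recode_mask_data()
shifted_X = np.roll(X, n_fwd_days + 1, axis=0)
X_aligned = shifted_X[n_fwd_days + 1:]
Y_aligned = Y[n_fwd_days + 1:]

# Factors from day i are now paired with the return reported on day i + n_fwd_days + 1
print(list(zip(X_aligned[:, 0], Y_aligned)))  # (0, 30), (1, 40), ...
```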
We check how many examples we have in the training and testing sets after applying shift_recode_mask_data().
X_train.shape, X_test.shape
import matplotlib.pyplot as plt
X = X_train
Y = Y_train
# Color each example by its class: green for class 1, red for class -1
color = ['green' if y == 1 else 'red' for y in Y]
# Plot each input variable against 'Asset Growth 10d' (column 0)
for i in range(1, 10):
    plt.subplot(3, 3, i)
    plt.scatter(X[:, 0], X[:, i], c=color, alpha=0.6, s=10, edgecolor='k')
    plt.xlabel('Asset Growth 10d')
    plt.ylabel('Asset Growth %dd' % i)
x = X[:,0]
mask = (Y == 1)
xg = x[mask]
mask = (Y == -1)
xr = x[mask]
xlim = (-2, 2)
bins = np.linspace(xlim[0], xlim[1], 200)
plt.hist(xr, bins, alpha=0.6, histtype='stepfilled', label='red', color='red')
plt.hist(xg, bins, alpha=0.6, histtype='stepfilled', label='green', color='green' )
plt.legend(loc='upper right')
plt.xlabel('Asset Growth 10d')
plt.ylabel('Count by class')
plt.show()
import pandas as pd
labels = ['Asset Growth 10d',
'Asset Growth 1d',
'Asset Growth 2d',
'Asset Growth 3d',
'Asset Growth 4d',
'Asset Growth 5d',
'Asset Growth 6d',
'Asset Growth 7d',
'Asset Growth 8d',
'Asset Growth 9d']
df = pd.DataFrame(X_train, columns=labels)
df['target'] = Y_train
df.sample(20)
import seaborn as sns
sns.set()
palette = ['#FF0000','#00FF00']
sns.set_palette(palette)
sns.pairplot(df, vars=labels, hue='target', diag_kind = 'kde', plot_kws = {'alpha': 0.6, 's': 10, 'edgecolor': 'k'});
start_timer = time()
# Train classifier
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train, Y_train);
end_timer = time()
print "Time to train : %0.2f secs" % (end_timer - start_timer)
from sklearn import metrics
Y_pred = clf.predict(X_train)
print('Accuracy on train set = {:.2f}%'.format(metrics.accuracy_score(Y_train, Y_pred) * 100))
# Predict!
Y_pred = clf.predict(X_test)
Y_pred_prob = clf.predict_proba(X_test)
print 'Predictions:', Y_pred
print 'Probabilities of class == 1:', Y_pred_prob[:, 1] * 100
print('Accuracy on test set = {:.2f}%'.format(metrics.accuracy_score(Y_test, Y_pred) * 100))
print('Log-loss = {:.5f}'.format(metrics.log_loss(Y_test, Y_pred_prob)))
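For intuition about the log-loss figure, the metric can be sketched by hand on toy probabilities (toy numbers, following the same negative-log-likelihood convention as metrics.log_loss): it is the mean negative log of the probability the model assigned to the true class.

```python
import numpy as np

# Toy true labels (1 or -1) and toy predicted probabilities of class 1
y_true = np.array([1, -1, 1])
p_class1 = np.array([0.9, 0.2, 0.6])

# Probability assigned to the true class of each example
p_true = np.where(y_true == 1, p_class1, 1 - p_class1)

# Log-loss = mean negative log-likelihood of the true class
logloss = -np.mean(np.log(p_true))
print(round(logloss, 5))  # 0.27978
```

A perfect classifier that always assigns probability 1 to the true class would score 0; less confident or wrong predictions push the value up.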