Nice ideas and a good base for further experiments! I haven't been active for a while, so I am not sure whether my observations are correct, but maybe they are helpful for you guys:
(1) I think the prediction of the current day's price change (Y, index i+context.lookback) may be based on price changes that include the current day (X, window ending at i+context.lookback).
# For each day in our history
for i in range(context.history_range-context.lookback-1):
X.append(price_changes[i:i+context.lookback]) # Store prior price changes including the day's price change
Y.append(price_changes[i+context.lookback]) # Store the day's price change
Would Y.append(price_changes[i+context.lookback+1]) work better? Or even predicting several days ahead, not only the next one?
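One quick way to check the indexing is on synthetic data where the change on day d is simply d (plain Python; lookback and the small price_changes list are stand-ins for the context variables):

```python
# Synthetic check of the window/target indexing from the snippet above.
price_changes = list(range(10))  # change on day d is just d
lookback = 3
X, Y = [], []
for i in range(len(price_changes) - lookback - 1):
    X.append(price_changes[i:i + lookback])  # slice covers days i .. i+lookback-1
    Y.append(price_changes[i + lookback])    # target day i+lookback
print(X[0], Y[0])  # → [0, 1, 2] 3
```

So one can read off directly from the first sample whether the target day already lies outside the window or not.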
(2) In the last example, a prediction is made for the sum of price and volume changes. However, it would be sufficient to know whether the price is going up or down, so Y could be simplified.
#Old
Y.append(price_changes[i+context.lookback] + volume_changes[i+context.lookback]) # Store the day's volume change
#New
Y.append(price_changes[i+context.lookback+1]) # Store the day's price change only
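If only the direction matters, another option is a classifier on the sign of the next change instead of a regressor. A minimal sketch with scikit-learn and synthetic data (the variable names are placeholders, not from the original algorithm):

```python
# Sketch: predict only the direction (1 = up, 0 = down) of the next change.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
price_changes = rng.normal(size=200)  # synthetic daily price changes
lookback = 5
n = len(price_changes) - lookback - 1
X = [price_changes[i:i + lookback] for i in range(n)]
Y = [1 if price_changes[i + lookback] > 0 else 0 for i in range(n)]

clf = GradientBoostingClassifier(n_estimators=50)
clf.fit(X, Y)
print(clf.predict([X[0]]))  # 0 = down, 1 = up
```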
Also, the variable names got swapped here:
rfr = GradientBoostingRegressor(learning_rate = 0.1, n_estimators = 150)
gbr = RandomForestRegressor()
(3) GBR can give better results if its parameters (e.g. the learning rate) are slightly adapted, e.g.
gbr = GradientBoostingRegressor(learning_rate = 0.01, n_estimators = 150, max_depth = 4, min_samples_split = 2)
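Instead of hand-picking those values, a small grid search can adapt them to the data. A sketch with a recent scikit-learn and synthetic data (the parameter grid is just an illustration):

```python
# Sketch: pick GBR hyperparameters via cross-validated grid search.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))                       # synthetic features
Y = X[:, 0] * 0.5 + rng.normal(scale=0.1, size=120)  # synthetic target

grid = GridSearchCV(
    GradientBoostingRegressor(n_estimators=150),
    param_grid={'learning_rate': [0.01, 0.1], 'max_depth': [3, 4]},
    cv=3,
)
grid.fit(X, Y)
print(grid.best_params_)
```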
(4) The mean forecast error can be measured without hurting the final model by fitting twice: once on a train/test split for evaluation, and once on all data for trading
# Generate our models
rfr = RandomForestRegressor()
gbr = GradientBoostingRegressor(learning_rate = 0.01, n_estimators = 150, max_depth = 4, min_samples_split = 2)
# Test our models on independent test data
offset = int(len(X) * 0.8)
X_train, Y_train = X[:offset], Y[:offset]
X_test, Y_test = X[offset:], Y[offset:]
rfr.fit(X_train, Y_train)
rfr_me = math.sqrt(mean_squared_error(Y_test, rfr.predict(X_test)))
context.rfr_me[idx] = rfr_me
gbr.fit(X_train, Y_train)
gbr_me = math.sqrt(mean_squared_error(Y_test, gbr.predict(X_test)))
context.gbr_me[idx] = gbr_me
# Fit our models with all data
rfr.fit(X, Y)
gbr.fit(X, Y)
Recording the ratio of the two errors can then show that GBR is slightly better:
#record(mean_error_rfr = context.rfr_me[idx])
#record(mean_error_gbr = context.gbr_me[idx])
record(mean_error_gbr_rfr_ratio = context.gbr_me[idx]/context.rfr_me[idx])
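For reference, here is a self-contained version of the same check on synthetic data, runnable outside the backtester (record() and the context variables are replaced by a plain print):

```python
# Sketch: evaluate RMSE on a 20% hold-out, then refit on all data.
import math
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
Y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

offset = int(len(X) * 0.8)
rfr = RandomForestRegressor(random_state=0)
gbr = GradientBoostingRegressor(learning_rate=0.01, n_estimators=150,
                                max_depth=4, random_state=0)
for model in (rfr, gbr):
    model.fit(X[:offset], Y[:offset])

rfr_me = math.sqrt(mean_squared_error(Y[offset:], rfr.predict(X[offset:])))
gbr_me = math.sqrt(mean_squared_error(Y[offset:], gbr.predict(X[offset:])))
print(gbr_me / rfr_me)  # ratio < 1 means GBR did better on the hold-out

# Refit on all data for actual trading
rfr.fit(X, Y)
gbr.fit(X, Y)
```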
(5) Finally, it seems that training is done on closing prices, while trading also sees the opening price, since schedule_function starts the trading every morning and:
recent_prices = data.history(security, 'price', context.lookback+1, '1d').values
Maybe training should also include the open price in the last data point of every window.
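A sketch of that idea with plain numpy and made-up numbers; current_price stands in for an intraday quote (e.g. something like data.current on Quantopian, which is an assumption on my side):

```python
# Sketch: end the feature window at "now" instead of at yesterday's close.
import numpy as np

recent_closes = np.array([100.0, 101.0, 99.5, 100.5])  # synthetic daily closes
current_price = 101.2                                  # hypothetical intraday quote

changes = np.diff(recent_closes) / recent_closes[:-1]  # close-to-close changes
last_change = (current_price - recent_closes[-1]) / recent_closes[-1]
features = np.append(changes[1:], last_change)         # window now ends intraday
print(features)
```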
Have fun,
Frank