It never took off (as far as I know), but Q had an effort that presumably would have required some ML:
https://www.quantopian.com/posts/the-101-alphas-project
The motivating paper can be found here:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2701346
Pages 8-15 list all of the 101 alphas. For example, we have expressions like:
Alpha#59: (-1 * Ts_Rank(decay_linear(correlation(IndNeutralize(((vwap * 0.728317) + (vwap * (1 - 0.728317))), IndClass.industry), volume, 4.25197), 16.2289), 8.19648))
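For concreteness, here is a minimal pandas sketch of the operator primitives that expression uses, following the definitions in the paper's appendix. The function names, the rounding of fractional lookbacks, and the data layout (DataFrames of dates x stocks, plus an industry mapping) are my assumptions, not Quantopian's or the authors' code:

```python
# Hypothetical sketch of the Alpha#59 primitives; vwap and volume are assumed
# to be pandas DataFrames (dates x tickers), industry a Series mapping ticker -> group.
import pandas as pd

def ts_rank(df, d):
    # Rank of today's value within the trailing d-day window, per stock (normalized).
    d = int(round(d))
    return df.rolling(d).apply(lambda w: w.rank().iloc[-1] / len(w), raw=False)

def decay_linear(df, d):
    # Weighted moving average with linearly decaying weights (heaviest on the most recent day).
    d = int(round(d))
    weights = pd.Series(range(1, d + 1), dtype=float)
    weights /= weights.sum()
    return df.rolling(d).apply(lambda w: (w * weights.values).sum(), raw=False)

def ts_correlation(x, y, d):
    # Rolling d-day correlation between matching columns of two panels.
    return x.rolling(int(round(d))).corr(y)

def ind_neutralize(df, industry):
    # Demean each date cross-sectionally within industry groups.
    group_mean = df.T.groupby(industry).transform("mean").T
    return df - group_mean

def alpha_59(vwap, volume, industry):
    # (vwap * 0.728317) + (vwap * (1 - 0.728317)) simplifies to vwap itself.
    x = ind_neutralize(vwap, industry)
    return -1 * ts_rank(decay_linear(ts_correlation(x, volume, 4.25197), 16.2289), 8.19648)
```

Even written out, it is hard to see a human-readable "story" behind those particular weights and lookbacks, which is rather the point of the question.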
Where did it come from? Is there a "strategic intent" along the lines of that required by the Q fund (https://www.quantopian.com/fund)? Or did a HAL 9000 computer (perhaps with a little human guidance) extract the expression from a database of financial data?
And suppose I then combine 101 of these alphas? How do I articulate a "strategic intent" then? "Uh, well, umm...the machine sorted it out, and it works!"
From the conclusion to the paper:
Technological advances nowadays allow automation of alpha mining. Quantitative trading alphas are by far the most numerous of available trading signals that can be turned into trading strategies/portfolios. There are myriad permutations of individual stock holdings in a (dollar-neutral) portfolio of, e.g., 2,000 most liquid U.S. stocks that can result in a positive return on high- and mid-frequency time horizons. In addition, many of these alphas are ephemeral and their universe is very fluid. It takes quantitatively sophisticated, technologically well-endowed and ever-adapting trading operations to mine hundreds of thousands, millions and even billions of alphas and combine them into a unified “mega-alpha”, which is then traded with an added bonus of sizeable savings on execution costs due to automatic internal crossing of trades.
We're not talking about humans formulating old-school hypotheses and then testing them: "Hmm...I wonder if the price of rice in China is an alpha?" A machine needs to formulate the hypotheses, test them, and spit out the results. And if "many of these alphas are ephemeral and their universe is very fluid," again the effort would seem better suited to a machine than to a human. A rough sketch of what that loop might look like follows.
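This is purely hypothetical on my part, but the "machine formulates and tests hypotheses" loop could be as crude as: randomly compose candidate signals from a handful of primitives, score each by its daily rank correlation (information coefficient) with forward returns, and keep the survivors. The primitives, thresholds, and scoring here are all my own assumptions:

```python
# Toy alpha-mining loop: generate candidates, test them, spit out the keepers.
import random
import pandas as pd

PRIMITIVES = {
    "mom": lambda px, d: px.pct_change(d),                       # momentum
    "rev": lambda px, d: -px.pct_change(d),                      # reversal
    "vol": lambda px, d: -px.pct_change().rolling(d).std(),      # low volatility
}

def random_candidate():
    return random.choice(list(PRIMITIVES)), random.choice([2, 5, 10, 21, 63])

def information_coefficient(signal, fwd_returns):
    # Mean daily cross-sectional Spearman correlation between signal and forward return.
    return signal.corrwith(fwd_returns, axis=1, method="spearman").mean()

def mine_alphas(prices, n_candidates=1000, ic_threshold=0.02):
    fwd = prices.pct_change().shift(-1)   # next-day return
    keepers = []
    for _ in range(n_candidates):
        name, d = random_candidate()
        ic = information_coefficient(PRIMITIVES[name](prices, d), fwd)
        if ic > ic_threshold:
            keepers.append((name, d, ic))
    return sorted(keepers, key=lambda k: -k[2])
```

A real operation would of course search a far richer expression space, guard against multiple-testing bias, and re-run continuously as old alphas decay, but the workflow is recognizably machine-shaped rather than human-shaped.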
On http://blog.quantopian.com/a-professional-quant-equity-workflow/, we hear mention of ML, in the context of combining alphas:
Lastly, modern machine learning techniques can capture complex relationships between alphas. Translating your alphas into features and feeding these into a machine learning classifier is a popular vein of research.
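In that spirit, the "alphas as features" idea might look something like the sketch below: stack several alpha signals into a feature matrix, label each (date, stock) pair by whether its forward return beat the cross-sectional median, and fit a classifier. The horizon, labeling scheme, and model choice are my assumptions, not anything Q has published:

```python
# Hedged sketch: combine alpha signals with a machine learning classifier.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def build_dataset(alphas, prices, horizon=5):
    # alphas: dict of name -> DataFrame (dates x tickers); prices: DataFrame (dates x tickers).
    fwd = prices.pct_change(horizon).shift(-horizon)
    label = fwd.gt(fwd.median(axis=1), axis=0)                 # did the stock beat the median?
    X = pd.concat({name: a.stack() for name, a in alphas.items()}, axis=1)
    y = label.stack().reindex(X.index)
    mask = X.notna().all(axis=1) & y.notna()
    return X[mask], y[mask].astype(int)

def fit_alpha_combiner(alphas, prices):
    X, y = build_dataset(alphas, prices)
    return GradientBoostingClassifier().fit(X, y)
```

Note that this answers "how do I combine them," not "why do they work," which circles back to the strategic-intent question above.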
However, the use of ML in discovering alphas is not mentioned. The new Alphalens appears to be a manual tool for noodling on individual alphas, rather than a function that would plug into an ML algo churning over Q datasets, but maybe I'm missing something (https://www.quantopian.com/posts/alphalens-a-new-tool-for-analyzing-alpha-factors).
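By "manual" I mean something like the following, based on my reading of the public Alphalens API (exact argument names may differ by version): you hand it one factor at a time plus pricing data, and it hands back a tear sheet for a human to stare at.

```python
# Sketch of the one-factor-at-a-time Alphalens workflow, as I understand it.
import alphalens

def run_tear_sheet(factor, prices):
    # factor: Series with a (date, asset) MultiIndex; prices: DataFrame of dates x assets.
    factor_data = alphalens.utils.get_clean_factor_and_forward_returns(
        factor, prices, quantiles=5, periods=(1, 5, 10)
    )
    alphalens.tears.create_full_tear_sheet(factor_data)
```

That's a fine research tool, but it isn't the automated mine-and-combine pipeline the paper describes.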
It would be interesting to hear thoughts from the Q team on this topic. Within the proposed workflow, how would ML be implemented, potentially coupled with high-performance computing (HPC)? Is this combination viewed as a natural evolution for Q? If so, what might the platform look like?