Trouble Porting Machine Learning Strategy to Quantopian

Hi all,

I recently learned about Quantopian after taking a data mining course at university. I've been developing a screener, or classifier, for predicting daily stock movement and have seen some really great results so far. To understand how well the predictions would factor into a trading strategy, I've built a simple strategy that buys at market close if the stock is predicted to go up tomorrow, or sells at market close if it is predicted to go down. Then, the next day, it unwinds the position from the day before, makes a new prediction, and either buys or sells based on that prediction.
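For concreteness, the daily long/short scheme above can be sketched in plain pandas (this is just an illustration of the return arithmetic, not my actual code; the function names are made up):

```python
import pandas as pd

def strategy_returns(close, preds):
    """Close-to-close returns of the scheme described above: go long at
    the close when tomorrow's prediction is 1, short when it is 0, and
    unwind the position at the next close."""
    position = preds * 2 - 1                 # 1 -> +1 (long), 0 -> -1 (short)
    next_ret = close.pct_change().shift(-1)  # return realized tomorrow
    return position * next_ret
```

So a correct "down" prediction on a day the stock falls 10% earns +10% on that leg, before commissions and slippage.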

My motivation for coming to Quantopian was both to get a more rigorous calculation of return (including commission fees and the slippage/market impact of orders) and, hopefully, to use the platform for live trading in the future.

My process is rather simple:

1. I build a pandas DataFrame using get_data_yahoo for every ticker I want to experiment with.
2. I append multiple columns using pandas transforms, simple algebra, boolean expressions, etc. Most of these are essential to my model, so I will need some way to bring in my custom DataFrame-building function or rebuild the exact columns I have been working with in Quantopian.
3. I build a scikit-learn model that trains and tests on certain columns of the DataFrame, comparing the prediction vector to the true-value vector.
4. I have stored prediction vectors for historical data (1 for up, 0 for down) for multiple stocks.
5. I have text files that contain pickled sklearn models; these can make predictions on current or future data.

I just don't really know how to get started within Quantopian. I've read through the API documentation multiple times and even tried messing around with zipline in my development environment. No luck so far in getting the big picture.
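To give a flavor of the feature-building step, here is a minimal sketch of the kinds of derived columns described above (the column names and transforms are hypothetical examples, not my exact features):

```python
import pandas as pd

def add_features(df):
    """Append example derived columns to a price DataFrame with a
    'Close' column: a daily return, a rolling mean, a boolean flag,
    and a next-day up/down label for training."""
    out = df.copy()
    out["ret_1d"] = out["Close"].pct_change()
    out["ma_5"] = out["Close"].rolling(5).mean()
    out["above_ma"] = (out["Close"] > out["ma_5"]).astype(int)
    # label: 1 if tomorrow's return is positive, else 0
    out["up_tomorrow"] = (out["ret_1d"].shift(-1) > 0).astype(int)
    return out
```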

Any guidance would be very much appreciated.

4 responses

Spencer,
It sounds like what you are doing will be possible in Quantopian. The trick will be to make all of your external data available to the backtester: format your data as CSV tables and use fetch_csv() to import them.
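For example, the stored prediction vectors could be written out as a CSV with a date column, which is the shape fetch_csv() expects (the file name, column names, and URL below are assumptions for illustration):

```python
import pandas as pd

# Write the stored prediction vector (1 = up, 0 = down) to a CSV with a
# Date column so each row can be lined up with a backtest bar.
preds = pd.DataFrame({
    "Date": pd.date_range("2008-01-04", periods=3, freq="B").strftime("%Y-%m-%d"),
    "prediction": [1, 0, 1],
})
preds.to_csv("predictions.csv", index=False)

# Inside the Quantopian algorithm (sketch only; runs on the platform):
# def initialize(context):
#     fetch_csv("https://example.com/predictions.csv",
#               date_column="Date")
```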

Quantopian does not support pickle, so you will have to find a different way to handle the sklearn models; you could instead create and train them within the Quantopian environment, since sklearn is supported.
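Another workaround, if the model is linear: instead of pickling the fitted estimator, export its learned parameters (e.g. a linear model's coef_ and intercept_) as plain numbers in your CSV and rebuild the decision rule with numpy. A minimal sketch, assuming a linear classifier:

```python
import numpy as np

def predict_up(features, coef, intercept):
    """Reconstructed linear decision rule: predict 1 (up) if
    w . x + b > 0, else 0 (down). coef and intercept would come from
    the offline-fitted sklearn model's coef_ and intercept_."""
    score = np.dot(features, coef) + intercept
    return int(score > 0)
```

This sidesteps pickle entirely, at the cost of only working for models whose parameters are easy to serialize.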

It would be helpful to see how the data is formatted, along with any code you don't mind sharing; that way we can give you more specific suggestions.

David,

Thanks for the response. I just tried explaining the situation further and posting some code, but I realized that the biggest issue I am facing is a lack of understanding of the functions initialize(context) and handle_data(context, data), especially with regard to how my work so far will plug into them. Going further, I don't think I grasp what context is and what it can be used for. Can data be any two-dimensional array (i.e., can I use my custom-generated DataFrames as opposed to some matrix generated within Quantopian)? Those two functions and their arguments are the most basic building blocks of using Quantopian, and I don't feel they are explained fully in the documentation.

Again, any help would be appreciated. If someone can point me to more reading or even back to the documentation in case I missed anything, it would be helpful.

I'll try to give a down and dirty explanation of how the API works.

The API is event-driven. When an event occurs, it calls a function with the details of the event to see if any orders should be placed based on the new information. The function that gets called is handle_data(), and the events arrive as the data variable passed to handle_data().

The initialize function is called once at the beginning of a backtest. It is a place to set things that will be used throughout the backtest, like commission and slippage models.
The context object passed to initialize is a convenient place to store any parameters that will be used throughout the test; it gets passed to handle_data at each bar and also contains the up-to-date portfolio.

The idea is that handle_data gets called every time a new bar of data becomes available. The data passed by Quantopian contains the prices, volumes, etc. for that bar, but the fetch_csv() function can be used to import data from elsewhere so that it gets added to the data variable. The external file needs to be a CSV with a 'Date' column so the backtester knows which row to pass at each call to handle_data.
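Here is a toy simulation of that event loop, to make the shape of the two functions concrete. On Quantopian you only define initialize and handle_data; the platform supplies the context and data objects and runs the loop. The Context class, the driver loop, and the 'prediction' key below are all stand-ins of mine:

```python
class Context:
    """Stand-in for the object Quantopian passes to both functions."""
    pass

def initialize(context):
    # Runs once at the start; store any cross-bar state on context.
    context.bar_count = 0
    context.last_prediction = None

def handle_data(context, data):
    # Runs once per bar; 'data' carries that bar's values, including
    # any columns merged in via fetch_csv.
    context.bar_count += 1
    context.last_prediction = data.get("prediction")

# Toy driver standing in for the backtester's event loop:
ctx = Context()
initialize(ctx)
for bar in [{"price": 100, "prediction": 1},
            {"price": 101, "prediction": 0}]:
    handle_data(ctx, bar)
```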

handle_data and initialize are the only two functions that have to be defined in Quantopian, but you can make all the functions and classes you want beyond those two.

To answer your question, no, you cannot override the data passed to handle_data, but you can add to it using the fetch_csv function. Hopefully this wasn't too simplistic of an explanation.

Agree with David. I just put together a very simple example to show the event-driven behavior. As shown in the log, the initialize function sets up all the background information for the algorithm, and then for each day or each minute (depending on which mode you choose), handle_data() is called.

1970-01-01 initialize:4 DEBUG The algo is starting and this function will run once
2008-01-04 handle_data:13 DEBUG The handle_data is being called for the 1 times
2008-01-07 handle_data:13 DEBUG The handle_data is being called for the 2 times
2008-01-08 handle_data:13 DEBUG The handle_data is being called for the 3 times
2008-01-09 handle_data:13 DEBUG The handle_data is being called for the 4 times
2008-01-10 handle_data:13 DEBUG The handle_data is being called for the 5 times
2008-01-11 handle_data:13 DEBUG The handle_data is being called for the 6 times
2008-01-14 handle_data:13 DEBUG The handle_data is being called for the 7 times
2008-01-15 handle_data:13 DEBUG The handle_data is being called for the 8 times
2008-01-16 handle_data:13 DEBUG The handle_data is being called for the 9 times
2008-01-17 handle_data:13 DEBUG The handle_data is being called for the 10 times
2008-01-18 handle_data:13 DEBUG The handle_data is being called for the 11 times
2008-01-22 handle_data:13 DEBUG The handle_data is being called for the 12 times
2008-01-23 handle_data:13 DEBUG The handle_data is being called for the 13 times

The only problem here is that the pickled sklearn models will be pretty hard to import.