I found this blog post by the folks at BigML: http://blog.bigml.com/2013/03/12/machine-learning-from-streaming-data-two-problems-two-solutions-two-concerns-and-two-lessons/ which I think is very relevant to the Quantopian community.
Among other things, it makes the observation that there are two paths when trying to apply machine learning methods to streaming data (c&p from the post):
- Incremental Algorithms: These are machine learning algorithms that learn incrementally over the data. That is, the classifier is updated each time it sees a new training instance. There are incremental versions of Support Vector Machines and Neural networks. Bayesian Networks can be made to learn incrementally.
- Periodic Re-training with a batch algorithm: Perhaps the more straightforward solution. Here, we simply buffer the relevant data and retrain our predictor “every so often”.
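To make the first option concrete, here is a toy sketch of an incremental learner: a plain perceptron that updates its weights one instance at a time, so no history needs to be buffered. (This is my own hypothetical illustration, not code from the post.)

```python
# Incremental learning: the classifier is updated each time it sees a
# single new training instance, rather than refitting on a stored batch.

class IncrementalPerceptron:
    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        s = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if s >= 0 else -1

    def update(self, x, y):
        # Classic perceptron rule: adjust weights only on a mistake.
        if self.predict(x) != y:
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * y

# Streaming usage: the model is updated one observation at a time.
model = IncrementalPerceptron(n_features=2)
stream = [([1.0, 1.0], 1), ([-1.0, -1.0], -1),
          ([2.0, 1.5], 1), ([-1.5, -0.5], -1)]
for x, y in stream:
    model.update(x, y)
```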
On Quantopian, most shared algorithms take the second approach, and that is really what the batch_transform was built for: it lets you specify how often you want to retrain your model via the refresh_period argument.
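The buffering-and-periodic-refit pattern can be sketched in plain Python like this. The parameter names deliberately mirror batch_transform's window_length and refresh_period, but this is a standalone illustration, not the Quantopian API itself.

```python
from collections import deque

class PeriodicRetrainer:
    """Buffer the last `window_length` observations and refit a batch
    model only every `refresh_period` bars ("every so often")."""

    def __init__(self, fit_fn, window_length=5, refresh_period=3):
        self.fit_fn = fit_fn                  # batch-fitting function
        self.buffer = deque(maxlen=window_length)
        self.refresh_period = refresh_period
        self.bars_since_fit = 0
        self.model = None

    def handle_data(self, observation):
        self.buffer.append(observation)
        self.bars_since_fit += 1
        # Refit on the buffered window when the refresh period elapses.
        if self.model is None or self.bars_since_fit >= self.refresh_period:
            self.model = self.fit_fn(list(self.buffer))
            self.bars_since_fit = 0
        return self.model

# Example: the "model" is just the rolling mean of the buffered window.
retrainer = PeriodicRetrainer(fit_fn=lambda xs: sum(xs) / len(xs))
for price in [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]:
    current_model = retrainer.handle_data(price)
```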
I think the first approach has a lot of potential as well. I sorta did a mixture of the two approaches with the HMM algorithm, which uses the previously learned model parameters as a prior when updating. In general, Bayesian methods where you can take the posterior of your trained model as the prior when you retrain ("yesterday's posteriors are today's priors") seem very amenable to this.
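Here is a minimal toy example of the "yesterday's posteriors are today's priors" idea using a conjugate Beta-Bernoulli model: each day's posterior over an up-move probability is carried forward as the next day's prior. (This is a hypothetical illustration of the general principle, not the HMM algorithm mentioned above.)

```python
def update_beta(alpha, beta, outcomes):
    """Conjugate Beta-Bernoulli update: successes add to alpha,
    failures add to beta."""
    ups = sum(outcomes)
    return alpha + ups, beta + len(outcomes) - ups

# Start from a flat Beta(1, 1) prior.
alpha, beta = 1.0, 1.0

# Each inner list is one day's binary outcomes (1 = up move).
daily_outcomes = [[1, 1, 0], [1, 0, 0, 1], [1, 1]]
for outcomes in daily_outcomes:
    # The previous day's posterior (alpha, beta) is today's prior.
    alpha, beta = update_beta(alpha, beta, outcomes)

posterior_mean = alpha / (alpha + beta)  # estimated up-move probability
```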