Feature scaling for multiple assets with sklearn/random forest

Hoping for some advice on training a single Random Forest classifier on 200+ assets to predict the direction of the next bar/candle. This is a learning experiment where accuracy is not the primary goal; I'm more interested in how to tackle the data processing for building a single generalised model across multiple assets.

I have a dozen or so features derived from price (simple technical indicators and statistical calculations); the label/prediction is the direction of the next candle/bar.

  1. What is the preferred way to scale the data (standardize/normalize/log, etc.) across all the asset features? Price, volatility, and returns are wildly different between assets, e.g. Asset 1 trades at $0.30 with a ±10% stddev of returns while Asset 2 trades at $2,500 with a ±2% stddev of returns.
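For reference, here's a minimal sketch of the kind of per-asset feature preparation I mean (the column names and rolling windows are placeholders, not my actual setup):

    import pandas as pd

    def make_features(prices: pd.Series) -> pd.DataFrame:
        # prices: close prices for one asset, indexed by bar timestamp
        returns = prices.pct_change()
        feats = pd.DataFrame({
            "ret_1": returns,                     # scale-free 1-bar return
            "ret_5": prices.pct_change(5),        # 5-bar return
            "vol_20": returns.rolling(20).std(),  # rolling volatility
            "zscore_20": (prices - prices.rolling(20).mean())
                         / prices.rolling(20).std(),  # price z-score
        })
        # label: direction of the NEXT bar (1 = up, 0 = down/flat)
        feats["label"] = (returns.shift(-1) > 0).astype(int)
        return feats.dropna()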

Many thanks,
J

4 responses

You can use sklearn's StandardScaler, but you shouldn't need to: a Random Forest splits your data to maximize information gain, and each split compares a single feature against a threshold, so monotonic rescaling of a feature doesn't change the splits.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
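If you do want to scale anyway, here's a minimal sketch of StandardScaler in a Pipeline in front of the forest (toy data stands in for your features; the hyperparameters are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # toy stand-in for your (features, next-bar-direction) data
    X, y = make_classification(n_samples=1000, n_features=12, random_state=0)
    # no shuffling, mirroring a chronological train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

    model = make_pipeline(
        StandardScaler(),  # harmless for trees, just usually unnecessary
        RandomForestClassifier(n_estimators=200, random_state=0),
    )
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))

One side benefit of the Pipeline: the scaler's mean/std are fit on the training split only, so no test-set statistics leak into training.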

Thanks Robin. If I didn't standardize as suggested and left price (or returns) as absolute values for all 200 different asset classes, wouldn't that create a large number of unnecessary splits within the tree? On a single asset class it would be fine not to standardize with RF; I'm just not sure about 200.
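For example, I was considering z-scoring each feature within each asset before pooling everything into one training set, something like this (just a sketch, assuming a long-format DataFrame with an "asset" column):

    import pandas as pd

    def standardize_per_asset(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
        # df: one row per (asset, bar), with an "asset" column plus features.
        # Z-score each feature within its own asset so all assets share a scale.
        # NB: full-sample mean/std leak future information; in a live setting
        # you'd compute these on a rolling or expanding window instead.
        out = df.copy()
        grouped = out.groupby("asset")[feature_cols]
        out[feature_cols] = (
            (out[feature_cols] - grouped.transform("mean"))
            / grouped.transform("std")
        )
        return out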

I'm not sure what you mean by unnecessary splits - can you share the rest of your code?

@Robin, can you share some code showing how you use a random forest?