How to construct good training sets for supervised machine learning algorithms?

If I want to train a supervised machine learning algorithm to predict wins/losses of a stock based on fundamental data available at the time a position is entered, I need a good training set. To construct it, feature vectors of fundamental data have to be mapped to discrete or continuous win/loss target values. Now the question is: what is a good strategy for constructing such a dataset? Say I take a set of fundamental data for a universe of stocks as the feature vector: which target values do I map these feature vectors to? More precisely: do I take the close of the next trading day, the close of the next week or month, or an average of all of these?

Or, in pseudocode:

for each day in history  
       for each stock in universe  
              feature_vector = CalculateFeatureVector(day, stock)  
              target_value = CalculateTargetValue(day, stock)  
              AddInstanceToTrainingSet(feature_vector, target_value)  

I am looking for good strategies to implement

 CalculateTargetValue(day,stock)  

I am sure this is a fairly general question, with answers that start with "It depends...". But what I am looking for here are some rules, strategies, best practices and/or examples that may help in constructing supervised training sets.
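
To make the question concrete, here is a minimal Python sketch of one candidate: the forward return over a fixed horizon, optionally turned into a win/loss label. This is only an illustration, not a recommendation. The names `calculate_target_value` and `build_training_set`, the `horizon` and `discrete` parameters, and the dict-of-lists data layout are all assumptions made for the example. Which horizon to use (1 day, 5 days, 21 days, or some average) is exactly the open question.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def calculate_target_value(closes: Sequence[float], day: int,
                           horizon: int = 5,
                           discrete: bool = False) -> Optional[float]:
    """Forward return of one stock from `day` to `day + horizon`.

    `closes` holds the stock's daily closes in chronological order and
    `day` is an index into it (hypothetical layout). Returns None when
    the future close does not exist yet, so the caller can skip the row.
    """
    if day < 0 or day + horizon >= len(closes):
        return None
    forward_return = closes[day + horizon] / closes[day] - 1.0
    if discrete:
        # Win = 1.0 if the price rose over the horizon, else loss = 0.0.
        return 1.0 if forward_return > 0 else 0.0
    return forward_return


def build_training_set(features: Dict[str, List[List[float]]],
                       closes: Dict[str, List[float]],
                       horizon: int = 5,
                       discrete: bool = False
                       ) -> List[Tuple[List[float], float]]:
    """Pair each (stock, day) feature vector with its target value."""
    training_set = []
    for stock, series in closes.items():
        for day in range(len(series)):
            target = calculate_target_value(series, day, horizon, discrete)
            if target is None:
                continue  # no future close available for this day
            training_set.append((features[stock][day], target))
    return training_set


# Toy data: one stock, four days, one feature per day.
closes = {"AAA": [10.0, 11.0, 12.0, 9.0]}
features = {"AAA": [[1.0], [2.0], [3.0], [4.0]]}
print(build_training_set(features, closes, horizon=1, discrete=True))
```

Note that the last `horizon` days of each stock are dropped, because their target would need prices that are not in the history yet. The features for a given day must likewise use only data known on that day, otherwise the training set leaks future information.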