When testing single factors, it seems like best practice would be to reserve a “test set” for the final model once multiple factors have been combined. This way, you can be confident that your model has not “seen” the out-of-sample data. However, without seeing the performance of individual factors in the test set, it becomes difficult to distinguish robust factors from overfit or decayed ones.
If the model performs poorly, one may be tempted to throw out all of its factors in future development efforts in order to avoid multiple comparisons bias. If the model performs well, poor factors that are weighing it down may be kept even though they have no predictive value. What is best practice for striking the balance between getting good individual-factor attribution data in the test set and keeping a clean holdout set for the multi-factor test?
(I elaborate on this question below in the hope that it will be a bit clearer. I also propose a possible, albeit unsatisfying, solution to the problem.)
A simplified process of developing a multi-factor model may look like this (a rough code sketch follows the list):
1. Develop and tweak a single alpha factor on a “training set”.
2. Repeat for any number of factors.
3. Combine factors in training set.
4. If it performs well, run combined model on a “test set”.
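For concreteness, here is a minimal sketch of steps 1–4 under some illustrative assumptions: each factor is summarised by a per-period long-short return stream in a hypothetical `factor_returns` DataFrame (dates × factors), and the Sharpe-proportional weighting is just a stand-in for whatever combination method is actually used.

```python
import numpy as np
import pandas as pd

def chronological_split(factor_returns: pd.DataFrame, test_frac: float = 0.3):
    """Hold out the final test_frac of the history as the untouched test set."""
    cutoff = int(len(factor_returns) * (1 - test_frac))
    return factor_returns.iloc[:cutoff], factor_returns.iloc[cutoff:]

def sharpe(returns: pd.Series, periods_per_year: int = 252) -> float:
    """Annualised Sharpe ratio (ignoring the risk-free rate for simplicity)."""
    return np.sqrt(periods_per_year) * returns.mean() / returns.std()

def fit_weights_in_sample(train: pd.DataFrame) -> pd.Series:
    """Steps 1-3: choose combination weights using training data only.
    Here, naively proportional to each factor's in-sample Sharpe."""
    w = train.apply(sharpe).clip(lower=0.0)  # drop factors with negative in-sample Sharpe
    return w / w.sum()

# Step 4: a single, final evaluation of the combined model on the held-out data.
# train, test = chronological_split(factor_returns)
# weights = fit_weights_in_sample(train)
# oos_sharpe = sharpe(test @ weights)
```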
After step 4, if the model performs poorly, you must throw out the model and go back to the drawing board to avoid multiple comparisons bias. However, if you start from scratch, you risk throwing out good factors that were being weighed down by other factors that either decayed due to arbitrage or were overfit in the first place.
An alternative would be to test each factor individually on the out-of-sample data (after tweaking and testing on the training data). This way, you keep the factors that hold up and throw the others into the trash. However, once you move on to the factor combination step, it will most likely look good no matter how you combine the factors, because the individual factors have already “seen” the out-of-sample data. Therefore, you have lost the out-of-sample data for the factor combination step.
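A sketch of that per-factor out-of-sample check, under the same hypothetical `factor_returns` setup and `sharpe()` helper as above (the 0.5 cutoff is arbitrary and purely illustrative):

```python
import numpy as np
import pandas as pd

def sharpe(returns: pd.Series, periods_per_year: int = 252) -> float:
    return np.sqrt(periods_per_year) * returns.mean() / returns.std()

def per_factor_oos_report(train: pd.DataFrame, test: pd.DataFrame,
                          min_oos_sharpe: float = 0.5) -> pd.DataFrame:
    """Compare each factor's in-sample and out-of-sample Sharpe and flag survivors.
    Caveat from the text: any factor kept this way has now 'seen' the test data."""
    report = pd.DataFrame({
        "is_sharpe": train.apply(sharpe),
        "oos_sharpe": test.apply(sharpe),
    })
    report["keep"] = report["oos_sharpe"] >= min_oos_sharpe
    return report
```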
If this alternative approach is taken, maybe it makes sense to set a higher benchmark for the factor combination step to beat. Instead of raw risk-adjusted performance, maybe the benchmark is an equal-weighted combination of the individual factors. Any combination methodology (mean-variance optimization, machine learning, etc.) must then beat the equal-weighted combination in the out-of-sample step. (I’m not sure this approach is entirely satisfying either.)
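A sketch of that hurdle, continuing the same hypothetical setup: the chosen combination only “passes” the out-of-sample step if it beats a naive 1/N mix of the same surviving factors.

```python
import numpy as np
import pandas as pd

def sharpe(returns: pd.Series, periods_per_year: int = 252) -> float:
    return np.sqrt(periods_per_year) * returns.mean() / returns.std()

def beats_equal_weight(test: pd.DataFrame, optimised_weights: pd.Series) -> bool:
    """True if the optimised combination beats an equal-weighted benchmark
    of the same factors on the held-out data."""
    equal_weights = pd.Series(1.0 / test.shape[1], index=test.columns)
    return sharpe(test @ optimised_weights) > sharpe(test @ equal_weights)
```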
In the end, my question is the following:
How do you balance the trade-off between 1) keeping a clean out-of-sample test set for the final factor-combination/algo-backtest step, and 2) having individual factor attribution data, so that you don’t keep using decayed or data-mined factors in future models, and/or so that you don’t throw out good factors because of a poor multi-factor test?