Data Structures for Financial Machine Learning
This is the second post in our series exploring Lopez de Prado's book "Advances in Financial Machine Learning". If you haven't yet, read the Introduction to "Advances in Financial Machine Learning" by Lopez de Prado.
Also note that these posts are explorations of the methods and how they can be implemented on Quantopian. They are not prescriptions for how to develop trading algorithms on the platform, though they may someday lead to that.
1: Introduction
Our goal, at a basic level, is to predict the future. This is a task of immense complexity, and in a strict sense it is impossible. A much easier endeavor is predicting what is. An example of this is facial recognition: based on a model of human faces, we can detect and label individuals. This model is constructed from a set of parameters that likely have little intuitive meaning on their own, but combined can recreate any human face.
Now imagine that we live in a science-fiction world, millions or billions of years in the future. The human race has evolved in... interesting ways, and we now have distinctly different faces (four eyes, trunks for noses, take your pick). A facial recognition algorithm from the 21st century would be useless, as the true parameters describing human faces have changed over time. In financial data, this process is rapid and never-ending. This is why traditional out-of-the-box machine learning algorithms struggle on financial data: they learn what is, not what will be.
There are two key solutions to this problem: training on nearly stationary data, and rapid iteration through the validation, implementation, and decommissioning process. Facial recognition algorithms pick up on features that are relatively consistent for individuals over time, such as bone structure, rather than those that change, like facial hair or complexion. In financial data, these stable features are less obvious, but Lopez de Prado presents strong arguments that they can be found. As for rapid iteration: why would you keep running an algorithm whose alpha has decayed to nearly zero? Returning to the facial recognition example, it might be time to retrain your model if accuracy has dropped after a few million years of human evolution. The situation is similar in finance, though with dozens of temporarily profitable models running in parallel.
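To make "nearly stationary" a little more concrete, here is a minimal sketch (not from the book) of testing a series for stationarity with the Augmented Dickey-Fuller test, assuming the pandas and statsmodels libraries are available. The random-walk series below is a synthetic stand-in for real price data; the point is simply that price levels typically fail the test while returns pass it, which is one reason models are usually trained on some transformation of prices rather than on prices themselves.

```python
# A minimal sketch: comparing the stationarity of price levels against
# returns using the Augmented Dickey-Fuller (ADF) test.
# The series below is synthetic; substitute real pricing data in practice.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

np.random.seed(0)
# Simulate a geometric random walk, a common toy model for prices.
prices = pd.Series(100 * np.exp(np.cumsum(np.random.normal(0, 0.01, 1000))))

# Price levels contain a stochastic trend, so the ADF test should fail to
# reject the unit-root (non-stationary) null hypothesis.
pvalue_prices = adfuller(prices)[1]

# Differencing into returns removes the trend, so the test should reject
# the null and flag the series as (nearly) stationary.
pvalue_returns = adfuller(prices.pct_change().dropna())[1]

print("ADF p-value, prices:  %.3f" % pvalue_prices)   # typically well above 0.05
print("ADF p-value, returns: %.3f" % pvalue_returns)  # typically near 0.000
```

Plain differencing is only the crudest fix, since it throws away the series' memory along with its trend; the book's more nuanced answer, fractional differentiation, aims for stationarity while preserving as much memory as possible, and is a topic for a later chapter.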
For the rest of this post, please see the attached notebook!