This is part 3 of our series on ideas from the book Advances in Financial Machine Learning by Lopez de Prado. Previous posts:
- Part 1: Introduction to "Advances in Financial Machine Learning" by Lopez de Prado
- Part 2: Data structures for financial machine learning.
Labeling data is the practice of assigning classifications to your training data so that the model you are training can assign, or predict, labels for new data. An example would be an image recognition program that tries to determine which animal is in an image. The training set must be a large collection of labeled photos of animals: every picture of a cat labeled as a cat, every zebra as a zebra, and so on. This allows our machine learning techniques to derive a general model for each class (each uniquely labeled animal). Once trained, the algorithm assigns labels to each image with varying degrees of confidence, and the result with the greatest assigned probability is checked against the true class. If the model achieves high accuracy there, it can then be used on truly raw, unlabeled data. This is the phase that Facebook's facial recognition algorithm is in, as it assigns labels (names of people) to every photo you upload.
Methods that rely on labeled data generally fall under the umbrella of 'supervised learning' (so called because the learning process involves validation against some objective truth). Inference techniques for unlabeled data do exist, known as unsupervised learning, but they are generally less powerful; Max Margenot provides a quick explanation in a YouTube video here. So we clearly want to use supervised learning methods, and that requires labeled data. How should financial data be labeled to best be fed into a machine learning algorithm? What classes do we want? And what would be a useful way to apply these classes to raw, live data to decide whether a trade should be made?
Adapted from Dr. Lopez de Prado's 'Advances in Financial Machine Learning,' this notebook presents an intuitive method to classify directional time series returns while preserving the notion of stop-loss and profit-taking limit orders: the Triple-Barrier Method.
The 'Triple-Barrier' refers to the strategy's two horizontal and one vertical 'barriers' (price-based and time-based sell signals). The horizontal barriers are set at multiples of an exponentially weighted moving standard deviation of prices (a rolling estimate of daily volatility), and the vertical barrier is placed a fixed number of days after the purchase date.
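As a rough illustration of the volatility input for the horizontal barriers (not the book's exact code), the estimate might be computed along the following lines. The function name `get_daily_vol` and the `span` default are my own choices, and the dispersion is measured on approximate daily returns, in the spirit of the book's snippet 3.1:

```python
import pandas as pd


def get_daily_vol(close, span=100):
    """Exponentially weighted moving std of approximate daily returns.

    close : pandas Series of prices indexed by a sorted DatetimeIndex.
    span  : decay span for the EWM standard deviation (illustrative default).
    """
    # For each timestamp, locate the closest earlier price at least one day old
    prev_idx = close.index.searchsorted(close.index - pd.Timedelta(days=1))
    prev_idx = prev_idx[prev_idx > 0]
    prev_ts = close.index[prev_idx - 1]
    curr_ts = close.index[len(close) - len(prev_ts):]
    # Approximate daily returns, then smooth their dispersion with an EWM std
    daily_ret = pd.Series(
        close.loc[curr_ts].values / close.loc[prev_ts].values - 1.0,
        index=curr_ts,
    )
    return daily_ret.ewm(span=span).std()
```

The resulting series gives, for each timestamp, a volatility figure that can be multiplied by a chosen width to place the profit-taking and stop-loss barriers.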
The method outputs a value in the range of -1 to 1 for each purchase date and security given, depending on which barrier is hit first. If the top barrier is hit first, the value is set to 1, because a profit would have been made. If the bottom barrier is hit first, losses would be locked in and the value is set to -1. If the trade times out before either limit is broken and the vertical barrier is hit instead, the value is set to a number in (-1, 1), scaled by how close the final price came to a horizontal barrier (alternatively, if you only want to label sufficiently large price moves, output 0 here instead; experiment and see what makes sense for your algo).
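To make these rules concrete, here is a minimal sketch of how a single hypothetical purchase could be labeled. The function name, its `width` and `max_days` arguments, and the scaling used in the timeout case are illustrative assumptions, not the book's exact implementation:

```python
import numpy as np
import pandas as pd


def triple_barrier_label(close, start, daily_vol, width=2.0, max_days=10):
    """Label a hypothetical purchase of one security at `start`.

    close     : pandas Series of prices indexed by a sorted DatetimeIndex
    start     : purchase timestamp (must be present in close.index)
    daily_vol : volatility estimate at `start` (e.g. from get_daily_vol)
    width     : horizontal barrier width, as a multiple of daily_vol
    max_days  : vertical barrier, in days after the purchase
    """
    entry = close.loc[start]
    upper = entry * (1 + width * daily_vol)    # profit-taking barrier
    lower = entry * (1 - width * daily_vol)    # stop-loss barrier
    end = start + pd.Timedelta(days=max_days)  # vertical (time) barrier

    path = close.loc[start:end]
    # Walk the bars after the purchase and see which barrier is crossed first
    for price in path.iloc[1:]:
        if price >= upper:
            return 1.0    # top barrier hit first: profit locked in
        if price <= lower:
            return -1.0   # bottom barrier hit first: loss locked in

    # Timed out: scale the label by how far the final price moved toward a
    # horizontal barrier (an alternative is simply to return 0 here)
    final_ret = path.iloc[-1] / entry - 1.0
    return float(np.clip(final_ret / (width * daily_vol), -1.0, 1.0))
```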
This method accepts as input a price series for some security and assigns the appropriate label assuming the security had been bought on a given date. It iterates through all bars, up to the vertical barrier a set number of days from the start, and checks whether a barrier was crossed, then assigns the appropriate label as described above. This amounts to one observation: that purchasing this specific security at that specific time would have been profitable or unprofitable with those limits assigned. We then perform this same process for every possible trade initiation on every security in our tradable universe. See our previous post on how these trade initiations can be selected.

Note that this method is extremely computationally intensive, and Dr. Lopez de Prado recommends using parallel computing techniques and literal supercomputers at United States National Laboratories. Unless you happen to have access to the latter, there are some options to simplify the problem. We can use lower-frequency data (daily bars, or alternative bar types built on a daily frequency), examine a smaller universe of stocks, or artificially limit ourselves to analyzing, say, every other timestamp. We can also implement basic parallel computing techniques fairly simply, even on a laptop. Unfortunately, if you want to stay strictly within the Quantopian research platform for the development of your algorithm, you're going to have to cope with long runtimes.
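As a sketch of how the labeling could be applied across many trade initiations and parallelized with nothing more than the standard library, the snippet below reuses the `get_daily_vol` and `triple_barrier_label` sketches above and farms the independent per-trade jobs out to a multiprocessing pool. The `prices` and `events` containers, and every name in the snippet, are hypothetical:

```python
from multiprocessing import Pool

import pandas as pd


def label_one(task):
    """Worker: label a single (symbol, purchase date) pair."""
    symbol, close, start, vol = task
    return symbol, start, triple_barrier_label(close, start, vol)


def label_universe(prices, events, n_workers=4):
    """prices : dict mapping symbol -> price Series (hypothetical container)
       events : dict mapping symbol -> iterable of purchase timestamps
    Reuses get_daily_vol and triple_barrier_label from the sketches above."""
    tasks = []
    for symbol, close in prices.items():
        vol = get_daily_vol(close)
        for start in events.get(symbol, []):
            if start in vol.index:
                tasks.append((symbol, close, start, vol.loc[start]))
    # Each labeling job is independent, so a small process pool splits the work
    with Pool(n_workers) as pool:
        results = pool.map(label_one, tasks)
    return pd.DataFrame(results, columns=['symbol', 'start', 'label'])
```

If you run something like this in a script (particularly on Windows), the pool call should sit behind an `if __name__ == '__main__':` guard; in an environment where multiprocessing may not be available, such as a hosted research notebook, the same loop can simply run serially.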