Labeling Data for Financial Machine Learning

This is part 3 of our series on ideas from the book Advances in Financial Machine Learning by Lopez de Prado. Previous posts:

Labeling data is the practice of assigning classifications to your training data so that the model you are training can assign, or predict, labels for new data. An example would be an image recognition program that tries to determine what animal is in an image. The training set must be a large collection of labeled photos of animals: every picture of a cat labeled as a cat, every zebra as a zebra, and so on. This allows our machine learning techniques to derive a general model for each class (each uniquely labeled animal). Once trained, the algorithm will assign labels to each image with varying degrees of confidence, and the result with the greatest assigned probability will be checked against the true class. If the model achieves high accuracy here, it can then be used on truly raw, unlabeled data. This is the phase that Facebook's facial recognition algorithm is in, as it assigns labels (names of people) to every photo you upload.


[Image source: TensorFlow]

Methods that rely on labeled data generally fall under the umbrella of 'supervised learning' (so called because training is guided by known, correct answers). Techniques that draw inferences from unlabeled data also exist, known as unsupervised learning, but they are generally less powerful. Max Margenot provides a quick explanation in a YouTube video here. So we clearly want to use supervised learning methods, and that requires labeled data. How should financial data be labeled to best feed a machine learning algorithm? What classes do we want? And what would be a useful way to apply those classes to raw, live data to determine whether a trade should be made?

Adapted from Dr. Lopez de Prado's 'Advances in Financial Machine Learning,' this notebook presents an intuitive method for classifying directional time-series returns while maintaining the notion of stop-loss and profit-taking limit orders: the Triple-Barrier Method.

The 'Triple-Barrier' refers to the method's two horizontal and one vertical 'barriers' (price- or time-based sell signals). The horizontal barriers are set at multiples of an exponentially weighted moving standard deviation of daily returns (a long-term approximation of daily volatility), and the vertical barrier is placed a fixed number of days after the purchase date.
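As a rough illustration, here is a minimal sketch of how the three barriers might be constructed, assuming a pandas Series of daily closes and a precomputed daily-volatility Series. The function name, default multipliers, and ten-bar horizon are illustrative, not the exact code in the attached notebook.

```python
import pandas as pd

def get_barriers(close, t0, daily_vol, upper_mult=2.0, lower_mult=2.0, t_final=10):
    """Barrier levels for a hypothetical purchase of `close` at date t0.

    `daily_vol` is a precomputed Series of daily-volatility estimates
    (e.g. an exponentially weighted std of daily returns).
    """
    entry = close.loc[t0]
    sigma = daily_vol.loc[t0]
    upper = entry * (1 + upper_mult * sigma)   # profit-taking level
    lower = entry * (1 - lower_mult * sigma)   # stop-loss level
    # Vertical barrier: t_final trading bars after entry (clipped to data end).
    loc = close.index.get_loc(t0)
    t1 = close.index[min(loc + t_final, len(close.index) - 1)]
    return upper, lower, t1
```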

The method outputs a value in the range [-1, 1] for each purchase date and security, depending on which barrier is hit first. If the top barrier is hit first, the label is 1, because a profit would have been made. If the bottom barrier is reached first, losses would be locked in, and the label is -1. If the position times out before either limit is breached and the vertical barrier is hit, the label is a value in (-1, 1) scaled by how close the final price came to the barriers (alternatively, if you want to label only sufficiently large price changes, you can output 0 here; experiment and see what makes sense for your algo).
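Here is a sketch of that labeling rule, reusing the hypothetical get_barriers helper above. Note that comparing closing prices only is a simplification; see the discussion of intrabar moves in the responses below.

```python
def triple_barrier_label(close, t0, daily_vol, upper_mult=2.0, lower_mult=2.0, t_final=10):
    upper, lower, t1 = get_barriers(close, t0, daily_vol,
                                    upper_mult, lower_mult, t_final)
    path = close.loc[t0:t1]
    for price in path:
        if price >= upper:
            return 1.0    # profit-taking barrier touched first
        if price <= lower:
            return -1.0   # stop-loss barrier touched first
    # Vertical barrier hit first: scale the final price's position
    # between the horizontal barriers into (-1, 1).
    # (Returning 0 here instead is the alternative mentioned above.)
    final = path.iloc[-1]
    return 2.0 * (final - lower) / (upper - lower) - 1.0
```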

This method accepts as input a price series for some security and assigns the appropriate label assuming the security had been bought on a given date. It iterates through all bars, up to the vertical barrier a set number of days from the start, and checks whether a barrier was crossed, assigning the appropriate label as described above. This amounts to one observation: that purchasing this specific security at that specific time would have been profitable or unprofitable with those limits in place. We then perform the same process for every possible trade initiation on every security in our tradable universe. See our previous post on how these trade initiations can be selected.

Note that this method is extremely computationally intensive; Dr. Lopez de Prado recommends parallel computing techniques and, quite literally, supercomputers at United States national laboratories. Unless you happen to have access to the latter, there are some options to simplify the problem: use lower-frequency data (daily bars, or alternative bar types built on a daily frequency), examine a smaller universe of stocks, or artificially thin the timestamps analyzed (every other one, say). We can also implement basic parallel computing fairly simply, even on a laptop, as sketched below. Unfortunately, if you want to stay strictly within the Quantopian research platform while developing your algorithm, you will have to cope with long runtimes.
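For example, here is a minimal sketch of per-security parallelism using Python's standard multiprocessing module, reusing the hypothetical helpers above. This will not run inside the hosted research environment, and the synthetic price data is purely illustrative.

```python
import numpy as np
import pandas as pd
from functools import partial
from multiprocessing import Pool

def label_security(symbol, prices, span=100, t_final=10):
    # Label every possible entry date for one security; one worker per symbol.
    close = prices[symbol]
    daily_vol = close.pct_change().ewm(span=span).std()
    labels = {}
    for t0 in close.index[:-t_final]:
        if pd.notna(daily_vol.loc[t0]):   # skip bars with no vol estimate yet
            labels[t0] = triple_barrier_label(close, t0, daily_vol, t_final=t_final)
    return symbol, pd.Series(labels)

if __name__ == '__main__':
    # Synthetic daily closes for two fake symbols, for demonstration only.
    rng = np.random.default_rng(0)
    idx = pd.bdate_range('2020-01-01', periods=252)
    prices = {s: pd.Series(100 * np.exp(rng.normal(0, 0.01, len(idx)).cumsum()), index=idx)
              for s in ('AAA', 'BBB')}
    with Pool() as pool:
        results = dict(pool.map(partial(label_security, prices=prices), sorted(prices)))
    print({s: lab.head() for s, lab in results.items()})
```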


7 responses

I'll have to read this in more detail later, but at first glance, it looks like a useful implementation of the triple barrier method!

Hi Anthony,

Awesome work. I am going to share a more efficient implementation in a later post, but I have a few more things I want to check before I post it.

I had a couple of notes regarding the get_daily_vol function you use. I believe de Prado used this function in his book for calculating daily volatility at "intraday estimation points". I have not tested it on intraday data yet, but on daily data it does not give the desired results: as written, the function actually computes the exponentially weighted volatility of 2-day returns (except over weekends, where it uses a 1-business-day return).

In the attached notebook, I count the business days between the dates used in the daily return calculations to illustrate this. I also included a simple modified version of the function to use on daily data.
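Roughly, the daily-data fix boils down to something like this (a sketch; the exact code is in the attached notebook):

```python
import pandas as pd

def get_daily_vol_from_daily_bars(close, span=100):
    # On daily bars, pct_change(1) is already a one-trading-day return,
    # so no calendar-day index arithmetic is needed.
    returns = close.pct_change()
    return returns.ewm(span=span).std()
```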

Last, I think you just made a typo in saying that the volatility weighting defaults to a half-life of 20 days. You actually set the span to default to 20, which is approximately equivalent to a 7-day half-life.
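For reference, pandas relates both parameterizations through the decay factor alpha: a span of s implies alpha = 2/(s+1), and the half-life is ln(0.5)/ln(1 - alpha). A quick check:

```python
import numpy as np

span = 20
alpha = 2.0 / (span + 1)                    # pandas: ewm(span=s) => alpha = 2/(s+1)
halflife = np.log(0.5) / np.log(1 - alpha)  # bars for a weight to decay by half
print(round(halflife, 2))                   # -> 6.93, i.e. roughly a 7-day half life
```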

Let me know if you see any errors in my work/logic.

Mike

Attached is the notebook with the faster implementation of the triple barrier method. I have also included some tests that time the runtimes on different numbers of days and securities. The improvement is significant. For approximately one year of data and 300 securities, the runtime is about 5 seconds vs. 326 seconds (5.4 minutes).

A couple of things I did differently in the fast implementation compared to the original function (a rough sketch of the first and third points follows this list):
- My labeling logic differs when the vertical barrier is hit first: I simply take the percent location of the final price in the range between the upper and lower barriers. It is simple, and it can be modified pretty easily if necessary.
- I do not calculate a label if there are not a full t_final bars in the window. This is likely how I would prefer the function to work.
- I take into account the possibility that both horizontal barriers are hit on the same bar. In this case, the function returns a dummy value of np.inf, which basically informs the user that it is uncertain which barrier was hit first.
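Sketched roughly (the notebook has the real, vectorized code), the first and third points look like this:

```python
import numpy as np

def vertical_barrier_label(final_price, lower, upper):
    # Percent location of the final price between the barriers,
    # rescaled from [0, 1] to [-1, 1].
    return 2.0 * (final_price - lower) / (upper - lower) - 1.0

def horizontal_barrier_label(bar_high, bar_low, lower, upper):
    # Uses the bar's high/low so intrabar touches are not missed.
    hit_upper = bar_high >= upper
    hit_lower = bar_low <= lower
    if hit_upper and hit_lower:
        return np.inf   # ambiguous: both barriers inside one bar's range
    if hit_upper:
        return 1.0
    if hit_lower:
        return -1.0
    return None         # neither barrier hit on this bar
```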

I also modified Anthony's original function slightly to correct for the following bugs (let me know if I'm incorrect on these):
- The original function added an extra day to the t_final window.
- On the vertical barrier date, the original function only took the bar's close price into account. In other words, price could theoretically have dropped 90% intrabar but closed back within the horizontal bands on that same bar, yet it would not have been labelled a -1.

If you run the entire notebook as is, it will probably take a few hours, because the two speed tests for the slow implementation take a long time (I also run multiple iterations when timing, which can be changed in the code). For this reason, I recommend looking at the saved output before re-running the notebook.

Here is a helper notebook that includes some test data along with some additional print statements to help the user understand the logic in the new function.

Here is a notebook I used to compare the results of the two functions. There isn't a lot of explanation in the notebook, but it might help in case you wanted to do the same.

Michael,

Amazing work! Your code will be incredibly useful when I look more into the labeling aspect Lopez de Prado describes. Again, thank you!

Simply amazing, keep up the good work!!