Backtesting - Evaluating Algorithms - Are we doing it the right way?

I have been thinking about the way Quantopian does backtesting/benchmarking and how it could be improved. I say this because a lot of algorithms look spectacular at first glance of the backtesting graph and return comparison, but don't stand up to closer scrutiny.

What do y'all think?

So I started looking at what the problems were. I have done a fair bit of trading and algorithm development outside of Quantopian and I think I have a decent understanding of the problem.

Issues:
1. Leverage - Many algorithms run with wild leverage. Though in theory an algorithm starts with a $10,000 initial balance, it runs wild, buying/selling far more than the portfolio is worth.
2. A small section of the backtest with good returns can mask mediocre or bad returns over the rest of the backtesting period, or vice versa.
3. The effect of portfolio size over time: though you start with only $10,000, as the portfolio grows or shrinks, the size of the later trades becomes bigger or smaller. I tend to see this as skewing the profit/loss expectation of the algorithm over time.

A good way to evaluate algorithms is to first evaluate all your trading signals separately from the impact of portfolio management.

That is, evaluate all potential trade signals as a collection of bets, each with a probability of success/failure and an associated profit/loss potential. Look at all the buy/sell/hold signals as a collection of trades (along with stop-loss rules), independent of portfolio size, with or without commissions. Ignore the fact that consecutive buy/sell signals on consecutive days/bars for the same stock may not actually translate into trades in a real-world run of the algorithm; we look at all of them as a whole.

Evaluate each trade as a constant-sized bet (with or without commissions) and measure the size of the profit or loss on each trade.

Then work out the statistical probability of profit or loss, as well as the "expected" size of the profit or loss.

Compare that to the equivalent trades/numbers of the benchmark.

And also understand these numbers separately for up markets and down markets.

Evaluating the reliability of the algorithm's signals this way would give us a good understanding of the trading signals we will be working with, and a true picture of the strength of an algorithm.
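The constant-bet evaluation described above can be sketched in plain Python. This is an illustrative function, not part of the Quantopian API; the name `evaluate_signals` and the input format (a list of fractional per-trade returns) are my own assumptions:

```python
import statistics

def evaluate_signals(trade_returns, bet_size=1000.0):
    """Treat every signal as a constant-sized bet and summarize the results.

    trade_returns: fractional return of each closed trade (0.02 = +2%).
    bet_size: the same dollar amount staked on every trade, so portfolio
    growth or shrinkage never influences the statistics.
    """
    pnl = [r * bet_size for r in trade_returns]
    wins = [p for p in pnl if p > 0]
    losses = [p for p in pnl if p <= 0]
    p_win = len(wins) / len(pnl)
    avg_win = statistics.mean(wins) if wins else 0.0
    avg_loss = statistics.mean(losses) if losses else 0.0
    # Expected P/L per bet: probability-weighted average outcome.
    expectancy = p_win * avg_win + (1 - p_win) * avg_loss
    return {"p_win": p_win, "avg_win": avg_win,
            "avg_loss": avg_loss, "expectancy": expectancy}

stats = evaluate_signals([0.02, -0.01, 0.03, -0.02, 0.01])
```

The same function could be run separately on trades taken in up markets and down markets, and on the benchmark's equivalent trades, to get the comparisons described above.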

The next phase would then be to apply portfolio management rules.

Now I don't expect Quantopian to implement all of this; everyone evaluates algorithms differently, maybe even in proprietary ways. What I am suggesting with this thread is a way for me, as a developer, to calculate and display these various custom statistics, both in table and/or graphical form, as well as custom summary statistics. For example:

  1. Extensions to record() to show custom graphs
  2. Maybe an additional API to record to a table-format display
  3. As well as an API to display custom summary statistics for the entire backtest
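In the meantime, a table-format summary can be approximated outside record() by accumulating per-trade records and formatting them at the end of the backtest. A minimal stdlib-only sketch, where the trade records and field names are hypothetical:

```python
from collections import defaultdict

# Hypothetical per-trade records accumulated during a backtest.
trades = [
    {"symbol": "AAPL", "pnl": 120.0},
    {"symbol": "MSFT", "pnl": -45.0},
    {"symbol": "AAPL", "pnl": 80.0},
]

# Aggregate into the kind of table-format summary record() cannot show today.
summary = defaultdict(lambda: {"count": 0, "total": 0.0})
for t in trades:
    row = summary[t["symbol"]]
    row["count"] += 1
    row["total"] += t["pnl"]

# Print as a fixed-width table in the log output.
print(f"{'symbol':<8}{'count':>6}{'total':>10}")
for symbol, row in sorted(summary.items()):
    print(f"{symbol:<8}{row['count']:>6}{row['total']:>10.2f}")
```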

my 2c,
Sarvi

PS: A good reference for everything I am talking about would be Trading Systems and Methods by Kaufman.

4 responses

I work on developing algorithms and evaluating the significance of backtesting results. In developing algorithms, the primary objective should be to remain skeptical of the results. As you said, there are many factors that can skew the results, leading many individuals to toss money into a strategy that they overfitted. As far as your concerns go:

1) Leverage: please take a look at some code Daniel Sandberg and I worked on to address this. I started a framework for a long-only portfolio, and Daniel extended it to support long/short. I am really hoping to see this evolve with the community, because I agree that the unrealistic strategies on this forum (while fun) have no merit in the real world.

2) Lucky Trades: Strategies can definitely be inflated by a single unrepeatable trade. To combat this, I often look at my equity curve and check for huge spikes; these tell me my algo got lucky and took advantage of a few unrepeatable events in history. Optimally I want a smooth equity curve, which tells me my idea is valid across all market conditions. Granted, this is merely a qualitative approach. Furthermore, you can look at the trailing returns of your strategy vs. the benchmark and calculate the percentage of times you surpassed it, which should also help to discount outlier performance. Alternatively, you could winsorize your results to get a better picture of your equity curve; essentially this just clips trades that performed some threshold above or below average. I also like to look at event studies of my strategy, averaging all my trades over some length of time, to see if my idea has any kind of merit.
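The winsorizing step can be sketched with NumPy. The function name and the toy return series are illustrative only; the idea is simply to clip the extreme percentiles of per-trade returns and see how much of the equity curve survives without the outliers:

```python
import numpy as np

def winsorize(returns, pct=0.05):
    """Clip trade returns at the given lower/upper percentiles so a few
    outlier trades cannot dominate the equity curve."""
    lo, hi = np.percentile(returns, [100 * pct, 100 * (1 - pct)])
    return np.clip(returns, lo, hi)

# Five ordinary trades and one lucky +90% trade.
raw = np.array([0.01, -0.02, 0.015, 0.9, -0.01, 0.02])
clipped = winsorize(raw, pct=0.10)

# Compare compounded growth with and without the outlier's full effect.
raw_curve = np.cumprod(1 + raw)
clipped_curve = np.cumprod(1 + clipped)
```

If the clipped curve is dramatically flatter than the raw one, most of the strategy's return came from a handful of unrepeatable trades.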

3) I kind of disagree with you here... The idea should be to scale your trades with your account value: if you have a $1B account, does it make sense to trade the same lots as a $100K account? If your account is growing, you should be increasing your trade size; this is how you achieve pseudo-exponential growth in your equity. While a $100K account will be sensitive to $10K of equity at risk, the same dollar value of risk is not going to make much of a percentage difference in a $1B account.

I would also like more freedom in what kinds of plots I can get out of the backtest, namely additional subplots, as well as plots of individual securities with my entries and exits.

I am not saying the current backtesting curve doesn't help; I am just pointing out that it is not easy to detect the issues mentioned above.
2. The plain benchmark backtesting graph doesn't help you understand this. I would rather see a graph that tells me how consistently my signals beat the benchmark, say in 80 out of 100 trades (even by 2%), than one showing that I beat the benchmark by 150% overall, when that lead was established five years back during a one-year period in which the algorithm performed exceptionally better than the benchmark, and merely tracked it before and after.
3. I don't disagree that we should scale trades with account value in a real trading implementation. It's just that when analysing and comparing how the trading signals performed, the analysis should not be affected by the size or growth/contraction of the portfolio during the backtest period. Instead I would prefer that we bet the same exact amount from the beginning to the end of the backtest period, so that we can understand how the bets performed during the entire period.

Sarvi

Regarding #3, as you guys mentioned, it's also an issue when comparing against the benchmark. For example: my algorithm trades all the money it can while not overspending. Let's say it has a 50-day warmup period. During that time, the value of my portfolio remains flat since I own no shares, but the S&P has fallen 5%. So when my algo starts trading, it already has an advantage over the benchmark. If my algo performs similarly to the benchmark from here, my graph will still look better, because the average slope of each graph is proportional to its starting size, which is 5% higher for my algo. Not to mention, my algo obviously starts 5% farther ahead than it should. This can also occur if my algorithm just gets lucky early on and then moves with the S&P. I think a lot of people have seen this, and it is misleading: diverging benchmark and algo return lines make a strategy look nice even though it has average performance.

That said, this is a problem that can be mitigated in code, and right now it is up to people to do that themselves. Quantopian could add tools or features to automatically avoid problems like this, but I think another option is having users build frameworks that other users can easily include in their algorithms. I'm not sure what the most elegant solution is; I think there are a few good routes to take, especially with something as flexible as Python.
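One such in-code mitigation is to re-anchor the benchmark at the bar where the algorithm actually begins trading, so the warmup period's "free" lead disappears. A hedged sketch, assuming you have both cumulative value curves as arrays and know the index of the first trade (the function name is my own):

```python
import numpy as np

def rebase_benchmark(algo_curve, bench_curve, first_trade_idx):
    """Re-anchor the benchmark so both curves meet at the bar where the
    algorithm actually begins trading, removing any lead accumulated
    while the algo sat in cash during warmup."""
    algo = np.asarray(algo_curve, dtype=float)
    bench = np.asarray(bench_curve, dtype=float)
    # Scale the benchmark so it equals the algo's value at the first trade.
    scale = algo[first_trade_idx] / bench[first_trade_idx]
    return bench * scale

# Algo flat for 2 bars of warmup; the benchmark fell 5% in the meantime.
algo = [1.00, 1.00, 1.00, 1.02, 1.04]
bench = [1.00, 0.97, 0.95, 0.96, 0.97]
rebased = rebase_benchmark(algo, bench, first_trade_idx=2)
```

After rebasing, any remaining gap between the two curves reflects performance earned while actually trading, not the warmup offset.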


@Sarvi,

Let me restate my advice on comparing your performance to the benchmark, as I may not have explained it properly. The general shape of the equity curve is a good qualitative measure to look at first: you can spot events the backtest luckily took advantage of, which inflate returns, by looking for huge spikes. Next you need to actually evaluate the frequency with which you beat the benchmark. You could look at the performance breakdown numbers on the backtest page and tally this up by hand, but better yet, automate the process. It would not be hard to keep track of the 252-day rolling compounded returns for both your portfolio and your benchmark; with each new calculation, add a tally if the portfolio's return exceeds the benchmark's, and you can show whether your portfolio consistently beats the benchmark. Of course you don't have to use 252 days; use whatever window is applicable to you.
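The rolling-window tally described above can be sketched as follows. This is an offline illustration with made-up daily return series, not Quantopian API code; the function name is mine:

```python
import numpy as np

def rolling_beat_fraction(algo_rets, bench_rets, window=252):
    """Fraction of rolling windows in which the algo's compounded return
    exceeded the benchmark's over the same window."""
    algo = np.asarray(algo_rets, dtype=float)
    bench = np.asarray(bench_rets, dtype=float)
    beats, total = 0, 0
    for i in range(window, len(algo) + 1):
        # Compound each series over the trailing window.
        algo_cum = np.prod(1 + algo[i - window:i]) - 1
        bench_cum = np.prod(1 + bench[i - window:i]) - 1
        beats += algo_cum > bench_cum
        total += 1
    return beats / total

# Toy example with a short window: algo earns 0.1% daily, benchmark 0.05%.
frac = rolling_beat_fraction([0.001] * 10, [0.0005] * 10, window=5)
```

A strategy that beats the benchmark in, say, 80% of rolling windows is more convincing than one whose entire lead comes from a single stretch.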

I am still not on the same page as you on #3, however, as you could simply look at the average return per trade and disassociate it from the dollar value. Furthermore, you might want to step outside Quantopian and use an event profiler to see if your strategy has any merit; this will show you graphically how the average trade performed, including the deviations.
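The core of such an event profiler is simple: normalize each traded instrument's price path to 1.0 at the entry bar, then average the paths. A minimal sketch with hypothetical price data (the function name and inputs are my own, not any profiler's actual API):

```python
import numpy as np

def event_study(price_paths):
    """Average the normalized price path around each trade entry.

    price_paths: one row per trade, each row holding the instrument's
    prices for N bars starting at the entry bar.
    """
    paths = np.asarray(price_paths, dtype=float)
    # Normalize each path to 1.0 at entry so trades are comparable.
    norm = paths / paths[:, :1]
    return norm.mean(axis=0), norm.std(axis=0)

mean_path, std_path = event_study([
    [100.0, 101.0, 103.0],   # trade 1: entered at $100
    [50.0, 50.5, 51.0],      # trade 2: entered at $50
])
```

Plotting `mean_path` with a `std_path` band shows how the average trade evolved after entry, independent of dollar size.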

@Gus I would love to see an additional forum page for users to share code for backtest infrastructure, as opposed to strategies. I think this would provide an avenue for those who might not be willing to share their strategy to share code that addresses caveats and gotchas within the library. Each user could then incorporate bits of code into their personal framework to improve the information gained from their backtests. Another useful feature would let users set a default environment that imports their framework(s) whenever they create a new algorithm.