Backtest failing constantly and randomly


edited Feb 8, 2016

I'm trying to run a backtest for an algo. Sometimes it fails and sometimes it works, and this is totally random. It seems that the webworker is killed during initialization (this can also happen while the test is running, sometimes after quite a long time) and the screen just hangs. No error messages, nothing. What's worse, this seems to depend on the time of day, meaning (I assume) it crashes silently when there is more load.

Screenshot with the Firefox debugger that shows hanging dynos after trying to start a backtest. Sometimes I get the same error during testing.
http://i.imgur.com/xcfxEQQ.png

I have used many different platforms for backtesting (including a custom one), and Quantopian seems immature, slow and extremely unreliable. I didn't expect to be starting test after test, like playing a slot machine, to get my test running (or failing silently).

Are there any plans to make backtesting more reliable, or to allocate more resources to it, so that it would a) be faster and b) not crash randomly?

17 responses

Alisa Deychman

Feb 9, 2016

Hi Mikko,

Sorry that you're running into errors with your backtest. Could you send us an email to [email protected] and let us see the code (or a skeleton structure) so we can diagnose? We'd like to help you get the algo up and running. Thanks!

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Charles Piché

Feb 9, 2016

It may be unrelated, but I also have had problems running backtests lately. Random crashes on long 2002-2016 backtests (on old algos that didn't use to cause any error). Had to retry until it worked.

Mikko M

Feb 9, 2016

Same thing again. Because this happens during execution (after 6 minutes, if I calculate correctly), it has to have something to do with the environment. No errors, nothing (some error codes would definitely help). It just stops; the only thing that tells me it has stopped is the debugger.

I think you can see the dyno ID in those screenshots and check your logs to see what's wrong? At least that's what I would assume reasonable devs could do.

http://i.imgur.com/R5YoV4e.png

If you want to take a peek at my code, go ahead (I'm quite sure you can do that if you need to). You can see the algo name in the image.

Mikko M

Feb 9, 2016

Another screenshot. Now I can't even get it running again. Constant socket errors.

http://i.imgur.com/ONIod6f.png

Dan Dunn

Feb 10, 2016

I took a look at this just now. One of the very helpful things about your screenshot is that the error message includes the backtest ID, so I can search our logs and find a lot of helpful information.

The short version: It looks like the problem isn't that the backtest stops running; the problem is that the browser is losing connection to our server.

The longer version: When you press "build" or "full backtest" we kick off a backtest on one of our backtest servers (the codepath for "build" and "full backtest" is identical). Then, the browser opens a websocket connection to our server. I'm not sure if you are familiar with websockets but they are a relatively recent feature in web browsers. The backtest data is streamed back to your browser, one day at a time, over that websocket.

That backtest keeps running until 1) it finishes successfully 2) it throws an exception 3) it is cancelled 4) some sort of traumatic server failure occurs. The backtest will finish regardless of whether the websocket connection is disconnected - the websocket doesn't matter to the backtest server.

Your first screenshot had two backtest IDs in it, and I looked up what happened to them:
56b9012cce67a7118fa77ddf - cancelled
56b90153dd069a11c7148054 - exception

The second screenshot had one:
56ba55c8434c141276f9b759 - cancelled

The third screenshot had two:
56ba5ebcf5478412d6ec1e19 - cancelled
56ba5ee5d871eb1188b3b0cd - cancelled

What I infer from all of this is something like this:
1) you are pressing "build" to start a backtest
2) in 4 of the 5 cases, the backtest stops updating because the websocket disconnected
3) you were understandably annoyed with the lack of progress and pressed "cancel" which stopped the backtest on the backend
4) repeat

One way to partially verify my theory is for you to run a "full backtest" instead of pressing build. You may see the websocket disconnect error. However, with a full backtest you can reload the page, and the websocket will be re-established. You can reload the page an hour or a day later, and you'll get the full backtest. (That reloadability is true of full backtests, but is not true with the "build" backtests. "build" backtests can't be accessed except when they are first run). This test will verify that the backtest server itself is completing the job, and we can then focus on why the websocket is disconnecting.
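As a rough illustration of the behaviour described above (not Quantopian's actual implementation), the toy model below keeps a backtest's results on the server side whether or not a browser session is attached. BacktestJob, BrowserSession and the "example-id" value are hypothetical names invented for this sketch, and no real websocket is involved.

```python
class BacktestJob:
    """Hypothetical server-side job; results are stored whether or not a client listens."""

    def __init__(self, backtest_id, total_days):
        self.backtest_id = backtest_id
        self.total_days = total_days
        self.daily_results = []
        self.status = "running"

    def step(self):
        """Simulate one trading day, unless the job has already ended."""
        if self.status != "running":
            return
        self.daily_results.append({"day": len(self.daily_results) + 1})
        if len(self.daily_results) == self.total_days:
            self.status = "finished"

    def cancel(self):
        """An explicit cancel ends a running job early."""
        if self.status == "running":
            self.status = "cancelled"


class BrowserSession:
    """Hypothetical client that receives results only while connected."""

    def __init__(self, job):
        self.job = job
        self.connected = True
        self.received = []

    def poll(self):
        """Fetch the days not yet seen; a disconnected session gets nothing."""
        if self.connected:
            self.received.extend(self.job.daily_results[len(self.received):])

    def disconnect(self):
        self.connected = False

    def reload_page(self):
        """Reloading re-establishes the connection and catches up."""
        self.connected = True
        self.poll()


job = BacktestJob("example-id", total_days=10)
session = BrowserSession(job)

# The first three days stream to the browser normally.
for _ in range(3):
    job.step()
    session.poll()

# The connection drops, but the job keeps going on the server.
session.disconnect()
while job.status == "running":
    job.step()
    session.poll()

days_seen_while_disconnected = len(session.received)
session.reload_page()
days_seen_after_reload = len(session.received)
```

In this model the page looks frozen after the disconnect even though the job finishes, and pressing cancel at that point is the only thing that would stop it. A "build" backtest would correspond to a session that has no reload_page.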

Assuming that my theory is right, the real question is, why is the websocket breaking? That is, unfortunately, going to be a lot harder to troubleshoot. Over the last 3.5 years of supporting websockets I've seen a few causes.

an aggressive corporate or academic firewall that closes connections that it thinks are inactive
an older firewall that handles websockets poorly
browser plugins, ad blockers, things like that
buggy or older browsers
slow or intermittent network connections

The things to try in order to figure it out:

Try another browser - use Firefox instead of Chrome, or vice versa
Try another computer at the same location
Log in from another location - try work instead of home, or vice versa

The last-ditch method for debugging something like that is to put packet-sniffing software in and look for the network-level failure. Unfortunately even that isn't always successful because we can only listen to the two ends of the connection, and we can't see everything going on in the middle. But hopefully we'll figure it out before we get there.


Mikko M

Feb 10, 2016

Thank you, this is interesting to know. I was not behind a firewall at the time (it was from my home laptop, so direct cable via wifi). I'll try another browser next time to verify whether this is a browser bug. I have had no problems with the connection outside Quantopian (I use it to watch Netflix quite often, for instance, so bandwidth should not be an issue).

I didn't know that full backtests are run on background processes. That's very interesting to know and will probably solve the actual problem if connections are the issue.

I checked Firefox's Bugzilla and it seems Firefox had some open webworker bugs in late 2015, so this might be the issue (the version I use is 43.0.2). What I find odd is that it's so sporadic. Did you find that the hangups happened during peaks of high load on your side?

Also, I want to verify that I understand correctly: did you isolate the bug as 100% due to the browser dropping the connection, and not due to some virtual machine bug that might cause issues like this under high load / memory usage conditions?

I want to verify this because I'm considering whether I should invest time in using Quantopian at all (I decided against it some years ago and am taking a second look now that you have live trading, competitions, etc., so I don't want to make hasty decisions).

Dan Dunn

Feb 12, 2016

Technically, both full backtests and "build" backtests run on background processes, but the difference is that there is no way to reconnect to a "build" backtest.

I haven't 100% proven that it is a connection problem. Getting a full backtest to run to completion would go a long way to verifying that the connection is the problem.

We have auto-scaling on both our front end and our backtest server array, and we monitor the load on them as well. There is no indication that this is a server issue.

Did you get the full backtest to run to completion? It will be informative either way.

P.S. I'm going on a long weekend, so [email protected] may get you a faster response.


Mikko M

Feb 12, 2016

Yes, with the full backtest I got the algo to run to completion multiple times, so this is very probably some kind of client connection issue.

I still sometimes get random timeouts from the history function, but I'm aware that this is a separate bug.

Mikko M

Feb 12, 2016

I have another question. I'm trying to use some machine learning methods and I'm building a regressor. Training takes some time, and it sometimes fails, seemingly at random.

Is there any way to get around this? Training a regressor takes time even though I have kept the sample count and other values so low that training takes much less time than it really should. As I'm using scikit-learn, there isn't much I can do to split the training into parts to stay under the function call time limit.

Blue Seahawk

Feb 13, 2016

Try processing only every nth element with a double colon like prices_dataframe_from_history[0::n], that is, like [0::4].
You're probably not doing this, but: processing all 390 minutes of each trading day for 10 stocks, with just one function taking 2/1000ths of a second per stock, would add about 32 minutes per backtest year.
Timing code helps with detecting choke points.
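The estimate and the timing idea above can be sketched as follows. This is a standalone Python 3 sketch, not code from the Quantopian IDE (whose sandbox may restrict imports); the 252 trading days per year is an assumption, and yearly_overhead_minutes, timed and timings are hypothetical names made up for the example.

```python
import time
from contextlib import contextmanager

MINUTES_PER_SESSION = 390      # minute bars in one regular trading day
TRADING_DAYS_PER_YEAR = 252    # assumption; the exact count varies by year


def yearly_overhead_minutes(stocks, seconds_per_call):
    """Extra backtest minutes per year if one function runs per stock per minute bar."""
    calls = MINUTES_PER_SESSION * TRADING_DAYS_PER_YEAR * stocks
    return calls * seconds_per_call / 60.0


timings = {}  # label -> accumulated seconds


@contextmanager
def timed(label):
    """Accumulate wall-clock time spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + (time.perf_counter() - start)


# A plain list stands in for the price history; [0::4] keeps every 4th element.
prices = list(range(100))
with timed("every_4th"):
    sampled = prices[0::4]

overhead = yearly_overhead_minutes(stocks=10, seconds_per_call=0.002)
```

Under these assumptions the overhead works out to 390 * 252 * 10 * 0.002 / 60, a little under 33 minutes, in line with the figure quoted above. Wrapping each suspect section in its own timed(...) block and comparing the totals in timings shows where the time goes.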

Mikko M

Feb 13, 2016

Sure, I could reduce the sample size; the problem is that the accuracy of the classifier (or regressor) goes down (I'm already using only every 20th point of dollaruniverse(99,100) daily data over 1000 days, which is some ~6500 samples). The same goes for the other parameters of the classifier/regressor. I can't split the training of the classifier into several parts, as scikit-learn does not support that. Basic training goes like this:

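classifier = some_classifier(parameters)  # placeholder for whichever estimator is used
X = [samples]                             # placeholder: feature rows
Y = [results]                             # placeholder: target values
classifier.fit(X, Y)                      # the slow step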

The fit call is where Quantopian kills the whole process, as it might take some time. With fewer samples I can make it run (I'm running with a smaller number of samples now), but from experience I know that fewer samples means much worse results, so the algo is not as good as it could be.
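One possible way around a per-call time limit, offered as an assumption rather than anything confirmed in this thread, is incremental training: feed the model one chunk of samples per call so that no single call runs long. Some scikit-learn estimators (SGDRegressor, for example) expose a partial_fit method for this, though whether it is usable inside Quantopian's sandbox, and whether such a model suits this data, is not established here. The dependency-free toy below only illustrates the idea; IncrementalLinearRegressor and chunked are hypothetical helpers, not scikit-learn code, and the data is synthetic.

```python
class IncrementalLinearRegressor:
    """Toy linear model trained by stochastic gradient descent (illustrative only)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_one(self, row):
        return sum(w * v for w, v in zip(self.weights, row)) + self.bias

    def partial_fit(self, X, y):
        """Update the model with one chunk; progress from earlier chunks is kept."""
        for row, target in zip(X, y):
            error = self.predict_one(row) - target
            for j, value in enumerate(row):
                self.weights[j] -= self.learning_rate * error * value
            self.bias -= self.learning_rate * error
        return self


def chunked(X, y, size):
    """Yield consecutive (X, y) slices of at most `size` samples."""
    for start in range(0, len(X), size):
        yield X[start:start + size], y[start:start + size]


# Synthetic data following y = 2x + 1 (made up for this example).
X = [[i / 100.0] for i in range(100)]
y = [2.0 * row[0] + 1.0 for row in X]

model = IncrementalLinearRegressor(n_features=1)
for _ in range(200):  # repeated passes over the data
    for X_chunk, y_chunk in chunked(X, y, size=25):
        # Each call is short; on a time-limited platform the calls could be
        # spread over several invocations instead of one long fit().
        model.partial_fit(X_chunk, y_chunk)
```

The trade-off is that only models trained by incremental updates fit this pattern, so it does not help with estimators that need all samples in a single fit call.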

Mikko M

Feb 22, 2016

I'm getting a bit annoyed at the backtester again. Now it's crashing with an out-of-time error in the history function (100 stocks from pipeline, 500 days of data), whereas in the past I used it with 2k price points without problems. It really can't be a problem for a backtester to fetch 500 daily points of history for 100 stocks! (The algo does this processing only once a month.)

I'm getting to the point where my frustration is pushing me back to my old custom tester again.

You might want to either grant more resources to users who request them (I'm doing some pretty cheap machine learning at the moment, as it's impossible to create any serious models due to CPU constraints) or consider offering resources for cash. I would be willing to pay a reasonable price for a rented personal instance where I could use more processing power.

Screenshot of the event.
http://imgur.com/kCVonza

Mikko M

Feb 22, 2016

I changed the stock list to a hardcoded SP100 to verify that this is not about pipeline. The code timed out at the same history line.

HOWEVER..

What is quite interesting is that the same code works just fine, without a timeout, if I increase the lookback to 1000 days instead of 500. So increasing the history day count decreases the query time?

Burrito Dan

Feb 22, 2016

I am getting some timeouts today on things that have worked in the past. I've sent some info to [email protected]

Mikko M

Mar 2, 2016

Yesterday and today (just now) I have been getting a lot of this:

Line 108 is just:
out = pipeline_output('short_pipe')

FundamentalsQueryError: (psycopg2.OperationalError) FATAL: remaining connection slots are reserved for non-replication superuser connections
There was a runtime error on line 108.

It might be a good idea to raise the connection limit for the fundamentals database.

And at one point I was getting "??? is shutting down" and "Unknown error".

Mikko M

Mar 2, 2016

Now the previous problem has disappeared, but suddenly all my backtests stop at 1.4%; they have all been waiting there for at least 20 minutes. No timeout, nothing, the algos are just hanging.

http://i.imgur.com/dnUzfXu.png

Edit: And tadaa! After 30 minutes of waiting the algos are running again.

Mikko M

Mar 14, 2016

I'm getting extremely frustrated again. My backtest, 30 minutes into its run, suddenly stops and throws an exception... as does EVERY OTHER test of the algo at the same time. What's the point of a backtesting platform if getting a backtest to run is so unpredictable? And this is an algo that I have tested before, so this is not due to some error of mine; I just wanted to refresh the results with the latest data (and run other tests with a few variations).

Also, why does my full backtest hang at the same point every time I try to run it (1.5%), while the "build" test from the IDE page goes past that? Is there a different server for full backtests and build tests or something?
