Hi All,
Thanks for all the feedback in this thread. I spent some time today putting together my thoughts about how we think about timeouts on Quantopian, and about how we could reduce the negative impact of timeouts. I can't promise any concrete changes in the near term, but hopefully this helps clarify our current thinking about the issue.
Why Do We Enforce Timeouts on Quantopian?
We enforce timeouts on Quantopian algorithms for several reasons:
Stopping Runaway Code Execution
The simplest reason we enforce timeouts on Quantopian is that we need to handle the possibility that a user accidentally writes an infinite loop in their algorithm. It's provably impossible to determine statically whether a long-running algorithm will terminate, so the simplest and most reliable option is to stop the algorithm with an error if it doesn't appear to make progress.
A similar failure mode happens when an algorithm attempts to solve a computationally intractable problem. CVXPY, for example, provides an interface for solving mixed-integer programming problems, which can take an impractically long time to solve for large inputs. An algorithm that tries to run an exponential-time computation on a large input is, for all practical purposes, stuck in an infinite loop.
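To make that concrete, here's a contrived sketch (not taken from any real algorithm) of the kind of innocuous-looking problem I mean: a knapsack-style mixed-integer program whose worst-case solve time grows roughly exponentially with the number of variables, so for a large enough n it's indistinguishable from a hang. Running it also assumes a mixed-integer-capable solver (e.g. GLPK_MI) is installed.

import numpy as np
import cvxpy as cp

n = 60                             # crank this up and solve time can explode
values = np.random.rand(n)
weights = np.random.rand(n)

# Boolean (0/1) decision variables are what make this problem NP-hard.
x = cp.Variable(n, boolean=True)
problem = cp.Problem(cp.Maximize(values @ x), [weights @ x <= n / 4])
problem.solve()                    # may effectively never finish for large n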
Finite Computing Resources
Running financial simulations requires a lot of computing power. To ensure that Quantopian is available for everyone to use, we limit the amount of resources that any single user can consume. One of our most constrained resources is the number of backtests executing at any given time. If all of our users ran very long backtests, we'd run out of memory and CPU cores to start new backtests, so we limit the amount of time that any backtest is allowed to take. Very long backtests also complicate our ability to perform upgrades reliably.
Operating and Evaluating Funded Algorithms
A significant portion of Quantopian's business is operating algorithms to which we've provided allocations.
For us to be able to operate an algorithm in live trading, it needs to complete regular tasks in a reasonable amount of time. For our purposes here, "reasonable" means something like:
- handle_data and scheduled functions need to complete reliably in under a minute.
- before_trading_start needs to complete reliably in 10-15 minutes (though ideally it completes much faster than that).
In addition to needing to be able to operate an algorithm in live trading, we also need to be able to simulate funded algorithms under various conditions (e.g., simulating different potential allocation sizes or different slippage models). This process doesn't impose quite the same hard limits that live trading does, but we still need algorithms to complete in a reasonable amount of time for our fund research team to be able to operate effectively.
How Do Pipelines Get Executed in Algorithms?
The Pipeline API is a batch execution system. It allows users to specify a set of computations that they'd like to run every day, and it efficiently executes those computations over a range of multiple days. The batch execution model is important because it allows us to re-use input data for rolling window calculations. For example, if we write a Factor that processes 200 days of trailing pricing data every day, it's wasteful to load 200 days' worth of data every day, because 199 of today's data points were also in yesterday's data. If we know that we're going to perform the same rolling window computation every day, it's much more efficient to pre-fetch all the data we'll need to compute our factor for the next N days and then slide a rolling window over that data to compute our factor.
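Here's a toy numpy illustration of that point (just a sketch of the idea, not Zipline's actual implementation): computing a 200-day mean for each of the next 126 days touches 200 × 126 data points if you treat each day independently, but only needs one pre-fetched array of about 325 points if you slide a window over the shared data.

import numpy as np

window, ndays = 200, 126
prices = np.random.rand(window + ndays - 1)   # one pre-fetched block of pricing data

# Day-by-day approach: each day re-processes 200 points, 199 of which it
# shares with the previous day.
daily_means_naive = [prices[i:i + window].mean() for i in range(ndays)]

# Batch approach: slide a rolling window over the single shared buffer.
windows = np.lib.stride_tricks.sliding_window_view(prices, window)
daily_means_batched = windows.mean(axis=1)

assert np.allclose(daily_means_naive, daily_means_batched)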
To take advantage of Pipeline's batch processing model in algorithms, we pre-compute user-provided pipelines in large "chunks". The algorithm for this is implemented here in Zipline. It works as follows (a rough code sketch follows the list):
- The first time pipeline_output is called in an algorithm, we pre-compute 5 days' worth of the requested pipeline's results and return the result for the current day.
- On subsequent calls to pipeline_output, we check to see if we've already pre-computed the current day's pipeline result. If we have, we simply return it. If we haven't, we pre-compute the next block of pipeline output, with a chunksize of 126 trading days (approximately half a year).
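In rough pseudocode-ish Python, the caching behavior looks something like this (the names and structure here are illustrative; the real logic is in the Zipline code linked above):

FIRST_CHUNK_DAYS = 5    # short first chunk: quick feedback after starting a backtest
CHUNK_DAYS = 126        # later chunks: roughly half a year of trading days

class ChunkedPipelineCache:
    def __init__(self, run_pipeline):
        # `run_pipeline(start_day, n_days)` should return {day: results} for a
        # block of consecutive trading days.
        self.run_pipeline = run_pipeline
        self.cache = {}
        self.first_call = True

    def pipeline_output(self, day):
        if day not in self.cache:
            # Cache miss: pre-compute the next block of results. This is the
            # slow path, and the one that can run into the BTS timeout.
            n_days = FIRST_CHUNK_DAYS if self.first_call else CHUNK_DAYS
            self.cache.update(self.run_pipeline(day, n_days))
            self.first_call = False
        # Cache hit: the common case, returns almost immediately.
        return self.cache[day]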
Some immediate consequences of this algorithm:
- Most calls to pipeline_output return almost immediately, because they simply return already-computed data.
- The first time you call pipeline_output, it completes relatively quickly, because we only run a 5-day chunk of your pipeline. This is by design: the short first chunk exists to reduce the delay between starting an algorithm and getting feedback (either seeing a crash or seeing performance results), not to prevent timeouts.
- Every half year, pipeline_output takes much longer than usual to complete, because we pre-compute the next half-year's worth of results. Almost all before_trading_start timeouts trigger during one of these full-size pipeline chunk executions. In fact, before_trading_start has a longer timeout than other functions on Quantopian primarily because we want to allow for these longer pipeline executions (though there are legitimate non-pipeline uses for before_trading_start as a general-purpose "do expensive daily computations" function).
What's the Problem?
I think I'm hearing a few distinct concerns in this thread:
- It's frustrating to have an algorithm time out during before_trading_start because a pipeline_output call that's doing half a year's work took too long.
- It's especially frustrating to have an algorithm successfully complete its first (short) pipeline_output call, only to time out on a later call after several minutes of simulation.
- Five minutes isn't enough time to run a half-year chunk for many reasonable pipelines, especially if you do other non-trivial work in your before_trading_start.
To these concerns, I'll add a fourth issue that I consider a wart on the API: to ensure that pipeline executions fall under the 5-minute before_trading_start timeout, pipeline_output usually needs to be called in before_trading_start, but often you only need your pipeline output in a later scheduled function. This results in many before_trading_start implementations that look like:
def before_trading_start(context, data):
    # Stash today's pipeline results for use later.
    context.pipeline_data = pipeline_output('my_pipeline')

def my_scheduled_function(context, data):
    # Grab the previously-stored pipeline result.
    pipeline_data = context.pipeline_data
One redeeming feature of this idiom is that it makes it clear that pipeline results only update once per day. For the most part, however, it just feels like boilerplate to me. I think it would be nicer if we could just write:
def my_scheduled_function(context, data):
    pipeline_data = pipeline_output('my_pipeline')
and not have to worry about which timeout applies when pipeline_output has to do actual work.
How Can We Make Things Better?
The easiest way to reduce the impact of timeouts on Quantopian would be to simply increase our time limits. If we bumped the before_trading_start timeout from, say, 5 minutes to 10 minutes, many pipelines that currently time out would run successfully. The problem with simply increasing the BTS timeout, however, is that an algorithm might actually spend its full 10-minute budget on every call to before_trading_start, which would cause the algorithm to take much more time than we could realistically support (even the current 5-minute BTS timeout theoretically allows an algorithm to take over 20 hours to complete a year of simulation, since 252 trading days × 5 minutes ≈ 21 hours, which isn't really feasible for us to support).
In my opinion, a better approach would be to change how pipeline_output works so that we always run pipeline chunks immediately before the start of before_trading_start. This would allow us to impose a separate, longer timeout (say, 10 minutes) on batched pipeline executions while still preventing algorithms from spending an excessive amount of time every day in before_trading_start. This change would also remove the need to invoke pipeline_output in before_trading_start, because we'd be able to guarantee that pipeline results are ready by the time you request them.
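Sketched out, the daily cycle would look something like this (to be clear, this is a hypothetical illustration of the idea, not code that exists today; time_limit and compute_pending_pipeline_chunks are made-up names):

import signal
from contextlib import contextmanager

PIPELINE_CHUNK_TIMEOUT = 10 * 60   # hypothetical separate limit for batch pipeline work
BTS_TIMEOUT = 5 * 60               # existing limit for user code in before_trading_start

@contextmanager
def time_limit(seconds):
    # Raise an error if the enclosed block runs longer than `seconds`.
    def on_timeout(signum, frame):
        raise TimeoutError("time limit exceeded")
    signal.signal(signal.SIGALRM, on_timeout)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)

def run_daily_cycle(algo, day):
    # 1. Run any pending pipeline chunk under its own, longer timeout.
    with time_limit(PIPELINE_CHUNK_TIMEOUT):
        algo.compute_pending_pipeline_chunks(day)   # hypothetical hook
    # 2. Run user code; pipeline_output() now only reads cached results.
    with time_limit(BTS_TIMEOUT):
        algo.before_trading_start(day)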
A more radical long-term idea might be to add something like a "progress budget" to the backtester. For example, we might start an algorithm with a 15-minute budget that begins ticking down as soon as the algorithm starts running. If the budget hit 0, we would trigger a timeout, but we'd add 5 minutes back to the budget for every month of completed progress. A system like this would enable algorithms that occasionally run a very expensive function, so long as that function wasn't run too often. We'd probably still have to maintain time limits on individual API functions to ensure that we could support live operation, but we might be able to relax them a bit more. Implementing and properly tuning such a system would require a lot of work, so I think it's unlikely that we'll build it in the immediate future, but it's something for us to think about moving forward.
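For what it's worth, here's a very rough sketch of what I mean by a progress budget (again, purely hypothetical; nothing like this exists today): wall-clock time drains the budget, and completed simulation progress refills it.

import time

INITIAL_BUDGET = 15 * 60      # seconds of wall-clock time to start with
REFUND_PER_MONTH = 5 * 60     # seconds credited back per simulated month completed

class ProgressBudget:
    def __init__(self):
        self.remaining = INITIAL_BUDGET
        self.last_check = time.monotonic()

    def on_month_completed(self):
        # Reward forward progress with more wall-clock time.
        self.remaining += REFUND_PER_MONTH

    def check(self):
        # Called periodically by the backtester: drain elapsed wall-clock time
        # and fail if the algorithm has run out of budget.
        now = time.monotonic()
        self.remaining -= now - self.last_check
        self.last_check = now
        if self.remaining <= 0:
            raise TimeoutError("algorithm exhausted its progress budget")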