Hi All,
Thanks for all the feedback in this thread. I spent some time today putting together my thoughts about how we think about timeouts on Quantopian, and about how we could reduce the negative impact of timeouts. I can't promise any concrete changes in the near term, but hopefully this helps clarify our current thinking about the issue.
Why Do We Enforce Timeouts on Quantopian?
We enforce timeouts on Quantopian algorithms for several reasons:
Stopping Runaway Code Execution
The simplest reason we enforce timeouts on Quantopian is that we need to handle the possibility that a user accidentally writes an infinite loop in their algorithm. It's provably impossible to determine statically whether a long-running algorithm will terminate, so the simplest and most reliable option is to stop the algorithm with an error if it doesn't appear to make progress.
A similar failure mode happens when an algorithm attempts to solve a computationally intractable problem. CVXPY, for example, provides an interface for solving mixed-integer programming problems, which can take an impractically long time to solve for large inputs. An algorithm that tries to run an exponential-time computation on a large input is, for all practical purposes, stuck in an infinite loop.
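To make that concrete, here's a contrived sketch (not taken from any real algorithm) of the kind of innocuous-looking problem I mean: a knapsack-style mixed-integer program whose worst-case solve time grows roughly exponentially with the number of variables, so for a large enough n it's indistinguishable from a hang. Running it also assumes a mixed-integer-capable solver (e.g. GLPK_MI) is installed.

import numpy as np
import cvxpy as cp

n = 60                             # crank this up and solve time can explode
values = np.random.rand(n)
weights = np.random.rand(n)

# Boolean (0/1) decision variables are what make this problem NP-hard.
x = cp.Variable(n, boolean=True)
problem = cp.Problem(cp.Maximize(values @ x), [weights @ x <= n / 4])
problem.solve()                    # may effectively never finish for large n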
Finite Computing Resources
Running financial simulations requires a lot of computing power. To ensure that Quantopian is available for everyone to use, we limit the amount of resources that any single user can consume. One of our most constrained resources is the number of backtests executing at any given time. If all of our users ran very long backtests, we'd run out of memory and CPU cores to start new backtests, so we limit the amount of time that any backtest is allowed to take. Very long backtests also complicate our ability to perform upgrades reliably.
Operating and Evaluating Funded Algorithms
A significant portion of Quantopian's business is operating algorithms to which we've provided allocations.
For us to be able to operate an algorithm in live trading, it needs to complete regular tasks in a reasonable amount of time. For our purposes here, "reasonable" means something like:
- handle_data and scheduled functions need to complete reliably in under a minute.
- before_trading_start needs to complete reliably in 10-15 minutes (though ideally it completes much faster than that).
In addition to needing to be able to operate an algorithm in live trading, we also need to be able to simulate funded algorithms under various conditions (e.g., simulating different potential allocation sizes or different slippage models). This process doesn't impose quite the same hard limits that live trading does, but we still need algorithms to complete in a reasonable amount of time for our fund research team to be able to operate effectively.
How Do Pipelines Get Executed in Algorithms?
The Pipeline API is a batch execution system. It allows users to specify a set of computations that they'd like to run every day, and it efficiently executes those computations over a range of multiple days. The batch execution model is important because it allows us to re-use input data for rolling window calculations. For example, if we write a Factor that processes 200 days of trailing pricing data every day, it's wasteful to load 200 days' worth of data every day, because 199 of today's data points were also in yesterday's data. If we know that we're going to perform the same rolling window computation every day, it's much more efficient to pre-fetch all the data we'll need to compute our factor for the next N days and then slide a rolling window over that data to compute our factor.
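Here's a toy numpy illustration of that point (just a sketch of the idea, not Zipline's actual implementation): computing a 200-day mean for each of the next 126 days touches 200 × 126 data points if you treat each day independently, but only needs one pre-fetched array of about 325 points if you slide a window over the shared data.

import numpy as np

window, ndays = 200, 126
prices = np.random.rand(window + ndays - 1)   # one pre-fetched block of pricing data

# Day-by-day approach: each day re-processes 200 points, 199 of which it
# shares with the previous day.
daily_means_naive = [prices[i:i + window].mean() for i in range(ndays)]

# Batch approach: slide a rolling window over the single shared buffer.
windows = np.lib.stride_tricks.sliding_window_view(prices, window)
daily_means_batched = windows.mean(axis=1)

assert np.allclose(daily_means_naive, daily_means_batched)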
To take advantage of Pipeline's batch processing model in algorithms, we pre-compute user-provided pipelines in large "chunks". The algorithm for this is implemented here in Zipline. It works as follows (a rough code sketch follows the list):
- The first time pipeline_output is called in an algorithm, we pre-compute 5 days' worth of the requested pipeline's results and return the result for the current day.
- On subsequent calls to pipeline_output, we check to see if we've already pre-computed the current day's pipeline result. If we have, we simply return it. If we haven't, we pre-compute the next block of pipeline output, with a chunksize of 126 trading days (approximately half a year).
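In rough pseudocode-ish Python, the caching behavior looks something like this (the names and structure here are illustrative; the real logic is in the Zipline code linked above):

FIRST_CHUNK_DAYS = 5    # short first chunk: quick feedback after starting a backtest
CHUNK_DAYS = 126        # later chunks: roughly half a year of trading days

class ChunkedPipelineCache:
    def __init__(self, run_pipeline):
        # `run_pipeline(start_day, n_days)` should return {day: results} for a
        # block of consecutive trading days.
        self.run_pipeline = run_pipeline
        self.cache = {}
        self.first_call = True

    def pipeline_output(self, day):
        if day not in self.cache:
            # Cache miss: pre-compute the next block of results. This is the
            # slow path, and the one that can run into the BTS timeout.
            n_days = FIRST_CHUNK_DAYS if self.first_call else CHUNK_DAYS
            self.cache.update(self.run_pipeline(day, n_days))
            self.first_call = False
        # Cache hit: the common case, returns almost immediately.
        return self.cache[day]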
Some immediate consequences of this algorithm:
- Most calls to pipeline_output return almost immediately, because they simply return already-computed data.
- The first time you call pipeline_output, it completes relatively quickly, because we only run a 5-day chunk of your pipeline. This is by design: the short first chunk exists to reduce the delay between starting an algorithm and getting feedback (either seeing a crash or seeing performance results), not to prevent timeouts.
- Every half year, pipeline_output takes much longer than usual to complete, because we pre-compute the next half-year's worth of results. Almost all before_trading_start timeouts trigger during one of these full-size pipeline chunk executions. In fact, before_trading_start has a longer timeout than other functions on Quantopian primarily because we want to allow for these longer pipeline executions (though there are legitimate non-pipeline uses for before_trading_start as a general-purpose "do expensive daily computations" function).
What's the Problem?
I think I'm hearing a few distinct concerns in this thread:
- It's frustrating to have an algorithm time out during before_trading_start because a pipeline_output call that's doing half a year's work took too long.
- It's especially frustrating to have an algorithm successfully complete its first (short) pipeline_output call, only to time out on a later call after several minutes of simulation.
- Five minutes isn't enough time to run a half-year chunk for many reasonable pipelines, especially if you do other non-trivial work in your before_trading_start.
To these concerns, I'll add a fourth issue that I consider a wart on the API: to ensure that pipeline executions fall under the 5-minute before_trading_start timeout, pipeline_output usually needs to be called in before_trading_start, but often you only need your pipeline output in a later scheduled function. This results in many before_trading_start implementations that look like:
def before_trading_start(context, data):
    # Stash today's pipeline results for use later.
    context.pipeline_data = pipeline_output('my_pipeline')

def my_scheduled_function(context, data):
    # Grab the previously-stored pipeline result.
    pipeline_data = context.pipeline_data
One redeeming feature of this idiom is that it makes it clear that pipeline results only update once per day. For the most part, however, it just feels like boilerplate to me. I think it would be nicer if we could just write:
def my_scheduled_function(context, data):
    pipeline_data = pipeline_output('my_pipeline')
and not have to worry about which timeout applies when pipeline_output has to do actual work.
How Can We Make Things Better?
The easiest way to reduce the impact of timeouts on Quantopian would be to simply increase our time limits. If we bumped the before_trading_start timeout from, say, 5 minutes to 10 minutes, many pipelines that currently time out would run successfully. The problem with simply increasing the BTS timeout, however, is that an algorithm might actually spend its full 10-minute budget on every call to before_trading_start, which would cause the algorithm to take much more time than we could realistically support (even the current 5-minute BTS timeout theoretically allows an algorithm to take over 20 hours to complete a year of simulation, since 252 trading days × 5 minutes ≈ 21 hours, which isn't really feasible for us to support).
In my opinion, a better approach would be to change how pipeline_output works so that we always run pipeline chunks immediately before the start of before_trading_start. This would allow us to impose a separate, longer timeout (say, 10 minutes) on batched pipeline executions while still preventing algorithms from spending an excessive amount of time every day in before_trading_start. This change would also remove the need to invoke pipeline_output in before_trading_start, because we'd be able to guarantee that pipeline results are ready by the time you request them.
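Sketched out, the daily cycle would look something like this (to be clear, this is a hypothetical illustration of the idea, not code that exists today; time_limit and compute_pending_pipeline_chunks are made-up names):

import signal
from contextlib import contextmanager

PIPELINE_CHUNK_TIMEOUT = 10 * 60   # hypothetical separate limit for batch pipeline work
BTS_TIMEOUT = 5 * 60               # existing limit for user code in before_trading_start

@contextmanager
def time_limit(seconds):
    # Raise an error if the enclosed block runs longer than `seconds`.
    def on_timeout(signum, frame):
        raise TimeoutError("time limit exceeded")
    signal.signal(signal.SIGALRM, on_timeout)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)

def run_daily_cycle(algo, day):
    # 1. Run any pending pipeline chunk under its own, longer timeout.
    with time_limit(PIPELINE_CHUNK_TIMEOUT):
        algo.compute_pending_pipeline_chunks(day)   # hypothetical hook
    # 2. Run user code; pipeline_output() now only reads cached results.
    with time_limit(BTS_TIMEOUT):
        algo.before_trading_start(day)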
A more radical long-term idea might be to add something like a "progress budget" to the backtester. For example, we might start an algorithm with a 15-minute budget that begins ticking down as soon as the algorithm starts running. If the budget hit 0, we would trigger a timeout, but we'd add 5 minutes back to the budget for every month of completed progress. A system like this would enable algorithms that occasionally run a very expensive function, so long as that function wasn't run too often. We'd probably still have to maintain time limits on individual API functions to ensure that we could support live operation, but we might be able to relax them a bit more. Implementing and properly tuning such a system would require a lot of work, so I think it's unlikely that we'll build it in the immediate future, but it's something for us to think about moving forward.
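For what it's worth, here's a very rough sketch of what I mean by a progress budget (again, purely hypothetical; nothing like this exists today): wall-clock time drains the budget, and completed simulation progress refills it.

import time

INITIAL_BUDGET = 15 * 60      # seconds of wall-clock time to start with
REFUND_PER_MONTH = 5 * 60     # seconds credited back per simulated month completed

class ProgressBudget:
    def __init__(self):
        self.remaining = INITIAL_BUDGET
        self.last_check = time.monotonic()

    def on_month_completed(self):
        # Reward forward progress with more wall-clock time.
        self.remaining += REFUND_PER_MONTH

    def check(self):
        # Called periodically by the backtester: drain elapsed wall-clock time
        # and fail if the algorithm has run out of budget.
        now = time.monotonic()
        self.remaining -= now - self.last_check
        self.last_check = now
        if self.remaining <= 0:
            raise TimeoutError("algorithm exhausted its progress budget")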