Hi Grant,
Is it possible to simulate inputs to pipeline in the research platform? For example, could one replace:
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.data.builtin import USEquityPricing
with data simulated in the workbook?
Short Answer:
What you're describing is probably possible, and the system architecture is explicitly designed to support this sort of extension, but there isn't a clean way to do it in research right now. The end of the long answer has some thoughts on what an extension API might look like.
Long Answer:
One of the core ideas in the Pipeline architecture is that it separates the abstract interface to a dataset from the concrete implementation of loading that dataset. In simple terms, this means that we define how a dataset will be used in one place, and we define how a dataset should be loaded in another place. The main benefit of this is that we can have multiple loader implementations for any given dataset, which is a nice property for an open-source library like Zipline, particularly in an industry where most of the participants are working with closed/proprietary data. Another context in which it's useful to be able to swap out loader implementations is in testing, as you allude to in the second half of your post.
The main downside of decoupling dataset definition from dataset implementation is that we have to do the extra work of matching up datasets and loaders whenever we want to actually use a dataset. The main way this is accomplished for Pipeline loaders in Zipline is that SimplePipelineEngine (the class that's responsible for loading pipeline data and passing it to user-defined compute functions) accepts a dispatching function named get_loader, whose signature is get_loader(column) -> loader.
So, if you were working with Zipline outside the Q research platform, the way you'd do what you're describing would be something like:
from zipline.pipeline.data import Column, DataSet

class MyDataSet(DataSet):
    # Declare the columns (and their dtypes) that make up the dataset.
    # This only defines the interface; no data is attached yet.
    column_A = Column(dtype=float)
    column_B = Column(dtype=bool)
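Once the dataset is defined, its columns can be referenced anywhere the built-in datasets can, e.g. via .latest or as CustomFactor inputs. As a purely illustrative sketch (MeanA is a made-up name), a factor computed over the new column would look like:

from zipline.pipeline import CustomFactor

class MeanA(CustomFactor):
    # Trailing 5-day mean of column_A.
    inputs = [MyDataSet.column_A]
    window_length = 5

    def compute(self, today, assets, out, a):
        out[:] = a.mean(axis=0)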
You'd then have to build a function that dispatches from columns of your datasets to loaders for those columns. For large datasets, this is a nontrivial engineering challenge with many interesting tradeoffs, but if you just want to test against known data, you'd probably want to describe your data as a simple in-memory structure, and in fact zipline.pipeline.loaders.DataFrameLoader provides a loader that uses an in-memory frame as its internal storage. The code for setting up a dispatching function to load columns from MyDataSet would be something like:
import pandas as pd
from numpy import arange

from zipline.pipeline.engine import SimplePipelineEngine
from zipline.pipeline.loaders.frame import DataFrameLoader

# Provide data for 2014.
dates = pd.date_range('2014-01-01', '2015-01-01')

# Provide data for just AAPL and MSFT.
assets = symbols(['AAPL', 'MSFT'])

# The values for column_A will just be a 2D array of floats counting up from 0.
column_A_frame = pd.DataFrame(
    data=arange(len(dates) * len(assets), dtype=float).reshape(len(dates), len(assets)),
    index=dates,
    columns=assets,
)

# column_B will always provide True for AAPL and False for MSFT.
column_B_frame = pd.DataFrame(data={assets[0]: True, assets[1]: False}, index=dates)

loaders = {
    MyDataSet.column_A: DataFrameLoader(MyDataSet.column_A, column_A_frame),
    MyDataSet.column_B: DataFrameLoader(MyDataSet.column_B, column_B_frame),
}

# If you're feeling fancy, you could also write this as:
# my_dispatcher = loaders.__getitem__
def my_dispatcher(column):
    return loaders[column]
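With the loaders dictionary and dispatcher in hand, you could construct an engine and run a pipeline against the simulated data. A rough sketch follows; the asset_finder argument is assumed to come from whatever asset metadata you have locally, and the exact SimplePipelineEngine constructor signature has shifted between Zipline versions, so treat this as illustrative rather than exact:

from zipline.pipeline import Pipeline

# Assumption: `asset_finder` is an AssetFinder that knows about AAPL and MSFT.
engine = SimplePipelineEngine(
    get_loader=my_dispatcher,
    calendar=dates,
    asset_finder=asset_finder,
)

# Columns from MyDataSet now resolve through the DataFrameLoaders above.
pipe = Pipeline(
    columns={
        'a': MyDataSet.column_A.latest,
        'b': MyDataSet.column_B.latest,
    },
)
result = engine.run_pipeline(pipe, dates[0], dates[-1])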
Unfortunately, there are still some obstacles in the way of using this on the research platform:
- The SimplePipelineEngine instance used by quantopian.research.run_pipeline is set up behind the scenes, so there's no supported way to pass in your custom dispatch function. The engine construction and storage are hidden by design, since the engine holds references to loaders, which in turn hold references to database resources that we don't want to expose publicly for security and performance reasons.
- Even if you could construct your own engine on the research platform, the engine needs additional resources beyond the loader dispatcher, and those aren't as easily customized.
Given all of this, my first thought on making this more extensible on Q research would be to add a new optional argument to run_pipeline (e.g. extra_loaders), which would allow users to provide a dictionary mapping columns to custom loaders. Doing so would let users inject their own custom datasets into the loader machinery safely, and would even allow them to override the loaders for existing datasets if done correctly.
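To make that concrete, the notebook-side call might look something like this; extra_loaders is purely hypothetical and doesn't exist today, and the pipeline and DataFrameLoaders are the ones built above:

from quantopian.research import run_pipeline

result = run_pipeline(
    pipe,
    start_date='2014-01-02',
    end_date='2014-12-31',
    # Hypothetical argument: map custom columns to the loaders defined earlier.
    extra_loaders={
        MyDataSet.column_A: DataFrameLoader(MyDataSet.column_A, column_A_frame),
        MyDataSet.column_B: DataFrameLoader(MyDataSet.column_B, column_B_frame),
    },
)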