possible to simulate inputs to pipeline in the research platform?

I'm interested in exploring the feasibility of using simulated data sets as inputs to pipeline, versus using real-world data.

Is it possible to simulate inputs to pipeline in the research platform? For example, could one replace:

from quantopian.pipeline.data import morningstar  
from quantopian.pipeline.data.builtin import USEquityPricing  

with data simulated in the workbook?

Alternatively, is it possible to modify the values of the imported data sets, prior to processing by pipeline?


Hi Grant,

Is it possible to simulate inputs to pipeline in the research platform? For example, could one replace:

from quantopian.pipeline.data import morningstar  
from quantopian.pipeline.data.builtin import USEquityPricing  

with data simulated in the workbook?

Short Answer:

What you're describing is probably possible, and the system architecture is explicitly designed to support this sort of extension, but there isn't a clean way to do it in research right now. The end of the long answer has some thoughts for what an extension API might look like.

Long Answer:

One of the core ideas in the Pipeline architecture is that it separates the abstract interface to a dataset from the concrete implementation of loading that dataset. In simple terms, this means that we define how a dataset will be used in one place, and we define how a dataset should be loaded in another place. The main benefit of this is that we can have multiple loader implementations for any given dataset, which is a nice property for an open-source library like Zipline, particularly in an industry where most of the participants are working with closed/proprietary data. Another context in which it's useful to be able to swap out loader implementations is testing, as you allude to in the second half of your post.

The main downside of decoupling dataset definition from dataset implementation is that we have to do the extra work of matching up datasets and loaders whenever we want to actually use a dataset. The main way this is accomplished for Pipeline loaders in Zipline is that SimplePipelineEngine (the class that's responsible for loading pipeline data and passing it to user-defined compute functions) accepts a dispatching function named get_loader whose signature is get_loader(column) -> loader.

So, if you were working with Zipline outside the Q research platform, the way you'd do what you're describing would be something like:

from zipline.pipeline.data import Column, Dataset

class MyDataset(DataSet):  
    column_A = Column(dtype=float)  
    column_B = Column(dtype=bool)  

You'd then have to build a function that dispatches from columns of your datasets to loaders for those columns. For large datasets, this is a nontrivial engineering challenge with many interesting tradeoffs. If you just want to test against known data, though, you'd probably want to describe your data as a simple in-memory structure, and in fact zipline.pipeline.loaders.DataFrameLoader provides a loader that uses an in-memory DataFrame as its internal storage. The code for setting up a dispatching function to load columns from MyDataSet would be something like:

from zipline.pipeline.engine import SimplePipelineEngine  
from zipline.pipeline.loaders.frame import DataFrameLoader

# Provide data for 2014.  
dates = pd.date_range('2014-01-01', '2015-01-01')

# Provide data for just AAPL and MSFT.  
assets = symbols(['AAPL', 'MSFT'])

# The values for Column A will just be a 2D array of numbers ranging from 0 to N - 1.  
column_A_frame = pd.DataFrame(  
    data=arange(len(dates) * len(assets), dtype=float).reshape(len(dates), len(assets)),  
    index=dates,  
    columns=assets,  
)

# Column B will always provide True for AAPL and False for MSFT.  
column_B_frame = pd.DataFrame(data={assets[0]: True, assets[1]: False}, index=dates)

loaders = {  
    MyDataSet.column_A: DataFrameLoader(MyDataSet.column_A, column_A_frame),  
    MyDataSet.column_B: DataFrameLoader(MyDataSet.column_B, column_B_frame),  
}

# If you're feeling fancy, you could also write this as:  
#     my_dispatcher = loaders.__getitem__  
def my_dispatcher(column):  
    return loaders[column]  

Unfortunately, there are still some obstacles in the way of using this on the research platform:

  1. The SimplePipelineEngine instance used by quantopian.research.run_pipeline is set up behind the scenes, so there's no supported way to pass in your custom dispatch function. The engine construction and storage are hidden by design, since the engine holds references to loaders, which in turn hold references to database resources that we don't want to expose publicly for security and performance reasons.
  2. Even if you could construct your own engine on the research platform, the engine needs additional resources beyond the loader dispatcher, and those aren't as easily customized.

Given all of this, my first thought on making this more extensible on Q research would be to add a new optional argument to run_pipeline (e.g. extra_loaders), which would allow users to provide a dictionary mapping columns to custom loaders. Doing so would allow users to inject their own custom datasets into the loader machinery safely, and would even allow users to override the loaders for existing datasets if done correctly.
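
To make that concrete, a call using such a hypothetical argument might look something like this (to be clear, extra_loaders does not exist in run_pipeline today; the dataset and loaders are the ones defined above):

from quantopian.research import run_pipeline

# Hypothetical API -- the extra_loaders argument is a proposal, not a
# feature of run_pipeline today.
result = run_pipeline(
    my_pipeline,                      # a Pipeline built from MyDataSet columns
    start_date='2014-01-02',
    end_date='2014-12-31',
    extra_loaders={
        MyDataSet.column_A: DataFrameLoader(MyDataSet.column_A, column_A_frame),
        MyDataSet.column_B: DataFrameLoader(MyDataSet.column_B, column_B_frame),
    },
)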

-- Scott

Thanks Scott. --Grant

Hi Scott,

One use case is to be able to develop and check code that uses pipeline, with known inputs. The question was motivated by Andrew's post, https://www.quantopian.com/posts/factor-tear-sheet. For the specific example, the idea would be to gin up input data sets as source data for the notebook, so that one could see the effectiveness of the calculations on a variety of known input cases. Various effects could be added, including noise, to determine the sensitivity of the analyses in detecting signals. It would also be a way to gain confidence in the code, without having to go through it line-by-line. If I give it simple, known inputs and I get simple, understandable outputs, I can have confidence that it is working.
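
For instance, the simulated input might be generated something like this (just a sketch; the signal shape and noise level are arbitrary, and the resulting frame would be handed to a DataFrameLoader as in Scott's example):

import numpy as np
import pandas as pd

dates = pd.date_range('2014-01-01', '2015-01-01')
assets = ['AAPL', 'MSFT']  # placeholders; in a real run these need to map to known assets/sids

# A known, slow-moving "signal" per asset that the downstream analysis
# should be able to recover.
signal = pd.DataFrame(
    {asset: (i + 1) * np.sin(np.linspace(0, 4 * np.pi, len(dates)))
     for i, asset in enumerate(assets)},
    index=dates,
)

# Add Gaussian noise at a chosen amplitude to probe how quickly the
# analysis loses the signal as the noise level rises.
noise_level = 0.25
column_A_frame = signal + noise_level * np.random.randn(len(dates), len(assets))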

Regarding memory limitations, do research platform users have the ability to read/write to disk? If so, how much disk space is available? This would be ideal, so that one could create data sets with one notebook, and then read them into multiple notebooks.

Grant

Probably a fee for that.

Hi,
when I tried

from zipline.pipeline.data import Column, Dataset  

it failed to import Dataset

can you please help?

from zipline.pipeline.data import Column, DataSet

is working. DataSet instead of Dataset

here is the working code:

from zipline.pipeline.data import Column  
from zipline.pipeline.data import DataSet

class MyDataSet(DataSet):  
    column_A = Column(dtype=float)  
    column_B = Column(dtype=bool) 

from zipline.pipeline.engine import SimplePipelineEngine  
from zipline.pipeline.loaders.frame import DataFrameLoader  
from zipline.api import symbols

import pandas as pd  
from numpy import arange

# Provide data from the start of 2014 through the start of 2017.  
dates = pd.date_range('2014-01-01', '2017-01-01')

# Provide data for just AAPL and MSFT.  
assets = ["AAPL","MSFT"]


# The values for Column A will just be a 2D array of numbers ranging from 0 to N - 1.  
column_A_frame = pd.DataFrame(  
    data=arange(len(dates) * len(assets), dtype=float).reshape(len(dates), len(assets)),  
    index=dates,  
    columns=assets,  
)

# Column B will always provide True for AAPL and False for MSFT.  
column_B_frame = pd.DataFrame(data={assets[0]: True, assets[1]: False}, index=dates)

loaders = {  
    MyDataSet.column_A: DataFrameLoader(MyDataSet.column_A, column_A_frame),  
    MyDataSet.column_B: DataFrameLoader(MyDataSet.column_B, column_B_frame),  
}

# If you're feeling fancy, you could also write this as:  
#     my_dispatcher = loaders.__getitem__  
def my_dispatcher(column):  
    return loaders[column]  

just fixed some small typos and added the needed libraries ;)

Anyone know how to 'attach' or 'register' this 'my_dispatcher' to the pipeline engine?
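
For anyone still curious: outside the Quantopian research platform (where, as Scott explains above, the engine is constructed behind the scenes and can't be swapped out), the dispatcher is handed to SimplePipelineEngine when you build the engine yourself. A rough sketch follows, with two caveats: the constructor signature has changed across zipline versions, and the columns of the simulated frames would need to be actual assets/sids known to your asset_finder rather than plain ticker strings. The bundle name is just an example.

import pandas as pd

from zipline.data.bundles import load
from zipline.pipeline import Pipeline
from zipline.pipeline.engine import SimplePipelineEngine

# Assumes a data bundle has already been ingested locally; 'quandl' is
# only an example name. The bundle supplies the asset_finder that the
# engine needs (Scott's "additional resources").
bundle = load('quandl')

# NOTE: older zipline releases expect
#     SimplePipelineEngine(get_loader, calendar, asset_finder)
# where calendar is a DatetimeIndex of trading sessions; newer releases
# drop the calendar argument. Shown here in the newer keyword form --
# check your installed version.
engine = SimplePipelineEngine(
    get_loader=my_dispatcher,
    asset_finder=bundle.asset_finder,
)

# A pipeline that reads from the custom dataset defined above.
pipe = Pipeline(columns={
    'A': MyDataSet.column_A.latest,
    'B': MyDataSet.column_B.latest,
})

# Run it over a window covered by the simulated frames.
result = engine.run_pipeline(
    pipe,
    pd.Timestamp('2014-06-02', tz='utc'),
    pd.Timestamp('2014-06-30', tz='utc'),
)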