Need help with pipeline

posted Nov 19, 2016

I am reading some volatility strategies, as they are popular on the site, such as
https://www.quantopian.com/posts/trading-vix-quandl-data-now-in-pipeline-for-backtesting-and-live-trading

def initialize(context):  
    """  
    Called once at the start of the algorithm.  
    """  
    # Rebalance every day, 1 hour after market open.  
    schedule_function(my_rebalance, date_rules.every_day(), time_rules.market_open(hours=1))  
    # Record tracking variables at the end of each day.  
    schedule_function(my_record_vars, date_rules.every_day(), time_rules.market_close())  
    # Create our dynamic stock selector.  

    # Create, register and name a pipeline in initialize.  
    pipe = Pipeline()  
    attach_pipeline(pipe, 'example')  
    # Add the TermStructureImpact factor to the Pipeline  
    ts_impact = TermStructureImpact()  
    pipe.add(ts_impact, 'ts_impact')  
    # Define our securities, global variables, and schedule when to trade.  
    context.vxx = sid(38054)  
    context.xiv = sid(40516)  
    context.impact = 0

def before_trading_start(context, data):  
    """  
    Called every day before market open.  
    """  
    output = pipeline_output('example')  
    output = output.dropna()  
    log.info(output)  
    context.impact = output["ts_impact"].loc[context.vxx]

But the output part really confuses me. I know that every day (every bar) it does some calculation to update the pipeline. But why does the pipeline output have to include thousands of securities in its index? (Maybe the loc[vxx] lookup takes a lot of calculation.) In this case it only needs VXX in the index. Any ideas? See the plot below for two days of output.

pipeline output

3 responses

Dan Whitnable

Nov 19, 2016

The output of a pipeline is a two-dimensional pandas DataFrame. The rows are securities in the Quantopian database (roughly 8300). By default every security is returned. If you set a pipeline screen, only those securities passing the screening filter are returned. The columns are simply "factors", or data associated with each security. These can be built-in factors such as latest close, RSI, etc., or, in your example, a custom factor that returns a VXX calculation. In a backtest the output is the factor data for one day (the backtest day). In the research environment the resulting dataframe can span multiple days, so the output has a hierarchical index (day + security) and the dataframe can be thought of as having three dimensions. See https://www.quantopian.com/help#using-results in the documentation.
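To make the two shapes concrete, here is a minimal sketch in plain pandas. It only mimics the layout described above and does not call the Quantopian API; the ticker strings, dates, and factor values are made up for illustration (on Quantopian the index holds security objects rather than strings).

```python
import pandas as pd

# Hypothetical stand-ins for security objects.
securities = ["VXX", "XIV", "SPY"]

# One backtest day: rows are securities, columns are factors.
daily = pd.DataFrame({"ts_impact": [0.12, 0.12, 0.12]}, index=securities)

# Several days, as in research: rows are (day, security) pairs.
days = pd.to_datetime(["2016-11-17", "2016-11-18"])
index = pd.MultiIndex.from_product([days, securities], names=["day", "security"])
multi = pd.DataFrame({"ts_impact": [0.10] * 3 + [0.12] * 3}, index=index)

# Taking one day out of the multi-day frame gives back a single-day frame.
one_day = multi.xs(days[1], level="day")
```

Here `one_day` has the same rows and values as `daily`, which is why the multi-day frame can be read as a stack of daily frames.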

All the factors and associated calculations are done for EVERY security in a pipeline. That's just the way it works. It might seem like overkill if all you want is one piece of data, but all that processing is free, so don't worry about it. Simply pull the piece(s) of data you need from the output. In this case context.impact = output["ts_impact"].loc[context.vxx] gets the single value from column "ts_impact" in the row for security sid=38054 (because context.vxx = sid(38054)).
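The lookup itself is ordinary pandas label indexing. The sketch below reproduces it outside Quantopian, using plain integers in place of security objects; the id 99999 and the factor values are made up for illustration.

```python
import pandas as pd

vxx, xiv, other = 38054, 40516, 99999  # 99999 is a hypothetical id

output = pd.DataFrame(
    {"ts_impact": [0.12, 0.12, float("nan")]},  # made-up values
    index=[vxx, xiv, other],
)
output = output.dropna()  # drops the row for `other`

impact = output["ts_impact"].loc[vxx]      # column first, then row label
same_value = output.loc[vxx, "ts_impact"]  # row label and column in one call

# .loc raises KeyError for a label that dropna() removed, so check first.
fallback = output["ts_impact"].loc[other] if other in output.index else 0.0
```

The guard on the last line may be worth keeping in an algorithm that calls dropna() before the lookup, since a NaN factor value for VXX would otherwise raise an error.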

As an aside, using VXX data (and other data not specifically associated with a security) will yield the same factor value for each day for every security in the pipeline output. The link in the previous post shows this. Every security has the same column value each day.
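One way to see why every row matches is to assume the custom factor writes a single market-wide number into its per-security output array. That is an assumption about how such a factor could be written, not something taken from the linked post, and the sizes and values below are made up.

```python
import numpy as np

n_assets = 5       # hypothetical number of securities in the pipeline
vix_signal = 0.12  # made-up market-wide value for one day

out = np.empty(n_assets)
out[:] = vix_signal  # NumPy broadcasts the scalar into every security's slot
```

Because each slot holds the same number, reading the row for VXX gives the same result as reading any other row.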

Make sense?

Eric K

Nov 19, 2016

@Dan "It might seem like overkill if all you want is one piece of data but all that processing is free"? So in live trading, could calculating data for all 8300 securities waste some time?

Dan Whitnable

Nov 19, 2016

Unless you start getting timeout errors don't worry about wasting calculation time.

One could minimize the time spent in a factor "compute" method if desired. Pass a mask to the factor which will then skip the compute calculation for those securities not passing the mask filter. Those securities simply return a NaN instead. Again, generally one doesn't need to care unless the algorithm explicitly begins giving timeout errors.
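The effect of a mask can be imitated in plain NumPy. This is only a sketch of the behavior described above (compute where the mask is True, NaN elsewhere), not the pipeline engine's implementation; the helper name compute_with_mask and the price window are made up for illustration.

```python
import numpy as np

def compute_with_mask(values, mask, func):
    """Apply func only to the columns where mask is True; NaN elsewhere."""
    out = np.full(values.shape[-1], np.nan)
    out[mask] = func(values[:, mask])
    return out

# Made-up window: 3 days (rows) x 4 securities (columns).
window = np.array([[10.0, 20.0, 30.0, 40.0],
                   [11.0, 19.0, 33.0, 40.0],
                   [12.0, 18.0, 36.0, 40.0]])
mask = np.array([True, False, True, False])

# The column mean is computed for securities 0 and 2 only.
result = compute_with_mask(window, mask, lambda w: w.mean(axis=0))
```

Only the masked-in columns are passed to func, so the work scales with the number of securities that pass the filter rather than with the full universe.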

The alternative would be to get data programmatically outside of the pipeline. However, only price data (via the history method) and fundamental data (via the get_fundamentals method) are available this way (VXX is only available using the pipeline). Additionally, it's not certain that would actually be more efficient. The database calls behind the scenes are potentially optimized and/or cached in pipeline for retrieving all security data, and may actually be more efficient than doing one-at-a-time calls.

Don't worry about impact during live trading if you're not getting timeout errors. Foremost, design your code to be readable, re-usable, and to rely on the built in (and presumably well tested) methods. ONLY if you begin getting errors, then start looking into "efficiency".
