Pipeline frequency in research

Is there a way to have a pipeline in a notebook output at a lower frequency than every day? I'm doing some longer-term research and daily output is too slow, so I would rather output just once a month. I tried schedule_function, but that didn't seem to work.

11 responses

Eric Bell

When I run the pipeline in research just to make sure I can get output, I limit the days using run_pipeline:

myPipe = run_pipeline(pipe, start_date='2015-04-19', end_date='2015-04-21')
myPipe
This will only give me output for 3 days.

If you want to see only the information for 4/20/15, use:

myPipe.xs('2015-04-20').dropna()

I believe schedule_function only works in algorithm mode, not in research notebooks.
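A minimal sketch of what the xs(...).dropna() step does, using a small hand-built frame instead of real pipeline output. The assumption here is that pipeline output is indexed by (date, asset); the tickers and values are made up, and a pd.Timestamp key is used in place of the date string from the post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pipeline output: a (date, asset) MultiIndex.
dates = pd.to_datetime(['2015-04-20', '2015-04-21'])
index = pd.MultiIndex.from_product([dates, ['AAA', 'BBB']],
                                   names=['date', 'asset'])
fake_output = pd.DataFrame({'close': [10.0, np.nan, 11.0, 21.0]}, index=index)

# Take the cross-section for one date, then drop assets with missing values.
one_day = fake_output.xs(pd.Timestamp('2015-04-20')).dropna()
print(one_day)
```

The result is indexed by asset only, with the rows containing NaN removed.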

I attached my standard notebook that I use when I start a pipeline strategy.

Bing Wong

I've created a few functions that might help you out.

The following function takes a list of dates (timestamps), and runs the pipeline for each date, and then returns a single merged dataframe of that info.

import pandas as pd
from quantopian.research import run_pipeline

def run_pipeline_list(pipe, pipeline_dates):
    """
    Run the pipeline once per date and merge the results.
    run_pipeline fails over a very long period of time (memory usage),
    so we split the pipeline into chunks and concatenate the results.
    """
    chunks = []
    for date in pipeline_dates:
        # each entry is a one-element DatetimeIndex, so date[0] is the timestamp
        print("Processing {:.10} pipeline".format(str(date[0])))
        results = run_pipeline(pipe, date[0], date[0])
        # convert category dtype to object
        for col in list(results.select_dtypes(include=['category']).columns):
            results[col] = results[col].astype('O')
        print("shape: {}\n".format(results.shape))
        chunks.append(results)
    try:
        combined = pd.concat(chunks)
    except Exception:
        # fall back to the list of per-date frames
        print("\npd.concat failed")
        return chunks
    print("\nCombined dataframe created")
    return combined

This function generates a list of dates (e.g. Tuesday of every week) to be fed into the above function. You can easily modify the code for end of month dates.

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def dt_intervals(beg_date, end_date, day_index):
    ''' Creates one date per week to save on memory
        parameters:
        ----------
        day_index => day of the week to run calculations
        0-mon, 1-tues, 2-wed, . . . 4-fri
        returns: list of one-element DatetimeIndex objects, in date order
    '''
    trng = pd.date_range(beg_date, end_date)
    cal = USFederalHolidayCalendar()
    holidays = cal.holidays(start=trng.min(), end=trng.max())  # holidays occurring between start/end dates
    trng_no_holidays = trng[~trng.isin(holidays)]  # exclude US federal holidays
    trng_no_holidays_wknds = trng_no_holidays[trng_no_holidays.weekday < 5]  # exclude Saturday/Sunday (weekday 5/6)
    pipeline_dates = []

    # Group by the Monday of each date's week. Grouping by (year, week number)
    # can mix early-January and late-December dates that share an ISO week number.
    week_start = trng_no_holidays_wknds - pd.to_timedelta(trng_no_holidays_wknds.weekday, unit='D')
    for monday in week_start.unique():
        temp = trng_no_holidays_wknds[week_start == monday]  # eligible days in this week
        day = temp[temp.weekday == day_index]  # requested day: 0-Monday, 1-Tuesday . . . 4-Friday
        if len(day) == 1:
            pipeline_dates.append(day)
        else:
            # requested day is missing (e.g. a holiday): use the last eligible day of the week
            pipeline_dates.append(temp[temp.weekday == temp.weekday.max()])
            # pipeline_dates.append(temp[temp.weekday == temp.weekday.min()])  # beginning of week
    return pipeline_dates
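For the once-a-month case from the original question, the same idea can be adapted to month-end dates. This is a minimal sketch, not a tested Quantopian recipe: month_end_dates is a hypothetical helper name, and it assumes the US federal holiday calendar is an acceptable stand-in for the trading calendar (the two are not identical).

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def month_end_dates(beg_date, end_date):
    """Return the last eligible weekday of each month as a list of Timestamps."""
    days = pd.bdate_range(beg_date, end_date)  # Monday to Friday only
    holidays = USFederalHolidayCalendar().holidays(start=days.min(),
                                                   end=days.max())
    days = days[~days.isin(holidays)]          # drop US federal holidays
    s = pd.Series(days, index=days)
    # Group by (year, month) and keep the latest date in each group.
    years = np.asarray(days.year)
    months = np.asarray(days.month)
    last = s.groupby([years, months]).max()
    return list(last)

dates = month_end_dates('2015-01-01', '2015-03-31')
print(dates)
```

Note that this returns plain Timestamps, while run_pipeline_list as posted indexes each entry with date[0]. One way to bridge that is to pass [[d] for d in dates], or to call run_pipeline(pipe, d, d) directly for each date.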

Daniel Potts

Great solution, thanks!

Dan Whitnable

Be careful.
Pipeline data is split adjusted for each date. Comparing pipeline data across dates will be misleading if splits occur.

*** The text below is NOT correct as originally posted, but I left it in so the remaining posts make sense. See the notebook a few posts below. ***
The results from pipeline are all split adjusted as of the end_date of run_pipeline method in research. This means that 'piecing' separate intervals of data together as above can result in data which isn't split adjusted as of a common date and will be incorrect.

Bing Wong

Agreed. Merging pipeline chunks should not be done with anything that uses share counts or pricing data. Fundamental data line items that are not per-share should be fine.

Luca

@Dan, are you sure about the end_date? I believe each factor receives as its input prices that are split-adjusted as of the day it computes its values. That means that on every single day in the pipeline time frame, the prices are split-adjusted as of that day and then passed as input to each factor requesting them. That's the reason why, when using factors as input to other factors, you cannot rely on those values unless they are marked window_safe.

EDIT: I never tested it, though. I just thought it was more reasonable (correct) to split-adjust as of each day than to do it all at once for the whole pipeline period.

Dan Whitnable

@Luca

I didn't actually test the pipeline split adjusting in a notebook (which I usually do before responding in the forums here), however, this behavior was explained by Jamie McCorriston in this post https://www.quantopian.com/posts/help-notebook-vs-algorithm. If Jamie says it's true I'll bet it's true. :)

Luca

Thanks Dan, I missed that explanation. If that's the behavior, I don't like it; it's not correct. I will do some tests in Research to verify it.

Let's say we build a pipeline filter on the close price. If the price is split-adjusted for the full pipeline time frame, the filter will not work correctly.

Dan Whitnable

@Luca

I believe you're correct! (And that's the last time I'll post without trying first.) See attached notebook. It appears that pipeline returns prices as of that day. As you stated, it gives the factors data that is split-adjusted as of that day. This makes using multi-day pipeline data problematic in the notebook environment. A better approach would probably be to use the 'get_pricing' function, which does adjust for splits.

Luca

Thanks Dan, I've just run my own test and came to the same conclusion :)

This is good news: this is the correct behaviour for pipeline. You cannot calculate correct factors if price/volume are split-adjusted as of a future date. This is also consistent with window_safe and with algorithm behaviour.

Of course, we need to be aware of possible splits, mergers, and dividends when comparing prices coming from different runs of pipeline, but we have to deal with that.

Jamie McCorriston

Sorry guys! That was my mistake. I've corrected my post in the other thread. Only get_pricing and history (in quantopian.research.experimental) are adjusted as of the end_date. Pipeline is always adjusted as of the 'current' date (either in a backtest or a run_pipeline simulation).
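To illustrate why the adjustment date matters when comparing values across dates, here is a small sketch with made-up numbers. It assumes a hypothetical stock with a 2-for-1 split and compares prices as seen on each day (the pipeline-style view) with prices adjusted as of the last date (the get_pricing-style view). Nothing here calls the Quantopian API.

```python
import pandas as pd

dates = pd.to_datetime(['2015-06-01', '2015-06-02', '2015-06-03'])

# Hypothetical 2-for-1 split effective on 2015-06-02.
split_date = pd.Timestamp('2015-06-02')
split_ratio = 0.5

# Close price as seen on each day, each value adjusted only as of its own date.
as_of_each_day = pd.Series([100.0, 50.0, 51.0], index=dates)

# Adjust everything as of the last date: scale pre-split prices by the ratio.
factor = pd.Series(1.0, index=dates)
factor.loc[dates < split_date] = split_ratio
as_of_end_date = as_of_each_day * factor

naive_returns = as_of_each_day.pct_change()     # shows a spurious -50% move
adjusted_returns = as_of_end_date.pct_change()  # shows 0% on the split day
print(pd.DataFrame({'naive': naive_returns, 'adjusted': adjusted_returns}))
```

The per-day values are the right input for a factor computed on that day, but returns computed across days from them pick up the split as if it were a price move.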
