Self-serve data query

Hi, have a quick question about how I can use my uploaded data within this function:

import numpy as np

def initialize(context):  
    context.asset_pairs = [[symbol('XXX'), symbol('YYY'),  
                            {'short_1': False, 'long_1': False, 'short_2': False, 'long_2': False,  
                             'hedge_history': np.array([])}],  
                           [symbol('ZZZ'), symbol('VVV'),  
                            {'short_1': False, 'long_1': False, 'short_2': False, 'long_2': False,  
                             'hedge_history': np.array([])}]]

    schedule_function(rebalance, date_rules.every_day(),  
                      time_rules.market_close(hours=2))  
    set_slippage(slippage.VolumeShareSlippage(volume_limit=0.025, price_impact=0.0))  
    set_commission(commission.PerShare(cost=0.0, min_trade_cost=0.0))  

The algorithm runs fine when I put in tickers available on Quantopian. How do I prepare my self-serve data for use in this function?
The four data files (commodity data) were uploaded as CSVs and look like this:

Symbol,Date,Close  
xxx,26/06/2018,5.0732

Many thanks!

11 responses

@Chris Myles was very helpful in getting me going in the right direction on this.
I pieced together the approaches in these two posts.

https://www.quantopian.com/posts/help-accessing-self-serve-data
https://www.quantopian.com/posts/live-custom-data-what-to-expect

My pipeline basically looks like this:

def my_pipeline(context):  
    return Pipeline(  
        columns={  
            'asof_date': my_dataset.asof_date.latest,  
            'symbol': my_dataset.symbol.latest,  
            'val1': my_dataset.val1.latest,  
            'val2': my_dataset.val2.latest,  
            'val3': my_dataset.val3.latest,  
            'val4': my_dataset.val4.latest,  
            'val5': my_dataset.val5.latest,  
            'val6': my_dataset.val6.latest,  
        },  
    )  

And I'm doing this:

def before_trading_start(context, data):  
    context.output = pipeline_output('my_pipeline').dropna()  
    df = pd.DataFrame(context.output)  
    df = df.reset_index()  
    print(df)  

You'll have to set up the data as in the posts above and also refer to your unique id and dataset.  
Hope this helps.  
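
For completeness, a minimal sketch of the initialize() wiring and imports these two snippets assume; the user_... import path and dataset name are placeholders you'd replace with the unique id and dataset shown on your Self-Serve Data page:

import pandas as pd
from quantopian.algorithm import attach_pipeline, pipeline_output
from quantopian.pipeline import Pipeline
# Placeholder -- replace with your own Self-Serve import, e.g.:
# from quantopian.pipeline.data.user_1234567890 import my_dataset

def initialize(context):
    # Register the pipeline defined above so pipeline_output('my_pipeline')
    # is available in before_trading_start.
    attach_pipeline(my_pipeline(context), 'my_pipeline')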

Hi Bryan, are you using asof_date to select rows for the backtest date in question?
I have not successfully managed to get only those rows that pertain to the backtest date and wondered if it works for you.
Do let me know when convenient.
Much appreciate
Savio

Hello Savio.
I decided not to integrate the custom data function into my contest algos at this time. So, the attached algo is rough.
Basically, it pulls data from my dataset on Q named 'my_data'. That data has three columns: 'date', 'symbol' and 'my_values'.
It sorts the data by 'asof_date' and only keeps the most recent rows. Note: the index is reset because the original pipeline output index was the symbol mapping ID and I needed access to those for placing trades. That is also why the symbols and values are being added to a list.
Hope it helps!
In one of the linked posts above a method named BusinessDaysSincePreviousEvent is used. You may have a use for it.

For testing, this is what my sample data looks like.

date,symbol,my_values
6/1/2016,GCP,1
6/1/2016,EXPE,2
6/1/2016,DOW,3
6/1/2016,DOC,4
6/1/2016,SHO,5
6/1/2016,INGR,6
6/1/2016,QGEN,7
6/1/2016,AWK,8
6/1/2016,VG,9
6/1/2016,T,10
6/2/2016,CFX,11
6/2/2016,TWTR,12
6/2/2016,CRM,13
6/2/2016,OLED,14
6/2/2016,EVTC,15
6/2/2016,GCP,16
6/2/2016,EXPE,17
6/2/2016,PMT,18
6/2/2016,IBM,19
6/2/2016,FCX,20
6/2/2016,INGR,21
6/2/2016,QGEN,22
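
To make those steps concrete, here is a rough, simplified sketch (not the attached algo itself), assuming the pipeline is registered as 'my_pipeline' and exposes 'asof_date' and 'my_values' columns:

def before_trading_start(context, data):
    output = pipeline_output('my_pipeline').dropna()
    # Keep only the rows carrying the freshest asof_date.
    latest = output[output['asof_date'] == output['asof_date'].max()]
    # The pipeline output index holds the asset objects; collect them and
    # the values into lists for placing trades later.
    context.my_assets = list(latest.index)
    context.my_values = list(latest['my_values'])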

Thank you Bryan - that is most helpful. I will use it to tweak and test my code and I will post the code after I get it working.
Much appreciate, Savio

@Alex:
Self-Serve Data supports reduced custom datasets that have a single row of data per asset per day, in .csv file format. This format fits naturally into the expected pipeline format. I see from the code snippet you attached that you intend to identify pairs of assets to trade in a pairs strategy. To better execute this, the following workaround could be helpful: in the offline process of preparing your dataset for upload, you could include a separate column of data, mapped to each asset every day, that corresponds to the ticker symbol of the second asset in each pair (of type String). Here's a sample dataset of two assets, AAPL and INGR, with a value column called Paired_Symbol holding the second asset of each pair:

Date,Symbol,Paired_Symbol  
7/1/2017,AAPL,TSLA  
7/1/2017,INGR,IBM  
7/2/2017,AAPL,FCX  
7/2/2017,INGR,GCP  

This column can be included in your strategy's Pipeline as another factor. This way, your strategy can access each pair by indexing into the Pipeline output. Of course, this requires your off-platform data wrangling step to determine each (Symbol,Paired_Symbol) pair. You can of course add other value columns to your uploaded data (per your example, long_1, short_1, long_2, etc.) to specify how to trade each pair.
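
As a non-authoritative sketch of how that Paired_Symbol column might be consumed: the dataset import ('my_dataset'), the pipeline name, and the assumption that both legs of each pair also appear as rows in the dataset (so the ticker string can be matched against assets already in the output) are all assumptions, not part of the post above.

def make_pairs_pipeline():
    # 'my_dataset' is a placeholder for your imported Self-Serve dataset.
    return Pipeline(
        columns={
            'paired_symbol': my_dataset.paired_symbol.latest,  # e.g. 'TSLA'
        },
    )

def before_trading_start(context, data):
    out = pipeline_output('pairs_pipeline').dropna()
    # Map ticker string -> asset object using the assets already in the
    # output, so no runtime symbol lookup is needed.
    by_ticker = {asset.symbol: asset for asset in out.index}
    context.asset_pairs = []
    for asset, row in out.iterrows():
        paired = by_ticker.get(row['paired_symbol'])
        if paired is not None:
            context.asset_pairs.append((asset, paired))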

@Savio and @Bryan, from the conversation above I interpret 2 different intentions:

  1. Retrieving most recent rows in Pipeline output:
    It's important to keep in mind that in a backtest that constructs a Pipeline daily, each day's Pipeline output will contain values for only that day. Therefore, there's no need to include any logic to retrieve assets for only that day and store them in a separate list.

  2. Retrieving Freshly-Updated Data:
    The BusinessDaysSincePreviousEvent Pipeline factor, mentioned by @Bryan in his last response, can be used to check if data in Pipeline output have been updated within a provided threshold number of business days. This ensures that your Pipeline is using data that are not forward-filled with values from some previous date.
    The BusinessDaysSincePreviousEvent factor can also be used in Pipelines constructed with data uploaded via Self-Serve Data! I strongly recommend taking a look at the template algorithm attached in a comment to this post. For convenience, here's the template's make_pipeline() function with the factor incorporated in the Pipeline screen, in which only rows of data updated within the last 2 business days are included:

def make_pipeline():  
    """  
    A function that creates and returns our pipeline.

    We break this piece of logic out into its own function to make it easier to  
    test and modify in isolation. In particular, this function can be  
    copy/pasted into research and run by itself.

    Returns  
    -------  
    pipe : Pipeline  
        Represents computation we would like to perform on the assets that make  
        it through the pipeline screen.  
    """  
    base_universe = QTradableStocksUS()  
    # TODO: Replace with your dataset name.  
    days_since_last_update = BusinessDaysSincePreviousEvent(  
        inputs=[[dataset_name].asof_date.latest]  
    )  
    # TODO: Update threshold based on live update frequency  
    has_recent_update = (days_since_last_update < 3)  
    universe = (has_recent_update & base_universe)  
    # TODO: Replace with your dataset name and the name of your value column. In this example, we assume your value column is a number:  
    factor = [dataset_name].value_column.latest

    # Create pipeline  
    pipe = Pipeline(  
        columns={  
            'factor': factor,  
        },  
        screen=factor.notnull() & universe  
    )  
    return pipe

Let me know if this helps!


Hello Robert
Thank you - that is most helpful.
You are right - I am having trouble getting only those rows that relate to the backtest date in question to show up (I am using the debugger to view the rows being picked up).
A theory I have is that the self-serve data comes pre-indexed by symbol.
So I might have to re-index this by asof_date.
I wondered if you know of any post with code for this that I could have a look at?
A sample of the data I am using is below.
It could also be that, since not all symbols exist for each date, I might have to add dummy rows for those symbols on those dates.
When convenient please let me know if I am on the right track or way off in left field.
Much appreciate and a pleasant day, Savio
symbol,date,limit,allocation
AKAM,1/2/2008,35.05,0.02436737
F,1/2/2008,5.26,0.02889153
DASTY,1/3/2008,24.9,0.02739239
F,1/3/2008,5.13,0.02889153
DASTY,1/4/2008,25,0.00817057
TLT,1/4/2008,68.57,0.43264301
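
For illustration only, re-indexing the pipeline output by asof_date, as proposed above, could look roughly like this; it is a sketch against the df built in the earlier before_trading_start snippet and is untested against this dataset:

# df is the pipeline output with its index already reset, as in the
# earlier snippet.
df = df.set_index('asof_date').sort_index()
# Rows for the most recent date available in the output:
latest_rows = df.loc[[df.index.max()]]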

Hi Bryan
In your code do you use CurrentDate? I could not find a reference to it after it was assigned.
I am trying to match the current date with asof_date.
Do let me know at your convenience.
Much appreciate and a pleasant day
Savio

Hello Savio. Hope you are well.
You may want to refer to the docs and this thread for a better understanding of how the self-serve data function handles historical data. Since the self-serve data pipeline pulls forward the most recent data, you might not need any additional date-checking logic.
In my case, though, I do.
Regarding context.CurrentDate = get_datetime().date(): it isn't used in the code above, but it is used in another version. I accidentally left it in.
Should you decide to match the current date with asof_date, here are a couple of things to keep in mind (a rough sketch follows the list):
- For comparing dates you'll need to include "from datetime import datetime, date, time" prior to initialize.
- Using my code above as a reference, once the data is sorted by most recent date you would get the most_recent_date by using "most_recent_date = df'asof_date'.date()". By default the asof_date is a timestamp, so it needs to be converted to a date in order to compare it to "context.CurrentDate", which is of type date.
- I wouldn't expect the most recent asof_date to == the current date. The algo will see yesterday's data, in the same way it would access yesterday's OHLC. You wouldn't be able to pass it today's OHLC ; ) At least that's my understanding.
- You may have to account for the gap between Friday and Monday if you will be subtracting dates from each other to determine what yesterday's data is.
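
A rough sketch of that comparison, assuming context.CurrentDate was set to get_datetime().date() and df holds the pipeline output as in the earlier snippet:

most_recent_date = df['asof_date'].max().date()
days_old = (context.CurrentDate - most_recent_date).days
# Expect data as of the prior trading day; allow up to 3 calendar days
# so the Friday -> Monday gap doesn't look stale.
data_is_fresh = 0 < days_old <= 3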

You have probably considered these things already.
Hope this helps.
Bryan

Hello Bryan
That was super helpful - thank you for kindly taking the time to give me all that detail.
Will let you know how I make out.
Thanks again.
Much appreciate and a pleasant day
Savio

Savio,
Glad to help.
There's a typo in my previous post.
most_recent_date = df'asof_date'.date()
Should be:
most_recent_date = df['asof_date'][1].date()
This would yield the most recent date of a dataset sorted by asof_date.

Thanks again Bryan for all your time and help.
Will be in touch.
Much appreciate
Savio