Help -- Accessing Self-Serve Data

Hello.
How do I bring in and access custom data that has been uploaded to my Datasets?

My .csv data looks like this:
date,symbol,val1,val2,val3,val4,val5,val6
6/1/2016,sid(45558),4,6,4,2,1,0.6441
6/1/2016,sid(1753),3,5,5,2,0,0.5359
6/1/2016,sid(10649),4,9,5,2,0,0.4989

And the code I'm using is below. How would you pull the data through the pipeline so that it is available in before_trading_start (BTS)?
My pipeline code is cringeworthy.
Thank you!

import pandas as pd  
from quantopian.algorithm import attach_pipeline, pipeline_output  
from quantopian.pipeline import Pipeline  
from quantopian.pipeline.data.user_myAlphaNumericUserID import theNameOfMyDataset as my_dataset 

def initialize(context):

    context.FirstDateOfBacktest = get_environment('start').date()  
    attach_pipeline(my_pipeline(context), 'my_pipeline')  
def before_trading_start(context, data):

    context.CurrentDate = get_datetime().date()

    if context.CurrentDate == context.FirstDateOfBacktest:  
        context.output = pipeline_output('my_pipeline')  
        print (context.output)  

def my_pipeline(context):

    data = my_dataset  
    pipe = Pipeline()  
    return Pipeline(  
        columns={  
            'date': date,  
            'symbol': symbol,  
            'val1': val1,  
            'val2': val2,  
            'val3': val3,  
            'val4': val4,  
            'val5': val5,  
            'val6': val6,  
        },  
    )  
    return pipe  

Hi Bryan,

I think you have a couple of issues compounding here.

  1. Your symbol column should contain tickers, not sid(nnn). That is causing 0 rows to be processed through symbol mapping, so no data is available to pipeline. You can use the load_metrics rows_added value or Research Interactive to validate that your initial data load worked as expected.
  2. In your pipeline example, the columns are missing the reference to your dataset and .latest, so Pipeline doesn't know where val1, val2, etc. are defined.

'val1': my_dataset.val1.latest should work in the example above, but I would start in Research to make sure you are getting the data you expect before moving to your algo.

FYI, there is a "How to use" pipeline example available on your dataset page (click on the dataset name in the Self-Serve Data section). That should be a good starting point.

I've attached a notebook that contains the base load_metrics check, the pipeline example from the "How to use" code, and the interactive example as a reference. You'll need to change the UserID to your own, but that is how I double-check all my Self-Serve datasets when I first load them.
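
For reference, here's roughly what that Research check looks like (just a sketch; the user ID and dataset name below are the placeholders from your import, so swap in your own):

from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
from quantopian.pipeline.data.user_myAlphaNumericUserID import theNameOfMyDataset as my_dataset

# One-column pipeline, just to confirm values are flowing through
pipe = Pipeline(
    columns={
        'val1': my_dataset.val1.latest,
    },
)

# Run over a few days around your data's dates; mapped symbols should show non-NaN values
result = run_pipeline(pipe, '2016-06-01', '2016-06-10')
print(result.dropna().head())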

Hope that helps. Let me know if you have other questions.


@Chris Myles Thank you for the response.
I spent a few hours studying the docs and examples and experimenting before making my original post, and I'm continuing to do so. Unfortunately, I'm not quite there yet.
The data has been changed and now uses ticker symbols instead of sids.
Here is the data being loaded in total from a .csv.

date,symbol,val1,val2,val3,val4,val5,val6
6/1/2016,AAPL,4,6,4,2,1,0.6441
6/1/2016,IBM,3,5,5,2,0,0.5359
6/1/2016,DOW,4,9,5,2,0,0.4989

The 'date' column is set as 'Primary Date' and the 'symbol' column is set as 'Primary Asset'. All of the other columns are set to the 'number' column type.
Now, when I run this code in a backtest beginning on 06/02/2016...

import pandas as pd


from quantopian.algorithm import attach_pipeline, pipeline_output  
from quantopian.pipeline import Pipeline  
from quantopian.pipeline.data.user_5a7367d3e957ab0012bf69a4 import symbols_6__2016 as my_dataset 

def initialize(context):  
    context.FirstDateOfBacktest = get_environment('start').date()  
    attach_pipeline(my_pipeline(context), 'my_pipeline')  

def before_trading_start(context, data):

    context.CurrentDate = get_datetime().date()

    if context.CurrentDate == context.FirstDateOfBacktest:  
        context.output = pipeline_output('my_pipeline')  
        print (context.output)  

def my_pipeline(context):  
    return Pipeline(  
        columns={  
            #'date': my_dataset.date.latest,  
            'symbol': my_dataset.symbol.latest,  
            'val1': my_dataset.val1.latest,  
            'val2': my_dataset.val2.latest,  
            'val3': my_dataset.val3.latest,  
            'val4': my_dataset.val4.latest,  
            'val5': my_dataset.val5.latest,  
            'val6': my_dataset.val6.latest,  
        },  
    )  

This is what the log shows.

2016-06-02 05:45 PRINT symbol val1 val2 val3 val4 val5 val6
Equity(2 [ARNC]) None NaN NaN NaN NaN NaN NaN
Equity(21 [AAME]) None NaN NaN NaN NaN NaN NaN
Equity(24 [AAPL]) AAPL 4.0 6.0 4.0 2.0 1.0 0.6441
Equity(25 [ARNC_PR]) None NaN NaN NaN NaN NaN NaN
Equity(31 [ABAX]) None NaN NaN NaN NaN NaN NaN
Equity(39 [DDC]) None NaN NaN NaN NaN NaN NaN
Equity(41 [ARCB]) None NaN NaN NaN NaN NaN NaN
Equity(52 [ABM]) None NaN NaN NaN NaN NaN NaN
Equity(53 [ABMD]) None NaN NaN NaN NaN NaN NaN
Equity(62 [ABT]) None NaN NaN NaN NaN NaN NaN
Equity(64 [ABX]) None NaN NaN NaN NaN NaN NaN
Equity(66 [AB]) None NaN NaN NaN NaN NaN NaN
Equity(67 [ADSK]) None NaN NaN NaN NaN NaN NaN
Equity(69 [ACAT]) None NaN NaN NaN NaN N...

Any thoughts on the correct way to pull the original data into a backtest?

Thank you

P.S. The line #'date': my_dataset.date.latest, is commented out because when it is uncommented I get the following error:
AttributeError: type object 'symbols_6__2016' has no attribute 'date'
USER ALGORITHM:28, in my_pipeline
'date': my_dataset.date.latest,
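
From what I can tell (worth verifying), the Primary Date column gets consumed for point-in-time alignment rather than being exposed as a pipeline column, which would explain that AttributeError. The date doesn't really need to be a column anyway: in before_trading_start, pipeline_output() returns just the current day's rows, indexed by asset, so the matching date is simply the current date the algo already looks up:

# sketch, reusing calls that are already in the algo above
output = pipeline_output('my_pipeline')  # today's rows, indexed by asset
today = get_datetime().date()            # the date those rows correspond to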

Hi Bryan,

Try adding dropna() to your values in the pipeline to exclude NaNs, e.g.:

'val1': my_dataset.val1.latest.dropna(),

Hope this helps.

context.output = pipeline_output('my_pipeline').dropna()  # all in one fell swoop
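
(That one works because pipeline_output() returns a plain pandas DataFrame, so the usual DataFrame.dropna() applies to it, whereas .latest is a pipeline term and has no dropna method.)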

Hi James. Thank you for taking the time to reply. Your forum posts are really good.
Using the dropna() method like this:
'val1': my_dataset.val1.latest.dropna(),

Returns the following error:

AttributeError: 'Latest' object has no attribute 'dropna'
There was a runtime error on line 10.

Will keep hacking away until this works.
Have a great Sunday!
Bryan

@Blue Seahawk and @James Villa
Thank you both. This worked!
Cheers

Breadcrumbs for the next person who wants to do something similar.

import pandas as pd


from quantopian.algorithm import attach_pipeline, pipeline_output  
from quantopian.pipeline import Pipeline  
from quantopian.pipeline.data.user_5a7367d3e957ab0012bf69a4 import symbols_6__2016 as my_dataset 

def initialize(context):  
    context.FirstDateOfBacktest = get_environment('start').date()  
    attach_pipeline(my_pipeline(context), 'my_pipeline')  

def before_trading_start(context, data):

    context.CurrentDate = get_datetime().date()

    if context.CurrentDate == context.FirstDateOfBacktest:  
        context.output = pipeline_output('my_pipeline').dropna() # all in one fell swoop  
        print (context.output)

def my_pipeline(context):  
    return Pipeline(  
        columns={  
            #'date': my_dataset.date.latest,  
            'symbol': my_dataset.symbol.latest,  
            'val1': my_dataset.val1.latest,  
            'val2': my_dataset.val2.latest,  
            'val3': my_dataset.val3.latest,  
            'val4': my_dataset.val4.latest,  
            'val5': my_dataset.val5.latest,  
            'val6': my_dataset.val6.latest,  
        },  
    )  
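
One possible refinement for that next person (a sketch only, not something used in this thread): if I'm not mistaken, the .latest factors expose a notnull() filter, so the NaN rows can be dropped inside the pipeline itself with a screen, and pipeline_output() then only returns assets that actually have Self-Serve data:

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.user_5a7367d3e957ab0012bf69a4 import symbols_6__2016 as my_dataset

def my_pipeline(context):
    # Keep only assets that have a row in the Self-Serve dataset
    has_data = my_dataset.val1.latest.notnull()
    return Pipeline(
        columns={
            'symbol': my_dataset.symbol.latest,
            'val1': my_dataset.val1.latest,
            'val6': my_dataset.val6.latest,
        },
        screen=has_data,
    )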