Length inconsistency of Pipeline output using self-serve data

My self-serve data contains around 400 assets and has 3 columns: ['date', 'assets', 'alpha'].
When I add my alpha to a pipeline and run it as follows, it returns a dataframe of shape (354, 1), which makes sense.
res1 = run_pipeline(make_pipeline(),'2019-01-01','2019-01-01').dropna()

But when I check res1.index.levels[1].nunique(), the result is 9359, which is weird.

Can anyone help explain the inconsistency in the number of unique assets?

2 responses

The output of a pipeline is a dataframe. This dataframe is indexed by date and security (just security within an algo). A key concept is that the index is fixed. It contains all the securities which Quantopian tracks on any given date. This is the 'Quantopian universe'; it contains roughly 9000+ securities and varies by date. One can add a 'screen' to a pipeline to filter out specific securities, but one can't add securities. The index (i.e. the securities, or rows) is fixed.

One can get a sense of how this works by running a pipeline with no columns (i.e. factors) and no screen. Something like this in a notebook:

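from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
# With no columns and no screen, only the fixed index comes back
base_pipeline = Pipeline(columns=None, screen=None)
result = run_pipeline(base_pipeline, '1-1-2018', '1-15-2018')
display(result)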

The result will be a dataframe with no columns and a multi-index. Level 0 of the index will be the trading days between '1-1-2018' and '1-15-2018'. Level 1 of the index will be all the security objects which Quantopian tracks for the given date.

When importing self-serve data, all one is doing is adding columns to the pipeline dataframe. One never adds rows or securities. The import logic simply finds the dataframe index entry specified by the date and security in the data file, and assigns the data from the file to the associated column. If a security is not in the self-serve data file, the associated column will contain nan for that security.
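The column-only behaviour described above can be imitated outside Quantopian with plain pandas. This is only a rough sketch, not the actual import logic: the tickers, the dates, and the use of `reindex` are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the fixed pipeline index: one date, five tracked securities.
date = pd.Timestamp("2019-01-02")
universe = ["AAA", "BBB", "CCC", "DDD", "EEE"]
index = pd.MultiIndex.from_product([[date], universe], names=["date", "asset"])

# Hypothetical uploaded file: two tracked securities plus one ("ZZZ") that is not tracked.
uploaded = pd.DataFrame(
    {
        "date": [date, date, date],
        "asset": ["BBB", "DDD", "ZZZ"],
        "alpha": [0.5, -1.2, 3.0],
    }
).set_index(["date", "asset"])

# Aligning the file to the fixed index fills a column; it never adds rows.
# Securities missing from the file get NaN, and the untracked "ZZZ" row is discarded.
result = uploaded.reindex(index)
print(result)
```

In this toy case the result keeps all five rows of the fixed index, with values only for the two securities present in both the index and the file.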

So, unless a screen is added to a pipeline, the number of unique securities (i.e. level 1 values) will be the number of securities in the Quantopian 'universe'. The inconsistency noted arises because the first query used the dropna() method. Since data was provided for only a subset of securities, any security not in the file has a value of nan in the imported column. Those rows were therefore dropped, leaving just the 354 securities. Note that dropna() removes rows but does not prune index.levels, so index.levels[1] still lists every security from the original index.
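To see the two counts side by side, here is a small pandas-only sketch with made-up tickers and values (nothing here comes from the actual pipeline output). It relies on general pandas behaviour: filtering rows keeps the original levels of a MultiIndex, while `get_level_values` reflects only the rows that remain.

```python
import numpy as np
import pandas as pd

# Hypothetical pipeline-like output: one date, five securities, alpha for only two.
dates = [pd.Timestamp("2019-01-02")]
universe = ["AAA", "BBB", "CCC", "DDD", "EEE"]
index = pd.MultiIndex.from_product([dates, universe], names=["date", "asset"])
res = pd.DataFrame({"alpha": [np.nan, 0.5, np.nan, -1.2, np.nan]}, index=index)

res1 = res.dropna()

rows = res1.shape[0]                                # rows left after dropna: 2
stale = res1.index.levels[1].nunique()              # levels are not pruned: still 5
actual = res1.index.get_level_values(1).nunique()   # securities actually present: 2

print(rows, stale, actual)
```

So to count the securities that survive the dropna(), res1.index.get_level_values(1).nunique() is the safer check than res1.index.levels[1].nunique().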

Very Clear, thanks Dan!