
Introduction to Self-Serve Data

Self-Serve Data is a new feature that allows you to bring your own data to Quantopian and access it directly in Research and the IDE. You can now include custom datasets or signals in your algorithms for use in the contest and, by extension, the allocation process.

Self-Serve Data allows you to upload historical data as well as live-update your data on a regular basis. Uploaded data is then accessed in Pipeline.

Getting your own data in Pipeline is a 3-step process:

  1. Format your data.
  2. Upload via Self-Serve Data.
  3. Import and use in Pipeline.

This notebook will walk you through each step using an example campaign contributions dataset.

I. Format Your Data

The first step to uploading your data to Quantopian is to get it into the right format. Currently, the Self-Serve Data feature supports reduced datasets, which means the data needs to be formatted in a .csv file that has a single row of data per asset per day.

Below is a very simple example of a .csv file that can be uploaded to Quantopian. The first row of the file must be a header, and columns are separated by the comma character (,):

date,symbol,signal1,signal2
2014-01-01,TIF,1204.5,0
2014-01-02,TIF,1225,0.5
2014-01-02,AAPL,401.3,-0.1
2014-01-03,TIF,1234.5,0
2014-01-06,TIF,1246.3,0.5
2014-01-06,AAPL,375.0,0.8
2014-01-07,TIF,1227.5,0

For the rest of this notebook, we will use a campaign contributions dataset. To follow along, download the dataset (3.8MB) via this link.
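Before uploading, it can help to verify locally that your file meets these requirements. Below is a minimal sketch using pandas, assuming the downloaded file was saved as campaign_contributions.csv (the local filename is an assumption):

import pandas as pd

# Load the downloaded file; the local filename is an assumption.
contributions = pd.read_csv('campaign_contributions.csv', parse_dates=['date'])

# Preview the first few rows and confirm the expected header is present.
print(contributions.head())

# Self-Serve Data expects at most a single row per asset per day.
duplicates = contributions.duplicated(subset=['date', 'symbol'])
assert not duplicates.any(), 'Found duplicate asset-day rows'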

II. Upload Via Self-Serve Data

Once your data is properly formatted and saved as a .csv file, you can upload it to Quantopian. To upload your data, navigate to the Data tab on your account page and click "Add Dataset".

[Screenshots: Data Page; Upload Data]

You should now see a popup window asking you to name your dataset. For this example, let's call our dataset campaign_contributions (this doesn't have to be the name of your file). Select the grey question mark above the text box for a list of naming requirements.

[Screenshot: Name Your Data]

Next, you will be prompted to upload your .csv file of historical data. You can click or drag-and-drop to upload the file from your local machine. Select the grey checkmark above the text box for a list of data requirements.

[Screenshot: Upload Historical Data]

You will then be prompted to assign data types to the fields in your file. This is needed so that Pipeline can properly interact with the data. Make sure to define the 'Primary Date' and 'Primary Asset' fields, and declare the data types of the other fields. For this example, set date as the Primary Date column, symbol as the Primary Asset column, and count and sum as numbers (if you don't see these columns, scroll to the right).

[Screenshots: Upload Historical Data]
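As a quick sanity check before declaring types, you can inspect how each column parses locally; a short sketch with pandas (the filename is again an assumption):

import pandas as pd

contributions = pd.read_csv('campaign_contributions.csv', parse_dates=['date'])

# date should parse as a datetime (Primary Date), symbol as a string
# (Primary Asset), and count and sum as numeric types.
print(contributions.dtypes)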

Finally, you will be asked if you want to set up a nightly update process for the dataset. If you want to live update a dataset, you need to host a file somewhere (like Dropbox or Google Sheets) and keep it up to date. Files are checked for new data on a nightly basis. You can read more about live updating datasets in the help documentation. For this example, we are only going to use a historical dataset so you can select "No Live Data" from the dropdown, then click 'Submit':

[Screenshot: Set Up Live Data]
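If you do opt into live updates, the hosted file simply needs to gain new rows over time. Below is a rough sketch of what a nightly job might look like, assuming a folder synced to your file host and a hypothetical fetch_todays_contributions() helper that produces the day's rows:

import pandas as pd

def fetch_todays_contributions():
    # Hypothetical stand-in for whatever process generates your new rows.
    return pd.DataFrame(
        [{'date': '2018-05-01', 'symbol': 'AAPL', 'count': 1, 'sum': 500.0}]
    )

# Path inside a folder synced to your file host (e.g. Dropbox); assumed here.
hosted_path = 'campaign_contributions.csv'

existing = pd.read_csv(hosted_path)
updated = pd.concat([existing, fetch_todays_contributions()], ignore_index=True)
updated.to_csv(hosted_path, index=False)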

Note: Your dataset's configuration, including its name and live component (or lack thereof), cannot be edited after submission. However, you can create a new dataset if you want to change one of these properties.

If your file uploaded successfully, you should see a screen that looks like this:

[Screenshot: Successful Upload]

If the upload failed for some reason, please contact us for help by sending an email to dataupload@quantopian.com.

After adding your dataset, the system will take a few minutes to process your data. It typically takes about 15 minutes after uploading a dataset before it can be accessed in Pipeline. You can monitor the status by checking the load_metrics data in Research.
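For example, in Research you might inspect load_metrics along these lines. Note that the import below is an assumption about the module path; your Data tab and the help documentation have the canonical form:

# Assumed import path; replace user_ with the full module name shown on
# your Data tab. The exact location of load_metrics may differ.
from quantopian.interactive.data.user_ import load_metrics

# Peek at recent load attempts for your datasets.
load_metrics.peek()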

III. Import and Use in Pipeline

Once your data has been processed (~15 minutes after you upload it), you must restart your notebook before you can access it. To restart your notebook, click the "Run" dropdown in the top-right corner of this screen and select "Restart" from the list of options. You can then import and use your dataset in Pipeline in either Research or the IDE. After a dataset is processed, it will have a corresponding information page at https://www.quantopian.com/data/user_[user_ID]/[dataset_name]. In this case, the page can be found at https://www.quantopian.com/data/user_[user_ID]/campaign_contributions. The information page gives us a sample Pipeline that we can use to get started:

[Screenshot: Data Page]

Let's run the corresponding Pipeline that gets the latest campaign contribution count for each security, every day (dropping null values):

In [1]:
# Pipeline code from /data goes here
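The cell above stands in for the sample code generated on your dataset's /data page. The generated code may differ, but for this dataset it plausibly looks like the following sketch (as elsewhere, replace user_ with the full module name shown on your Data tab):

from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
from quantopian.pipeline.data.user_ import campaign_contributions

# Latest reported contribution count for each asset, each day.
count_latest = campaign_contributions.count.latest

pipe = Pipeline(
    columns={'count': count_latest},
    # Drop assets with no contribution data.
    screen=count_latest.notnull(),
)

df = run_pipeline(pipe, '2017-01-01', '2018-05-01')
df.head()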

Let's break down what's happening in the sample so that we can create our own custom Pipelines.

Self-Serve Data datasets that you add are available in a module that is identified with your unique UserId and is only accessible by you. You can find the import statement for your dataset by navigating to your Data tab and clicking on the name of your dataset.

Note that you must replace the import statement in the next cell with the line you got from your Data tab.

In [1]:
# from quantopian.pipeline.data.user_[user_ID] import [dataset name]
from quantopian.pipeline.data.user_ import campaign_contributions

Alternatively, you can get your import statement more quickly by typing from quantopian.pipeline.data.user_ and then pressing tab to automatically fill in your UserId. You can then type import and press tab again to pull up your list of datasets.

Simple Pipeline Example

Now that you have imported your dataset, you can use it like any other Pipeline dataset. Let's create a simple Pipeline that gets the latest number (the count column in the file we uploaded) and total dollar amount (the sum column) of campaign contributions for each company in the QTradableStocksUS.

In [2]:
from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
from quantopian.pipeline.filters import QTradableStocksUS

# Latest contribution count and total dollar amount for each asset.
num_contributions = campaign_contributions.count.latest
sum_contributions = campaign_contributions.sum.latest

pipe = Pipeline(
    columns={
        'num_contributions': num_contributions,
        'sum_contributions': sum_contributions,
    },
    # Only return results for assets in the QTradableStocksUS that have a non-null
    # contribution count and sum.
    screen=QTradableStocksUS() & num_contributions.notnull() & sum_contributions.notnull(),
)

# Run the pipeline starting in 2017.
df = run_pipeline(pipe, '2017-01-01', '2018-05-01')
In [24]:
# Preview the results.
df[df.num_contributions != 0].head()
Out[24]:
num_contributions sum_contributions
2017-02-02 00:00:00+00:00 Equity(5763 [FCFS]) 1.0 500.0
2017-02-03 00:00:00+00:00 Equity(28378 [AAWW]) 1.0 2000.0
2017-02-07 00:00:00+00:00 Equity(5310 [NI]) 3.0 7000.0
Equity(6068 [PNC]) 16.0 40500.0
Equity(23221 [KND]) 3.0 12500.0

CustomFactor Example

We can also use the data in built-in and custom Pipeline factors. Here's a slightly more complex Pipeline that computes the total dollar amount contributed over the last 63 trading days (~3 months).

In [17]:
from quantopian.pipeline import Pipeline, CustomFactor
from quantopian.research import run_pipeline
from quantopian.pipeline.filters import QTradableStocksUS

# CustomFactor used to compute the total amount contributed in the last 3 months.
class RollingSum(CustomFactor):
    window_length=63
    def compute(self, today, asset_ids, out, values):
        out[:] = values.sum(axis=0)

# Total amount contributed by each asset in the last 3 months.
quarterly_contributions_sum = RollingSum(inputs=[campaign_contributions.sum])

pipe = Pipeline(
    columns={
        'quarterly_contributions_sum': quarterly_contributions_sum,
    },
    # Only return results for assets in the QTradableStocksUS that have a non-null
    # rolling contribution sum.
    screen=QTradableStocksUS() & quarterly_contributions_sum.notnull(),
)

# Run the pipeline starting in 2017.
df_rolling_sum = run_pipeline(pipe, '2017-01-01', '2018-05-01')

# Display the first few rows.
df_rolling_sum.head()
Out[17]:
quarterly_contributions_sum
2017-04-03 00:00:00+00:00 Equity(2 [ARNC]) 2000.0
Equity(41 [ARCB]) 1000.0
Equity(53 [ABMD]) 0.0
Equity(62 [ABT]) 5000.0
Equity(110 [ACXM]) 0.0
In [19]:
# Preview the results.
df_rolling_sum[df_rolling_sum.quarterly_contributions_sum != 0].head()
Out[19]:
quarterly_contributions_sum
2017-04-03 00:00:00+00:00 Equity(2 [ARNC]) 2000.0
Equity(41 [ARCB]) 1000.0
Equity(62 [ABT]) 5000.0
Equity(128 [ADM]) 13500.0
Equity(161 [AEP]) 15500.0

Backtesting

Since the Pipeline API works in both Research and the IDE (backtesting), your custom data can be used in Pipeline in an algorithm in the same way. See lesson 12 of the Pipeline Tutorial for more information on moving a Pipeline from Research to Backtesting.
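As a reference point, a minimal algorithm skeleton using this dataset might look like the sketch below; the attached pipeline name is arbitrary, and user_ again stands in for your own module name:

from quantopian.algorithm import attach_pipeline, pipeline_output
from quantopian.pipeline import Pipeline
from quantopian.pipeline.filters import QTradableStocksUS
from quantopian.pipeline.data.user_ import campaign_contributions

def make_pipeline():
    num_contributions = campaign_contributions.count.latest
    return Pipeline(
        columns={'num_contributions': num_contributions},
        screen=QTradableStocksUS() & num_contributions.notnull(),
    )

def initialize(context):
    # Attach the pipeline so it is computed ahead of each trading day.
    attach_pipeline(make_pipeline(), 'campaign_contributions')

def before_trading_start(context, data):
    # Retrieve the day's pipeline output as a pandas DataFrame.
    context.output = pipeline_output('campaign_contributions')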

Conclusion

With the Self-Serve Data tool, you can upload your own data to Quantopian and access it in Research and algorithms via Pipeline. You can add historical data, update it on a nightly basis, and build factors that take your data as input.

The full documentation of the Self-Serve Data tool can be found here.

How Does it Work?

In order to accurately represent your data in Pipeline and avoid lookahead bias, your data is collected, stored, and surfaced in a point-in-time manner. You can learn more about how this works, as well as considerations to keep in mind when creating your dataset, in the next notebook: Self-Serve Data - How does it Work?