
Self-Serve Data - How does it work?

When you upload your own data using Self-Serve, your data is collected, processed, and surfaced in Pipeline in a point-in-time manner. In this notebook, we will explore the ways in which your data is processed, considerations you should make when creating a dataset on Quantopian, and ways to monitor the operational integrity of your dataset. The notebook is divided into 3 sections:

 1. How is Your Data Processed?
 2. Considerations When Creating Your Dataset
 3. Checks and Monitoring

Each section uses the campaign contributions dataset from the previous notebook: 'An Introduction to Self-Serve Data'.

If you haven't already done so, download and upload the campaign contributions dataset, naming the dataset campaign_contributions. Check out section 'II. Upload Via Self-Serve' of the previous notebook for guidance.

1. How is Your Data Processed?

Data that is uploaded via Self-Serve is processed in two major steps. First, the data is mapped to a standardized security identifier. Then, it is timestamped and stored in a way that prevents it from introducing inadvertent lookahead bias.

Security Identifiers

Self-Serve Data makes it easy to upload data to Quantopian and use it alongside the pricing, fundamental, and alternative data that are already available in Pipeline. This is made possible by mapping all datasets to a common set of asset identifiers. On Quantopian, assets are mapped to security identifiers (SIDs), which are part of every dataset available in Pipeline.

When a dataset is uploaded via Self-Serve Data, Quantopian runs a process that maps ticker symbols (which can vary from source to source) to a SID which is standardized across all Quantopian datasets. Once each row of the dataset is mapped to a SID, Pipeline knows how to align your data with other datasets so you can build factors and filters using your own data, Quantopian's data, or a mix of both.
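For example, once your rows are mapped to SIDs, a column from your dataset can be referenced in a Pipeline definition alongside built-in data. Below is a minimal sketch, following the same elided user_[user_ID] import convention used elsewhere in this notebook and using the column names from the campaign contributions dataset:

# from quantopian.pipeline.data.user_[user_ID] import campaign_contributions
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import USEquityPricing

def make_pipeline():
    # A column from the uploaded dataset, surfaced point-in-time by Pipeline.
    contribution_count = campaign_contributions.count.latest
    # A filter built from Quantopian's own pricing data.
    above_one_dollar = USEquityPricing.close.latest > 1.0
    return Pipeline(
        columns={'contribution_count': contribution_count},
        screen=above_one_dollar,
    )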

Timestamps

When researching or testing a data-driven trading strategy, it is important to know the time at which you would have learned about data, so that you can avoid introducing look-ahead bias in your work. Like other datasets on Quantopian, Self-Serve enables you to avoid lookahead bias by adding a timestamp datetime value to each row of new data whenever your dataset is updated. The timestamp informs Pipeline about when a data point was learned, and Pipeline only makes data available in a simulation after the timestamp has passed.

Live-updating Self-Serve Data datasets are timestamped whenever new data is added, or prior data is overwritten. When new data is added, the new rows in the dataset are created with a timestamp reflecting the date and time at which they were added. When an existing row is edited (for example, if a previously reported value is corrected), the old value will still be saved in a Quantopian database with its original timestamp. The new value will also be saved with the new timestamp and is referred to as a delta. Pipeline will surface the original data point in a simulation until it reaches the timestamp of the update, at which point it will reveal the new value.
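As a rough illustration of this behavior (with made-up values), the sketch below mimics how Pipeline chooses between an originally reported value and a later correction, based on the simulation date:

import pandas as pd

# Hypothetical history for one (sid, asof_date) pair: the value as originally
# reported, plus a correction stored as a delta with its own timestamp.
versions = pd.DataFrame({
    'value':     [3.0, 5.0],
    'timestamp': pd.to_datetime(['2018-06-01', '2018-06-15']),
})

def visible_value(simulation_dt):
    # Only versions whose timestamp has passed are candidates; the most
    # recently learned one wins.
    known = versions[versions.timestamp <= simulation_dt]
    return known.sort_values('timestamp').iloc[-1]['value']

print(visible_value(pd.Timestamp('2018-06-10')))  # 3.0 -- correction not yet known
print(visible_value(pd.Timestamp('2018-06-20')))  # 5.0 -- the delta is now visible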

Historical data uploads are treated a bit differently in Self-Serve Data. Since most datasets don't include timestamps reflecting when they were learned, we have to make assumptions about when the data might have been learned. When you create a dataset and make an initial historical data upload via Self-Serve, your data will be timestamped with a default 1-day lag, meaning each data point will be surfaced in a Pipeline simulation the day after its 'primary date' (this was the date column you selected when uploading your dataset).
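A rough sketch of that default lag (an illustration, not the internal implementation) looks like this:

import pandas as pd

# Hypothetical historical upload: only a primary date is provided per row.
historical = pd.DataFrame({
    'symbol':    ['ABT', 'ABT'],
    'asof_date': pd.to_datetime(['2017-01-01', '2017-01-02']),
    'count':     [0, 2],
})

# Default 1-day lag: each row becomes visible to Pipeline the day after its primary date.
historical['timestamp'] = historical['asof_date'] + pd.Timedelta(days=1)
print(historical)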

What Does This Look Like?

Recall that when we uploaded the campaign_contributions.csv file, the file looked something like this:

date,symbol,count,sum
2017-01-01,AA,0,0
2017-01-01,AAN,0,0
2017-01-01,AAWW,0,0
2017-01-01,ABBV,0,0
2017-01-01,ABMD,0,0
2017-01-01,ABT,0,0

And when we uploaded it, we labeled the columns:

  • Primary Date: Identifies the date for which a data point applies.
  • Primary Asset: Identifies the asset for which a data point applies.
  • Value(s): One or more value columns.

Upload Historical Data

To get a better sense of what your data looks like after it has been processed, you can import the interactive version of your dataset to query the read-only database in which it is stored. Below, we import the interactive version of the campaign_contributions dataset used in the first notebook, 'An Introduction to Self-Serve Data'.

In [3]:
# You can get your import statement more quickly by typing from quantopian.interactive.data.user_ 
# and then pressing tab to automatically fill in your UserId.

# from quantopian.interactive.data.user_[user_ID] import [dataset name]
from quantopian.interactive.data.user_ import campaign_contributions

Interactive datasets are technically Blaze expressions. Blaze expressions have a similar API to pandas, with some differences. We can view the first few rows of the dataset by using the .peek() function:

In [4]:
# View data immediately after upload:
campaign_contributions.peek()
Out[4]:
symbol count sum sid asof_date timestamp
0 ARNC 0.0 0.0 2 2017-01-01 2017-01-02
1 ARCB 0.0 0.0 41 2017-01-01 2017-01-02
2 ABMD 0.0 0.0 53 2017-01-01 2017-01-02
3 ABT 0.0 0.0 62 2017-01-01 2017-01-02
4 ACXM 0.0 0.0 110 2017-01-01 2017-01-02
5 ADM 0.0 0.0 128 2017-01-01 2017-01-02
6 AEP 0.0 0.0 161 2017-01-01 2017-01-02
7 AES 0.0 0.0 166 2017-01-01 2017-01-02
8 AET 0.0 0.0 168 2017-01-01 2017-01-02
9 AFL 0.0 0.0 185 2017-01-01 2017-01-02
10 AIG 0.0 0.0 239 2017-01-01 2017-01-02

If we compare the interactive dataset to the .csv file that we uploaded to Self-Serve, we can see a few differences:

  1. A new column sid is added, which maps each asset to its SID in the Quantopian system.
  2. The Primary Date column date is now labeled asof_date.
  3. A new column timestamp is added. This column is used to represent the time the Quantopian system learns about each datapoint (or an estimation of this time in a historical data upload). Pipeline does not surface data until the simulation date is after its timestamp.
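If you prefer working in pandas, you can convert the Blaze expression to a DataFrame (using odo, which we also use in section 3) and check these differences directly. A quick sketch:

import pandas as pd
from odo import odo

# Convert the interactive (Blaze) expression into a pandas DataFrame.
# For very large datasets, consider filtering the expression before converting.
cc = odo(campaign_contributions, pd.DataFrame)

print(cc.columns.tolist())                  # includes sid, asof_date, and timestamp
print((cc.timestamp - cc.asof_date).min())  # the lag between primary date and timestamp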

2. Considerations When Creating Your Dataset

Here are some important considerations to help ensure your data is uploaded as expected and without error:

US equities only: Currently, Self-Serve only knows how to map US equities to a SID by using a ticker symbol (and the primary date). At some point in the future, we plan to add support for more identifiers and more asset types.

Avoid lookahead data: The Primary Date column should never include values dated after the current day.

Trade dates: If you want your data to be available in a Pipeline simulation on the Primary Date (instead of the following day), subtract 1 trading day from the Primary Date column before you upload it, since it will be timestamped 1 day after the Primary Date. This is sometimes desirable if the Primary Date reflects a trade date from a broker account history (see the sketch after these considerations).

Timestamped data: If your Primary Date column is in datetime format, the 1 trading day adjustment may improperly shift the timestamp past the expected pipeline start time. Data must have an asof_date prior to midnight (23:59:59.99) to be included in pipeline the next day.

Other Formatting Considerations: Check out the help documentation for descriptions of acceptable formats for your data.
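As a rough sketch of the date handling described in the 'Trade dates' and 'Timestamped data' considerations above, using pandas business-day offsets as a stand-in for a true trading calendar and hypothetical file names:

import pandas as pd
from pandas.tseries.offsets import BDay

raw = pd.read_csv('my_trades.csv', parse_dates=['date'])  # hypothetical input file

# Shift the primary date back one trading day so the data surfaces in Pipeline
# on the original trade date rather than the following day. Note that BDay
# ignores market holidays, so a real trading calendar may be preferable.
raw['date'] = raw['date'] - BDay(1)

# Drop any time component so every asof_date falls before midnight.
raw['date'] = raw['date'].dt.normalize()

raw.to_csv('my_trades_adjusted.csv', index=False)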

Note: Your dataset, including its name and live component, cannot be edited after submission.

3. Checks and Monitoring

You can easily monitor the details and status of your historical loads and/or live updates. Let's import the interactive load_metrics dataset to view some important columns you can use to ensure a load occurs as expected.

In [5]:
# from quantopian.interactive.data.user_[user_ID] import load_metrics
from quantopian.interactive.data.user_ import load_metrics
In [6]:
# Import required modules
import pandas as pd
import numpy as np
from odo import odo
In [10]:
# View select columns of the load_metrics table
lm = odo(load_metrics[['timestamp', 'dataset', 'status', 'rows_received', 'rows_added',
                       'delta_rows_added', 'last_updated', 'time_elapsed',
                       'filenames_downloaded', 'source_last_updated', 'bytes_downloaded',
                       'db_table_size', 'error']].sort('timestamp', ascending=False),
         pd.DataFrame)

lm.groupby('dataset').head()                       # This is a way to view latest updates for each dataset
Out[10]:
timestamp dataset status rows_received rows_added delta_rows_added last_updated time_elapsed filenames_downloaded source_last_updated bytes_downloaded db_table_size error
3 2018-06-11 20:06:43.642225 campaign_contributions completed 163885 161780 0 2018-06-11 20:07:00.753502 17.253879 campaign_contributions.csv NaT 3805267 12419072 None

Use the following fields to monitor your data uploads:

  • filenames_downloaded: Name(s) of the file(s) downloaded
  • rows_received: Total number of rows downloaded from the historical upload or FTP source
  • rows_added: Number of new records added to the dataset for pipeline ingestion
  • total_rows: Total number of rows in the dataset after the load completes
  • delta_rows_added: Number of new records added to the deltas table
  • total_delta_rows: Total number of rows in the deltas table after the load completes
  • timestamp: Start time of the upload
  • time_elapsed: Number of seconds it took to process the data during upload
  • last_updated: Latest timestamp for records added to the dataset before the load completes
  • source_last_updated: File timestamp on the source FTP file (live loads only)
  • status: Displays running, empty (no data to process), failed, or completed
  • error: Error message for failed uploads

If data that already exists in a dataset is uploaded again, the new values are added to a separate 'deltas' table (explained here). Because new values are added to this table instead of overwriting old data, the deltas table helps maintain a point-in-time record of your data.

You can view the number of rows that were added to a delta table for each dataset with the 'delta_rows_added' field. It is often a good idea to check this field to make sure you aren't accidentally updating prior values.
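For example, using the lm DataFrame from the cell above, you could flag any load that wrote to the deltas table:

# Loads that modified previously uploaded values show up with delta_rows_added > 0.
unexpected_deltas = lm[lm.delta_rows_added > 0]
print(unexpected_deltas[['timestamp', 'dataset', 'delta_rows_added']])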

In [11]:
# from quantopian.interactive.data.user_[user_ID] import [dataset name]_deltas
from quantopian.interactive.data.user_ import campaign_contributions_deltas
# View delta table immediately after upload
campaign_contributions_deltas.head()
Out[11]:
symbol count sum sid asof_date timestamp

Pipeline applies a dataset's deltas table to the data it surfaces, replacing older values with newer ones once the simulation date passes each delta's timestamp.

Conclusion

Self-Serve processes any data that you upload, adding standardized security identifiers for easy comparison to other datasets and timestamps to prevent lookahead bias. There are considerations that you should make before uploading data to ensure that it is processed as you expect it to be. You can also analyze and monitor your loads via the interactive version of your data.

The full documentation of the Self-Serve Data tool can be found here.

Why Self-Serve Data?

Self-Serve Data seamlessly integrates into Pipeline and provides your strategy with completely custom signals. For a look into how this tool enriches your strategy development, check out the next notebook: 'Why Self-Serve Data?'.