Self-Serve Data is a new feature on Quantopian that allows you to bring your own data to the platform and access it directly in Research and the IDE via Pipeline.
Integration with Pipeline means you can upload a signal to the platform and use it with the rest of the tools in the Quantopian ecosystem, such as Alphalens and Pyfolio. Accessing your data in Pipeline also means you can use your signal in an algorithm, submit that algorithm to the contest, and, by extension, enter the allocation process.
This notebook will explore how you can leverage Self-Serve Data to upload a dataset, analyze it with Alphalens, and use it in a contest algorithm.
If you haven't already done so, download the campaign contributions dataset, upload it to your account, and name the dataset campaign_contributions. For guidance, check out section 'II. Upload Via Self-Serve' of the notebook in this forum post.
After a dataset is processed, it will have a corresponding information page at https://www.quantopian.com/data/user_[user_ID]/[dataset_name]. In this case, the page can be found at https://www.quantopian.com/data/user_[user_ID]/campaign_contributions. The information page provides a sample Pipeline into which your data is already incorporated as a factor; this example pipeline contains a factor called 'my_dataset'.
Pipeline Factors can be combined both with other Factors and with scalar values via any of the built-in mathematical operators (+, -, *, etc.). Factors created from uploaded data work the same way.
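To build intuition for how operator-based combination behaves, here is a minimal sketch using plain pandas Series as stand-ins for Pipeline factor columns (the tickers and values are made up; each Pipeline Factor is conceptually a per-asset column of values like these):

```python
import pandas as pd

# Hypothetical per-asset values for two factors on a single day.
momentum = pd.Series({"AAPL": 0.8, "MSFT": 0.2, "TSLA": -0.5})
value = pd.Series({"AAPL": -0.1, "MSFT": 0.6, "TSLA": 0.3})

# Built-in arithmetic operators combine the columns elementwise, and
# scalars broadcast, just as they do for Pipeline Factors:
combined = 0.5 * momentum + 0.5 * value
shifted = combined + 1.0
```

The same expression written against actual Pipeline Factors would produce a new Factor that Pipeline evaluates per asset each day.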
To show this, let's modify the sample Pipeline and analyze a combined factor called 'Score', built with the SimpleMovingAverage Factor. Let's create this Pipeline:
# First, import your uploaded dataset:
## from quantopian.pipeline.data.user_[user_ID] import [dataset name]
from quantopian.pipeline.data.user_[user_ID] import campaign_contributions
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import Fundamentals
from quantopian.research import run_pipeline
from quantopian.pipeline.factors import BusinessDaysSincePreviousEvent, SimpleMovingAverage
from quantopian.pipeline.filters import QTradableStocksUS
import pandas as pd
# Set up the Pipeline:
def make_pipeline():
    base_universe = QTradableStocksUS()

    # Factor for the number of business days since the data was last updated.
    days_since_last_update = BusinessDaysSincePreviousEvent(
        inputs=[campaign_contributions.asof_date.latest]
    )

    # A pipeline screen that ensures uploaded data is less than
    # 3 business days old.
    has_recent_update = (days_since_last_update < 3)
    universe = (has_recent_update & base_universe)

    # Factor for the 6-month (~126 trading days) simple moving average
    # of the campaign contributions count:
    score = SimpleMovingAverage(
        inputs=[campaign_contributions.count],
        window_length=126
    )

    # Factor for sector code:
    sector = Fundamentals.morningstar_sector_code.latest

    # Filter out NaNs and 0s:
    screen_null = score.notnull()
    screen_zeros = (score != 0.0)

    pipe = Pipeline(
        columns={
            'Score': score,
            'Sector': sector
        },
        screen=screen_null & screen_zeros & universe
    )
    return pipe
# Define a time range over which to run the pipeline:
start_date = '2017-01-01'
end_date = '2017-12-31'
results = run_pipeline(make_pipeline(), start_date, end_date)
results.head()
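run_pipeline returns a DataFrame indexed by (date, asset) pairs. As a rough illustration of that shape, here is a small mock with made-up values and a cross-sectional summary computed from it:

```python
import pandas as pd

# A tiny mock of run_pipeline output: rows indexed by (date, asset).
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2017-01-03", "2017-01-04"]), ["AAPL", "MSFT"]],
    names=["date", "asset"],
)
mock_results = pd.DataFrame(
    {"Score": [10.0, 12.0, 11.0, 13.0], "Sector": [311, 311, 311, 311]},
    index=idx,
)

# Cross-sectional mean Score per day:
daily_mean = mock_results["Score"].groupby(level="date").mean()
```

This kind of groupby over the date level is a quick sanity check on the factor's daily cross-section before handing the column to Alphalens.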
We'll now analyze the 'Score' factor with Alphalens, an open-source tool for analyzing the predictive ability of alpha factors. For more information on Alphalens, check out the Factor Analysis lecture.
The following is the typical workflow for analyzing a Factor with Alphalens:
# Import Alphalens:
import alphalens as al
# Retrieve the list of assets from the Pipeline output:
asset_list = results.index.levels[1].unique()

# Define the time range over which to analyze the factor:
start_date = '2017-01-01'
end_date = '2018-05-31'

# Obtain pricing data for the asset list as input to Alphalens:
prices = get_pricing(
    asset_list,
    start_date=start_date,
    end_date=end_date,
    fields='open_price'
)
# Define sector labels for factor analysis grouping:
MORNINGSTAR_SECTOR_CODES = {
    -1: 'Misc',
    101: 'Basic Materials',
    102: 'Consumer Cyclical',
    103: 'Financial Services',
    104: 'Real Estate',
    205: 'Consumer Defensive',
    206: 'Healthcare',
    207: 'Utilities',
    308: 'Communication Services',
    309: 'Energy',
    310: 'Industrials',
    311: 'Technology',
}
# First, categorize the factor by sector code and compute forward returns.
# Forward returns are the returns we would have received for holding each
# security over the periods (in days) ending on each date, as specified by
# the periods parameter.
factor_data = al.utils.get_clean_factor_and_forward_returns(
    results['Score'],
    prices=prices,
    groupby=results['Sector'],
    binning_by_group=True,
    groupby_labels=MORNINGSTAR_SECTOR_CODES,
    quantiles=5,
    periods=(10, 21, 63)
)
# Use composed factor data to create full tearsheet
al.tears.create_full_tear_sheet(factor_data, by_group=True);
In the Alphalens tearsheet above, which analyzes our 'Score' Factor, we see mediocre returns across all quintiles (see the Period Wise Return by Factor Quantile plot). Since this factor is just a moving average of the campaign contributions count, perhaps we can improve its predictive value by incorporating additional information.
For a more detailed review of this factor, refer to the notebook attached to this post by Lucy Wu.
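One common way to fold in additional information is to standardize each signal cross-sectionally and sum the z-scores, so that no signal dominates purely because of its scale. Here is a minimal pandas sketch with made-up signal values; on the platform, the Factor method zscore() plays the role of the hypothetical helper below:

```python
import pandas as pd

# Hypothetical per-asset values for two signals on a single day:
contributions = pd.Series({"AAPL": 120.0, "MSFT": 80.0, "TSLA": 40.0})
sentiment = pd.Series({"AAPL": 0.2, "MSFT": 0.5, "TSLA": -0.1})

def zscore(s: pd.Series) -> pd.Series:
    """Standardize a cross-section to zero mean and unit (sample) std."""
    return (s - s.mean()) / s.std()

# Equal-weight combination of the standardized signals:
improved = zscore(contributions) + zscore(sentiment)
```

Whether such a combination actually improves predictive value is an empirical question, which is exactly what re-running the Alphalens workflow on the new factor would answer.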
Self-Serve Data allows you both to upload historical data and to live-update your data on a regular basis. Because of this live-updating capability, your data can be used in algorithms you submit to the daily Quantopian Contest.
When you add a self-serve dataset to your account, you are asked if you want to set up a nightly update process for the dataset:
If you want to send live updates to your dataset, you need to set up an FTP server or host a file somewhere (like Dropbox or Google Sheets) and keep it up to date. Files are checked for new data on a nightly basis. You can read more about live-updating datasets in the help documentation.
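The hosted file is simply the same CSV layout as your original upload, extended with the newest rows. A minimal sketch of producing such a file with pandas follows; the column names here mirror the campaign_contributions example and are assumptions about your particular schema:

```python
import io
import pandas as pd

# Hypothetical rows for the latest trading day, matching the layout of the
# originally uploaded file (a primary date column, a symbol column, and one
# or more value columns):
update = pd.DataFrame({
    "date": ["2018-06-01", "2018-06-01"],
    "symbol": ["AAPL", "MSFT"],
    "count": [12, 7],
})

# Serialize to CSV; in practice you would write this to the hosted file
# (FTP, Dropbox, Google Sheets, etc.) that gets checked nightly.
buf = io.StringIO()
update.to_csv(buf, index=False)
csv_text = buf.getvalue()
```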
If a live connection is set up, the file posted at the host will be downloaded overnight after each trading day, between 7 and 10 AM UTC (Tue-Fri), and compared against the existing dataset records. You can learn more about how this works by reading through the Self-Serve Data - How Does It Work? notebook.
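Conceptually, comparing the downloaded file against existing records amounts to an anti-join: keep only the rows not already ingested. A rough pandas sketch of that idea (an illustration, not Quantopian's actual implementation):

```python
import pandas as pd

# Rows already in the dataset:
existing = pd.DataFrame({
    "date": ["2018-05-31"], "symbol": ["AAPL"], "count": [10],
})

# Tonight's downloaded file: repeats the old row and adds a new one.
downloaded = pd.DataFrame({
    "date": ["2018-05-31", "2018-06-01"],
    "symbol": ["AAPL", "AAPL"],
    "count": [10, 12],
})

# Anti-join: keep only rows that are not already present.
merged = downloaded.merge(
    existing, on=["date", "symbol", "count"], how="left", indicator=True
)
new_rows = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
```

Only the genuinely new rows would then need to be appended to the dataset's history.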
Once you've uploaded your dataset and configured live updates, clone the template algorithm in this thread. Follow the TO-DOs to incorporate your data and develop the algorithm.
Refer to the Writing a Contest Algorithm tutorial to learn more about the contest criteria. Use the notebook in Lesson 11 to test whether your algorithm meets all of the criteria.