Domains

Today, we announced the addition of global equity pricing and fundamentals data to Quantopian. With the addition of international equity data comes a new Pipeline API feature called domains. In this notebook, we introduce the concept of domains and explain how they work with examples.

Outline

  • Revisit the problem that the Pipeline API was designed to solve.
  • Introduce the concept of domains.
  • Demonstrate how the domain of a pipeline affects the inputs to pipeline computations.
  • Introduce the concepts of 'generic' datasets and domain inference.

TL;DR

  • The Pipeline API now supports international pricing and fundamental data.
  • You can change the market that a Pipeline executes on using new "domain" objects.
  • New "generic" datasets make it easy to run the same pipeline on multiple domains.
  • Use the new EquityPricing dataset instead of USEquityPricing for Pipelines that you want to run on non-US domains.

Background

The Pipeline API is a tool that allows you to define computations over a universe of assets and a period of time. In the context of the quant equity workflow, you can use the Pipeline API to construct a tradable universe and to compute alpha factors.

The computations defined by the Pipeline API operate on two-dimensional tables of financial data. These tables contain a row for each trading day of a particular market, and they usually contain a column for every asset on that market. For example:

date        AAPL  ARCN  ...  MSFT  TSLA
2014-01-02  5.5   2.0   ...  5.5   4.23
2014-01-03  5.6   2.1   ...  5.6   4.18
...         5.7   2.2   ...  5.7   3.96
2014-12-31  5.8   2.3   ...  5.8   4.25

Until today, Quantopian only provided US Equity data, so the rows of pipeline inputs were always implicitly aligned to the US trading calendar (specifically, the NYSE calendar), and the columns of pipeline inputs always corresponded to US equities.

With the addition of global markets to the Quantopian platform, we needed to extend the Pipeline API in two ways:

  1. We needed to be able to run Pipelines over data aligned to calendars other than the US trading calendar.
  2. We needed to be able to run Pipelines on trading universes other than US Equities.

Domains are a new Pipeline API feature designed to satisfy both of these needs.
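
To make these two requirements concrete, here is a minimal pure-Python sketch of aligning raw observations to a calendar and a set of assets. It is not how the Pipeline API is implemented: the function `align_inputs`, the sample sessions, and the prices are all made up for illustration.

```python
from datetime import date


def align_inputs(raw, sessions, assets):
    """Build one row per session and one column per asset.

    `raw` maps (session, asset) -> value; missing entries become None.
    """
    return [[raw.get((day, asset)) for asset in assets] for day in sessions]


# Made-up observations for assets in two different markets.
raw = {
    (date(2015, 10, 9), "AAPL"): 112.1,
    (date(2015, 10, 12), "AAPL"): 111.6,
    (date(2015, 10, 9), "KAR"): 4.0,
}

# Two hypothetical calendars: the second one treats October 12 as a holiday.
us_sessions = [date(2015, 10, 9), date(2015, 10, 12)]
ca_sessions = [date(2015, 10, 9)]

us_table = align_inputs(raw, us_sessions, ["AAPL"])
ca_table = align_inputs(raw, ca_sessions, ["KAR"])
```

The same raw data produces tables with different rows and different columns, depending on which calendar and which assets we align it to.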

What Are Domains?

Domains are a new kind of object in the Pipeline API. You can pass a domain when constructing a pipeline to change the inputs that will be processed by the pipeline. Concretely, the domain of a pipeline controls two things:

  1. The calendar to which the pipeline's input rows are aligned.
  2. The set of assets to which the pipeline's input columns are aligned.

There are currently 21 domains available for use on the Quantopian platform, corresponding to the 21 countries supported by this release. Each new domain has two components:

  1. An ISO 3166 country code, which determines the assets associated with the domain.
  2. A trading calendar, which defines how we align daily data for the domain. For countries with more than one equity exchange, we've chosen the calendar of the largest exchange within that country.
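
As a rough mental model, a domain can be pictured as a small record that pairs a country code with a calendar name. The sketch below is a toy stand-in, not the real domain class: `ToyDomain`, `asset_countries`, and `assets_for` are hypothetical names, and the symbol-to-country mapping is invented. The exchange codes XNYS and XTSE are the ones that appear in this notebook's outputs.

```python
from collections import namedtuple

# Toy stand-in for a domain: a country code plus the name of its calendar.
ToyDomain = namedtuple("ToyDomain", ["country_code", "calendar_name"])

US = ToyDomain("US", "XNYS")
CA = ToyDomain("CA", "XTSE")

# Hypothetical asset master: symbol -> country code.
asset_countries = {"AAPL": "US", "MSFT": "US", "KAR": "CA", "RST": "CA"}


def assets_for(domain):
    """Return the symbols whose country matches the domain, sorted by name."""
    return sorted(
        symbol
        for symbol, country in asset_countries.items()
        if country == domain.country_code
    )
```

In this toy model, `assets_for(CA)` selects only the Canadian symbols, which mirrors how the country code of a domain determines the columns of a pipeline's inputs.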

For the mathematically-inclined, the name "domain" refers to the mathematical concept of the domain of a function, which is the set of potential inputs to a function. Though domains currently only control

For more information about the design of domains, see the public design document on GitHub.

Working with Domains

Domains are regular Python objects. The currently-supported domains are importable from quantopian.pipeline.domain.

Each country's domain is named **_EQUITIES, with ** replaced by the country's ISO 3166 country code.

In [1]:
from quantopian.pipeline.domain import (
    AT_EQUITIES,  # Austria
    AU_EQUITIES,  # Australia
    BE_EQUITIES,  # Belgium
    CA_EQUITIES,  # Canada
    CH_EQUITIES,  # Switzerland
    DE_EQUITIES,  # Germany
    DK_EQUITIES,  # Denmark
    ES_EQUITIES,  # Spain
    FI_EQUITIES,  # Finland
    FR_EQUITIES,  # France
    GB_EQUITIES,  # Great Britain
    HK_EQUITIES,  # Hong Kong
    IE_EQUITIES,  # Ireland
    IT_EQUITIES,  # Italy
    JP_EQUITIES,  # Japan
    NL_EQUITIES,  # Netherlands
    NO_EQUITIES,  # Norway
    NZ_EQUITIES,  # New Zealand
    PT_EQUITIES,  # Portugal
    SE_EQUITIES,  # Sweden
    US_EQUITIES,  # United States
)

# The string representation for each domain shows the ISO code for its country and for the exchange
# that defines its calendar.
US_EQUITIES
Out[1]:
EquityCalendarDomain('US', 'XNYS')

The domain of a pipeline defines the calendar and assets to use as input to each of its computations. A good way to see what this looks like is to run an empty pipeline on different domains.

To specify the domain of a pipeline, we pass domain as a named argument to the Pipeline constructor. Here's what it looks like to run a Pipeline on the Canadian equity domain:

In [2]:
from quantopian.pipeline import Pipeline
from quantopian.pipeline.domain import CA_EQUITIES
from quantopian.research import run_pipeline

pipe_ca = Pipeline(columns={}, domain=CA_EQUITIES)

df_ca = run_pipeline(pipe_ca, '2015-01-01', '2016-01-01')
df_ca.head()
Out[2]:
2015-01-02 00:00:00+00:00 Equity(1178883868150594 [KAR])
Equity(1178884003878983 [PNC.A])
Equity(1178884038411085 [RST])
Equity(1178887761843025 [IES])
Equity(1178888179962698 [KOT])

The large integer in each Equity's string representation is the security identifier (SID) for that equity. New international equities have long SIDs to avoid collisions with previously existing US SIDs.

For comparison, here's what an empty Pipeline looks like with the US_EQUITIES domain:

In [3]:
from quantopian.pipeline import Pipeline
from quantopian.pipeline.domain import US_EQUITIES
from quantopian.research import run_pipeline

pipe_us = Pipeline(columns={}, domain=US_EQUITIES)

df_us = run_pipeline(pipe_us, '2015-01-01', '2016-01-01')
df_us.head()
Out[3]:
2015-01-02 00:00:00+00:00 Equity(2 [ARNC])
Equity(21 [AAME])
Equity(24 [AAPL])
Equity(25 [ARNC_PR])
Equity(31 [ABAX])

The difference in equities is obvious. The calendar difference is a little more subtle, especially between Canada and the United States, which have very similar holiday schedules. To see the difference in trading calendars, let's focus on one asset in each market in the month of October.

In [4]:
df_ca.loc[(slice('2015-10-08', '2015-10-14'), 1178883868150594), :]
Out[4]:
2015-10-08 00:00:00+00:00 Equity(1178883868150594 [KAR])
2015-10-09 00:00:00+00:00 Equity(1178883868150594 [KAR])
2015-10-13 00:00:00+00:00 Equity(1178883868150594 [KAR])
2015-10-14 00:00:00+00:00 Equity(1178883868150594 [KAR])
In [5]:
df_us.loc[(slice('2015-10-08', '2015-10-14'), 24), :]
Out[5]:
2015-10-08 00:00:00+00:00 Equity(24 [AAPL])
2015-10-09 00:00:00+00:00 Equity(24 [AAPL])
2015-10-12 00:00:00+00:00 Equity(24 [AAPL])
2015-10-13 00:00:00+00:00 Equity(24 [AAPL])
2015-10-14 00:00:00+00:00 Equity(24 [AAPL])

October 12th, 2015 (Monday) appears in the output of the US_EQUITIES pipeline but not in the output of the CA_EQUITIES pipeline. This happens because Canadian markets have a holiday (Canadian Thanksgiving) on the second Monday of October.
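
We can check the holiday arithmetic with the standard library alone. The sketch below computes the second Monday of October and lists the weekdays in the range we sliced above. The helper names are hypothetical, and treating every non-holiday weekday as a trading day is a simplification of a real exchange calendar.

```python
from datetime import date, timedelta


def second_monday_of_october(year):
    """Return the date of the second Monday in October of `year`."""
    first = date(year, 10, 1)
    # Days until the first Monday (weekday() == 0 means Monday).
    offset = (0 - first.weekday()) % 7
    return first + timedelta(days=offset + 7)


def weekdays_between(start, end, holidays=()):
    """List Monday-Friday dates in [start, end], skipping `holidays`."""
    days = []
    current = start
    while current <= end:
        if current.weekday() < 5 and current not in holidays:
            days.append(current)
        current += timedelta(days=1)
    return days


thanksgiving = second_monday_of_october(2015)
ca_days = weekdays_between(date(2015, 10, 8), date(2015, 10, 14), {thanksgiving})
us_days = weekdays_between(date(2015, 10, 8), date(2015, 10, 14))
```

Here `thanksgiving` is October 12th, 2015, and the two lists contain the same dates as the CA_EQUITIES and US_EQUITIES outputs above.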

Defining Output Data

So far we've only seen examples of running pipelines with empty outputs. Of course, most Pipeline API users don't just want empty outputs: they want to compute things!

If we add output columns to a pipeline with a domain, then the values of the outputs will be computed using data from that domain:

In [6]:
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import EquityPricing
from quantopian.pipeline.factors import Returns
from quantopian.pipeline.domain import CA_EQUITIES
from quantopian.research import run_pipeline

pipe_ca_with_data = Pipeline(
    {
        # 5-day returns over the most recent five Toronto Stock Exchange trading days.
        'returns_5d': Returns(window_length=5),
        'volume': EquityPricing.volume.latest,
    }, 
    domain=CA_EQUITIES,
    screen=EquityPricing.volume.latest > 0,
)
df_ca_with_data = run_pipeline(pipe_ca_with_data, '2015-01-01', '2016-01-01')
df_ca_with_data.head()
Out[6]:
returns_5d volume
2015-01-02 00:00:00+00:00 Equity(1178883868150594 [KAR]) 0.053459 500.0
Equity(1178884038411085 [RST]) 1.000000 113300.0
Equity(1178888333447512 [ZLU]) -0.009736 45740.0
Equity(1178892628414550 [CET]) 0.032385 10555.0
Equity(1178892676257367 [BAM.PRE]) -0.014571 520.0
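
One consequence of computing on a domain's calendar is that a window of N rows can span different calendar dates in different markets. The toy sketch below illustrates this with the October 2015 sessions shown earlier, plus October 7th (a Wednesday, assumed here to be a trading day in both markets). `trailing_window` is a hypothetical helper. It does not reproduce how the Returns factor computes its values or exactly which sessions a pipeline reads for a given output date.

```python
from datetime import date

# Sessions taken from the October 2015 outputs earlier in this notebook,
# plus October 7 (assumed to be a trading day in both markets).
ca_sessions = [date(2015, 10, d) for d in (7, 8, 9, 13, 14)]
us_sessions = [date(2015, 10, d) for d in (7, 8, 9, 12, 13, 14)]


def trailing_window(sessions, end, length):
    """Return the last `length` sessions up to and including `end`."""
    idx = sessions.index(end)
    return sessions[max(0, idx - length + 1): idx + 1]


ca_window = trailing_window(ca_sessions, date(2015, 10, 14), 5)
us_window = trailing_window(us_sessions, date(2015, 10, 14), 5)
```

Both windows contain five sessions ending on October 14th, but the Canadian window starts on October 7th while the US window starts on October 8th, because the Canadian calendar skips October 12th.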

Review

  • Domains control the assets and calendar over which a Pipeline executes.
  • Domains can be imported from quantopian.pipeline.domain.
  • Domains are passed to Pipeline objects via the new optional domain parameter.

Generic Datasets

Sharp-eyed users familiar with the Pipeline API may have noticed another new API feature in the previous example: the EquityPricing dataset. Before today, users who wanted pricing data in their pipelines used USEquityPricing, which has open, high, low, close, and volume columns containing daily price/volume summaries. These columns extend naturally to any market for which we have daily pricing data, but the dataset had a US-specific name because we only supported US data when it was created.

We could have created separate datasets for every new market (e.g., CAEquityPricing, JPEquityPricing, etc.), but that would have required creating separate copies of every price-based Factor (like Returns in the example above), even though the business logic for each copy would have been the same. Having separate datasets for each market also would have made it hard to convert a Pipeline from one domain to another, a use-case we wanted to support well. The pattern of having datasets that naturally extend to multiple markets isn't unique to pricing data. The new FactSet Fundamentals dataset, for example, also generalizes naturally over many countries.

On the other hand, some pipeline expressions really only make sense to use on a particular market. Some data vendors only cover specific countries, for example, and some expressions like the QTradableStocksUS have logic that's specific to a particular market. In these cases, it would be confusing at best to support all possible domains.

To solve these problems, we've updated the Pipeline API to distinguish between two kinds of pipeline datasets: generic and specialized.

  • Generic datasets (as well as factors, filters, and classifiers derived from them) can be used on any domain. When we execute a pipeline that depends on a generic dataset, we fill in the appropriate data based on the domain of the pipeline.
  • Specialized datasets (as well as expressions derived from them) are only available on a specific domain. If you run a pipeline that requires a specialized dataset on a different domain, we raise an error explaining the incompatibility.

As of today, there are two generic datasets in the Pipeline API:

  • quantopian.pipeline.data.EquityPricing
  • quantopian.pipeline.data.factset.Fundamentals

Currently, all other datasets, including USEquityPricing, Morningstar Fundamentals, premium datasets, and self-serve datasets, are specialized to the US_EQUITIES domain.

Generic datasets make it easy to define a Pipeline that can be reused across multiple domains.

In [7]:
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import factset, EquityPricing
from quantopian.pipeline.domain import JP_EQUITIES, HK_EQUITIES
from quantopian.research import run_pipeline

def make_pipeline(domain):
    columns = {
        'mcap': factset.Fundamentals.mkt_val.latest,
        'close': EquityPricing.close.latest,
    }
    return Pipeline(columns, domain=domain)


df_jp = run_pipeline(make_pipeline(JP_EQUITIES), '2015-01-15', '2016-01-15')
df_hk = run_pipeline(make_pipeline(HK_EQUITIES), '2015-01-15', '2016-01-15')
In [8]:
df_jp.head()
Out[8]:
close mcap
2015-01-15 00:00:00+00:00 Equity(1178883450164305 [4118]) 699.0 2.065980e+11
Equity(1178883465819716 [7458]) 3255.0 1.711080e+11
Equity(1178883518248535 [9438]) 1680.0 2.699310e+10
Equity(1178883534376010 [3515]) 494.0 2.516390e+09
Equity(1178883585365068 [9471]) 980.0 6.446030e+09
In [9]:
df_hk.head()
Out[9]:
close mcap
2015-01-15 00:00:00+00:00 Equity(1178883566224472 [2689]) 5.85 3.149700e+10
Equity(1178888132186186 [98]) 1.69 6.395390e+08
Equity(1178888198043729 [130]) 1.54 4.549290e+08
Equity(1178892426823492 [2078]) 0.86 NaN
Equity(1178892559276620 [8337]) 0.84 3.579380e+08

Inferring a Domain

In the previous two examples, we defined pipelines with a generic dataset, and we explicitly passed a domain to the pipeline constructor. If we define a pipeline without providing a domain argument, the pipeline execution machinery will attempt to infer a domain from the contents of the pipeline. When a pipeline is defined using only generic datasets, the pipeline will default to the US_EQUITIES domain.

For example, the same pipeline we just ran will default to US_EQUITIES if we don't explicitly provide a domain.

In [10]:
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import factset, EquityPricing
from quantopian.research import run_pipeline

pipe = Pipeline(
    columns={
        'mcap': factset.Fundamentals.mkt_val.latest,
        'close': EquityPricing.close.latest,
    },
)

df = run_pipeline(pipe, '2015-01-01', '2016-01-01')
df.head()
Out[10]:
close mcap
2015-01-02 00:00:00+00:00 Equity(2 [ARNC]) 15.80 1.894870e+10
Equity(21 [AAME]) 4.03 8.214010e+07
Equity(24 [AAPL]) 110.43 5.910160e+11
Equity(25 [ARNC_PR]) 86.35 NaN
Equity(31 [ABAX]) 56.73 1.142140e+09

In some cases, we may want to define a pipeline that uses a specialized dataset, like Psychsignal. Psychsignal is specialized to the US_EQUITIES domain because it only provides a signal for US equities (at least, it only provides it for US equities in the Quantopian integration).

If we define a pipeline with a specialized dataset, the pipeline object will infer its domain based on the domain of that specialized dataset. In practice, that means the pipeline will default to the US_EQUITIES domain since the Pipeline API currently only has generic datasets and datasets specialized to the US_EQUITIES domain.

The example below runs a pipeline with a Psychsignal dataset. Since the pipeline infers the domain from the Psychsignal dataset, we don't need to provide a domain argument.

In [11]:
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.psychsignal import stocktwits
from quantopian.research import run_pipeline

pipe = Pipeline(
    columns={
        'bull_scored_messages': stocktwits.bull_scored_messages.latest
    },
)

df = run_pipeline(pipe, '2015-01-01', '2016-01-01')
df.head()
Out[11]:
bull_scored_messages
2015-01-02 00:00:00+00:00 Equity(2 [ARNC]) 2.0
Equity(21 [AAME]) 0.0
Equity(24 [AAPL]) 45.0
Equity(25 [ARNC_PR]) NaN
Equity(31 [ABAX]) 0.0

We can add a generic dataset and our pipeline will still default to the domain of the specialized dataset.

In [12]:
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import factset
from quantopian.pipeline.data.psychsignal import stocktwits
from quantopian.research import run_pipeline

pipe = Pipeline(
    columns={
        'mcap': factset.Fundamentals.mkt_val.latest,
        'bull_scored_messages': stocktwits.bull_scored_messages.latest
    },
)

df = run_pipeline(pipe, '2015-01-01', '2016-01-01')
df.head()
Out[12]:
bull_scored_messages mcap
2015-01-02 00:00:00+00:00 Equity(2 [ARNC]) 2.0 1.894870e+10
Equity(21 [AAME]) 0.0 8.214010e+07
Equity(24 [AAPL]) 45.0 5.910160e+11
Equity(25 [ARNC_PR]) NaN NaN
Equity(31 [ABAX]) 0.0 1.142140e+09

However, if we try to explicitly set the domain of the pipeline to CA_EQUITIES, we get an error.

In [13]:
# NOTE: This cell is expected to raise an error!
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import factset
from quantopian.pipeline.data.psychsignal import stocktwits
from quantopian.pipeline.domain import CA_EQUITIES
from quantopian.research import run_pipeline

pipe = Pipeline(
    columns={
        'mcap': factset.Fundamentals.mkt_val.latest,
        'bull_scored_messages': stocktwits.bull_scored_messages.latest
    },
    domain=CA_EQUITIES
)

df = run_pipeline(pipe, '2015-01-01', '2016-01-01')
df.head()

ValueError                                Traceback (most recent call last)
<ipython-input-13-1369dc028d81> in <module>()
     14 )
     15 
---> 16 df = run_pipeline(pipe, '2015-01-01', '2016-01-01')
     17 df.head()

/build/src/qexec_repo/qexec/research/api.py in run_pipeline(pipeline, start_date, end_date, chunksize)
    499             pipeline_engine,
    500             equity_trading_days,
--> 501             holdout_manager,
    502         )
    503 

/build/src/qexec_repo/qexec/research/_api.pyc in inner_run_pipeline(pipeline, start_date, end_date, chunksize, engine, equity_trading_days, holdout_manager)
    736     adjusted_end_date = adjust_date(end_date)
    737 
--> 738     holdout_manager.validate(pipeline, adjusted_end_date)
    739 
    740     return engine.run_chunked_pipeline(

/build/src/qexec_repo/qexec/holdouts.pyc in validate(self, pipeline, end_date)
    106 
    107         country_code = \
--> 108             pipeline.domain(default=self.default_domain).country_code
    109 
    110         errors = defaultdict(set)

/build/src/qexec_repo/zipline_repo/zipline/pipeline/pipeline.py in domain(self, default)
    269     def domain(self, default):
    270         """
--> 271         Get the domain for this pipeline.
    272 
    273         - If an explicit domain was provided at construction time, use it.

/build/src/qexec_repo/zipline_repo/zipline/pipeline/pipeline.pyc in domain(self, default)
    310                 raise ValueError(
    311                     "Conflicting domains in Pipeline. Inferred {}, but {} was "
--> 312                     "passed at construction.".format(inferred, self._domain)
    313                 )
    314             return inferred

ValueError: Conflicting domains in Pipeline. Inferred EquityCalendarDomain('US', 'XNYS'), but EquityCalendarDomain('CA', 'XTSE') was passed at construction.

This error message indicates that the domain we passed explicitly (CA_EQUITIES) conflicts with the domain that the pipeline inferred from the Psychsignal dataset (US_EQUITIES).
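
The rules described in this section can be summarized in a small toy function. This is a sketch under our own assumptions, not the Pipeline API's implementation: `resolve_domain` and the `GENERIC` sentinel are hypothetical, domains are represented as plain strings, and the message for inputs specialized to different domains is invented. Only the "Conflicting domains" message is modeled on the traceback above.

```python
GENERIC = "GENERIC"  # sentinel for datasets usable on any domain


def resolve_domain(dataset_domains, explicit=None, default="US_EQUITIES"):
    """Toy domain resolution for a pipeline.

    `dataset_domains` holds one entry per input dataset: either GENERIC
    or the name of the domain the dataset is specialized to.
    """
    specialized = {d for d in dataset_domains if d != GENERIC}
    if len(specialized) > 1:
        # Hypothetical message: this case is not shown in the notebook.
        raise ValueError(
            "Inputs are specialized to different domains: {}".format(
                sorted(specialized)
            )
        )
    inferred = specialized.pop() if specialized else None

    if explicit is None:
        # Only generic inputs: fall back to the default domain.
        return inferred if inferred is not None else default
    if inferred is not None and inferred != explicit:
        raise ValueError(
            "Conflicting domains in Pipeline. Inferred {}, but {} was "
            "passed at construction.".format(inferred, explicit)
        )
    return explicit
```

With only generic inputs, the toy function returns the default or the explicit domain. With a US-specialized input, it returns US_EQUITIES, and it raises a ValueError if CA_EQUITIES is passed explicitly, as in the cell above.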

Note: In backtesting, the domain of a pipeline is always inferred from the trading calendar. Currently, only the US_EQUITIES domain is supported in the Pipeline API in backtesting.

Conclusion

In this notebook, we:

  • Reminded ourselves of the problem that the Pipeline API solves, and limitations that existed prior to domains.
  • Introduced the concept of domains and learned how they extend the Pipeline API to provide more general functionality.
  • Demonstrated the impact of choosing a domain on the inputs to a pipeline.
  • Introduced 'generic' datasets and learned how the Pipeline API infers domains based on the domain(s) of its input datasets.

Currently, this notebook is the best reference material on domains. We are still working on writing official documentation. We will make an announcement in the forums when more reference material becomes available.