Normalization and Classifiers

Earlier this week we shipped several new Pipeline API features that combine to make it significantly easier to perform grouped operations on Factors. The most common use for the new functionality will likely be construction of sector-normalized Factors, but the new additions to the Pipeline toolbox have far broader application.

There are two important new additions to the Pipeline API: Factor normalization methods, and a Classifier expression type. It's easiest to understand the value of Classifiers after seeing a simple normalization example.

Normalization Methods: demean and zscore

Many Factors produce results that are not directly comparable with the results of other Factors. A technical indicator like RSI might produce an output bounded between 0 and 100, whereas a fundamental ratio might produce values anywhere on the real line. When we want to incorporate multiple incommensurable factors into a single model, it's often helpful to apply a normalization step to the Factor outputs to make direct comparisons more meaningful.

The first major feature in this release is the addition of two new methods on the Factor base class: demean and zscore. These methods are designed to make it easier to normalize the results of Factor computations.

Example: demean

demean() is the simpler of the two new normalization methods. Calling demean() on a factor produces a new factor that first computes the original factor, then subtracts each day's mean over all assets from that day's output.
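Conceptually, this is the per-day computation (a toy pandas sketch with made-up values, not the actual Pipeline implementation):

import pandas as pd

# Hypothetical raw factor output: rows are dates, columns are assets.
raw = pd.DataFrame(
    {'AAA': [0.10, 0.02], 'BBB': [0.04, -0.01], 'CCC': [-0.02, 0.05]},
    index=pd.to_datetime(['2014-01-02', '2014-01-03']),
)
# Subtract each day's cross-sectional (row-wise) mean.
demeaned = raw.sub(raw.mean(axis=1), axis=0)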

In this example, we compare the results of a 30-day returns factor before and after de-meaning.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from quantopian.pipeline import Pipeline
from quantopian.pipeline.factors import Returns
from quantopian.research import run_pipeline
In [2]:
def demean_example():
    returns = Returns(window_length=30)
    demeaned_returns = returns.demean()
    # Our pipeline includes the original returns factor, the demeaned version, 
    # and the difference between them.
    return Pipeline(
        columns={
            'vanilla': returns, 
            'demeaned': demeaned_returns,
            'diff': returns - demeaned_returns,
        }
    )
In [3]:
# Since demean() just subtracts the daily mean from the 'vanilla' column, the difference
# between 'vanilla' and 'demeaned' should be constant within each day.
results0 = run_pipeline(demean_example(), '2014', '2014-03-01')
results0.head()
Out[3]:
demeaned diff vanilla
2014-01-02 00:00:00+00:00 Equity(2 [AA]) 0.152321 0.036717 0.189038
Equity(21 [AAME]) -0.016866 0.036717 0.019851
Equity(24 [AAPL]) 0.045246 0.036717 0.081963
Equity(25 [AA_PR]) -0.046151 0.036717 -0.009434
Equity(31 [ABAX]) 0.092516 0.036717 0.129233
In [4]:
# The value of that constant difference should be the daily mean.
# Note that 0.036717 on day one here matches the difference in the frame above.
results0['vanilla'].groupby(level=0).mean().head()
Out[4]:
2014-01-02 00:00:00+00:00    0.036717
2014-01-03 00:00:00+00:00    0.038169
2014-01-06 00:00:00+00:00    0.046453
2014-01-07 00:00:00+00:00    0.036338
2014-01-08 00:00:00+00:00    0.039288
Name: vanilla, dtype: float64
In [5]:
# A nice property of a de-meaned Factor is that it is centered about 0.
# (The daily means below are zero up to floating-point rounding error.)
results0['demeaned'].groupby(level=0).mean().head()
Out[5]:
2014-01-02 00:00:00+00:00    1.920103e-18
2014-01-03 00:00:00+00:00   -3.160884e-19
2014-01-06 00:00:00+00:00    1.009937e-16
2014-01-07 00:00:00+00:00   -8.745146e-17
2014-01-08 00:00:00+00:00    1.001762e-17
Name: demeaned, dtype: float64

Example: zscore

zscore() is only slightly more complex than demean(). In addition to subtracting the daily mean from each output, it also divides by the daily standard deviation. If an asset has a Z-Score of 1 on a given day for some factor, then the value of that factor was one standard deviation above the daily mean. Similarly, an asset with a Z-Score of -1 was one standard deviation below the daily mean.

The API for zscore() is identical to that of demean().
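Conceptually, the per-day computation extends the demean sketch above (again a toy pandas illustration, not the Pipeline implementation):

import pandas as pd

# The same hypothetical raw factor output used in the demean sketch.
raw = pd.DataFrame(
    {'AAA': [0.10, 0.02], 'BBB': [0.04, -0.01], 'CCC': [-0.02, 0.05]},
    index=pd.to_datetime(['2014-01-02', '2014-01-03']),
)
# Subtract the daily mean, then divide by the daily standard deviation.
zscored = raw.sub(raw.mean(axis=1), axis=0).div(raw.std(axis=1), axis=0)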

In [6]:
def zscore_example():
    returns = Returns(window_length=30)
    zscored_returns = returns.zscore()
    return Pipeline(
        columns={
            'vanilla': returns,
            'zscored': zscored_returns,
        },
        screen=returns.notnull(),
    )
In [7]:
results1 = run_pipeline(zscore_example(), '2014', '2014-03-01')
results1.head()
Out[7]:
vanilla zscored
2014-01-02 00:00:00+00:00 Equity(2 [AA]) 0.189038 1.115770
Equity(21 [AAME]) 0.019851 -0.123542
Equity(24 [AAPL]) 0.081963 0.331433
Equity(25 [AA_PR]) -0.009434 -0.338058
Equity(31 [ABAX]) 0.129233 0.677688
In [8]:
results1.describe()
Out[8]:
vanilla zscored
count 304199.000000 3.041990e+05
mean 0.030987 -6.153825e-17
std 0.170554 1.000002e+00
min -0.950000 -6.813738e+00
25% -0.026756 -3.453011e-01
50% 0.013514 -1.081289e-01
75% 0.059227 1.713692e-01
max 17.295082 6.567897e+01

Masked Normalization

Often we only want to consider some portion of the full tradeable universe when computing a normalization. It can be useful, for example, to exclude extreme outliers from mean and standard deviation computations. We often also want to ignore assets that are hard to trade, either because they're illiquid or because they're nonstandard share classes.

Both demean and zscore accept an optional mask argument, which can be passed a Filter. When a Filter is supplied as a mask, we treat all locations where the Filter produced False as though those locations had NaN values in the data being normalized. Since demean and zscore already know how to ignore NaNs, providing a mask has the effect of removing the masked values from our normalization calculation.
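In pandas terms, the masking behaves roughly like this (a toy sketch with hypothetical values; the real work happens inside the Pipeline engine):

import pandas as pd

raw = pd.DataFrame(
    {'AAA': [0.10, 0.02], 'BBB': [0.04, -0.01], 'CCC': [-0.02, 0.05]},
    index=pd.to_datetime(['2014-01-02', '2014-01-03']),
)
# Hypothetical Filter output: False marks assets to exclude.
mask = pd.DataFrame(
    {'AAA': [True, True], 'BBB': [False, True], 'CCC': [True, False]},
    index=raw.index,
)
# Masked locations become NaN, and the NaN-aware daily mean ignores them.
masked = raw.where(mask)
masked_demeaned = masked.sub(masked.mean(axis=1), axis=0)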

Example: Z-Score Returns After Removing Illiquid Assets, Outliers, and Non-Primary Shares

In this example, we compute the same 30-day returns factor. We then Z-Score those returns, ignoring non-primary share classes, stocks with low dollar volume, and stocks with very high or low returns.

In [9]:
from quantopian.pipeline.factors import AverageDollarVolume
from quantopian.pipeline.filters.morningstar import IsPrimaryShare

def masked_zscore_returns_example():
    returns = Returns(window_length=30)
    
    is_liquid = AverageDollarVolume(window_length=30).percentile_between(25, 100)
    is_primary = IsPrimaryShare()
    no_returns_outliers = returns.percentile_between(2, 98)
    base_universe = is_liquid & no_returns_outliers & is_primary
    
    masked_zscored = returns.zscore(mask=base_universe)
    
    return Pipeline(
        columns={'masked_zscored': masked_zscored, 'returns': returns}, 
        screen=masked_zscored.notnull()
    )

results2 = run_pipeline(masked_zscore_returns_example(), '2014', '2014-03-01')
results2.head()
Out[9]:
masked_zscored returns
2014-01-02 00:00:00+00:00 Equity(2 [AA]) 1.532109 0.189038
Equity(24 [AAPL]) 0.342244 0.081963
Equity(31 [ABAX]) 0.867525 0.129233
Equity(52 [ABM]) -0.244158 0.029193
Equity(53 [ABMD]) -0.679725 -0.010004
In [10]:
results2.describe()
Out[10]:
masked_zscored returns
count 1.383310e+05 138331.000000
mean -9.979936e-18 0.026548
std 1.000004e+00 0.103754
min -2.615584e+00 -0.235727
25% -6.367952e-01 -0.038596
50% -1.381797e-01 0.014829
75% 4.771117e-01 0.076733
max 4.032615e+00 0.499259
In [11]:
import seaborn as sns
import matplotlib.pyplot as plt

def zscore_histogram(axis, series, ylabel=None):
    plot = sns.distplot(series, ax=axis, kde=False)
    plot.set_yscale('log')
    plot.grid(False)
    if ylabel:
        plot.set_ylabel(ylabel)
    return plot

fig, plots = plt.subplots(ncols=2, sharey=True)

zscore_histogram(plots[0], results1.zscored, ylabel="# of Assets in Z-Score Range")
zscore_histogram(plots[1], results2.masked_zscored)

sns.despine(fig=fig, top=True, right=True)

When Z-Scoring without masking, the vast majority of assets have Z-Scores near 0 (note that the above plots are log-scale), because a small number of outliers distort the distribution significantly.

After masking, our Z-Scores are much more uniformly distributed.

Classifiers

Another common scenario encountered when working with financial data is the need to transform Factor results based on some method of labelling assets. For example, when comparing assets based on some fundamental ratio, it might make more sense to compare each asset to other assets in the same industry instead of comparing against the full universe of assets. Or we might want to compare across companies of approximately the same size.

The second major feature that's been added in this release is a new core expression type: Classifier. Whereas Factors are expressions producing numerical values, and Filters are expressions producing boolean (True/False) values, Classifiers are expressions that produce labels, which can then be used as grouping keys for another expression.

Both demean() and zscore() accept an optional groupby parameter, which can be passed a Classifier. Providing a groupby causes the normalization to be applied separately to each group of assets that received the same label from the grouping classifier.
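The effect is analogous to a per-day pandas groupby-transform. Here's a toy sketch of a single day's cross-section (hypothetical values and labels, not the Pipeline implementation):

import pandas as pd

# One day's cross-section: a factor value and a sector label per asset
# (101 and 311 are Morningstar sector codes; the values are made up).
day = pd.DataFrame({
    'value':  [0.10, 0.04, -0.02, 0.07],
    'sector': [101, 101, 311, 311],
}, index=['AAA', 'BBB', 'CCC', 'DDD'])

# Z-Score within each sector group rather than across the whole universe.
grouped_z = day.groupby('sector')['value'].transform(
    lambda v: (v - v.mean()) / v.std()
)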

There are currently three ways to construct a Classifier (summarized in a code sketch after this list):

  • The .latest attribute of any morningstar column of dtype int64 produces a Classifier. There are currently nine such columns:

    • morningstar.asset_classification.cannaics
    • morningstar.asset_classification.morningstar_economy_sphere_code
    • morningstar.asset_classification.morningstar_industry_code
    • morningstar.asset_classification.morningstar_industry_group_code
    • morningstar.asset_classification.morningstar_sector_code
    • morningstar.asset_classification.naics
    • morningstar.asset_classification.sic
    • morningstar.asset_classification.stock_type
    • morningstar.asset_classification.style_box

    More information on each of these columns can be found in the Fundamentals API Reference.

  • There are two new directly-importable Classifier subclasses:

    • quantopian.pipeline.classifiers.morningstar.Sector
    • quantopian.pipeline.classifiers.morningstar.SuperSector

    These built-in classifiers produce the same output as morningstar_sector_code.latest and morningstar_economy_sphere_code.latest, respectively. However, because we expect these to be the most commonly-used classifier columns, Sector and SuperSector provide a few special facilities beyond what's available from the generic .latest classifiers:

    • Sector and SuperSector have hand-written docstrings that are accessible via __doc__ or the ? magic.
    • Sector and SuperSector provide symbolic names for their labels as class-level attributes. For example, Sector.BASIC_MATERIALS is set to 101, the sector code used by Morningstar for companies in the materials space. SuperSector provides similar symbolic names; for example, SuperSector.CYCLICAL is set to 1.
  • There are several new Factor methods that produce classifiers by ranking and bucketing stocks based on quantiles of a Factor. The most general of these methods is Factor.quantiles(), which takes an integer indicating how many buckets to use. For example, if we wanted to group securities into small, medium, and large cap buckets, we could do MarketCap().quantiles(3).

    Factor.quartiles(), Factor.quintiles(), and Factor.deciles() have also been added. These are simple convenience aliases for quantiles(4), quantiles(5), and quantiles(10), respectively.
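For reference, here's a minimal sketch showing all three construction routes side by side (imports as used elsewhere in this notebook; the sketch only builds the classifiers, it doesn't run a pipeline):

from quantopian.pipeline.data.morningstar import asset_classification
from quantopian.pipeline.classifiers.morningstar import Sector
from quantopian.pipeline.factors import AverageDollarVolume

# 1. .latest on an int64 column produces a generic Classifier.
sector_from_column = asset_classification.morningstar_sector_code.latest

# 2. A built-in Classifier subclass with docs and symbolic constants.
sector_builtin = Sector()

# 3. Quantile-bucketing an existing Factor (here, a liquidity factor).
liquidity_quartile = AverageDollarVolume(window_length=30).quartiles()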

Example: Built-In vs. Generic Classifiers

In [12]:
from quantopian.pipeline.data.morningstar import asset_classification, valuation
from quantopian.pipeline.classifiers.morningstar import Sector

# These produce the same data, but Sector has symbolic constants and hand-written docs:
sector_generic = asset_classification.morningstar_sector_code
sector_builtin = Sector()
In [13]:
print (
    "Docs for built-in Sector class:\n" + sector_builtin.__doc__
)
print "Symbolic Constants:" 
dir(sector_builtin)[1:12]
Docs for built-in Sector class:

        Classifier that groups assets by Morningstar Sector Code.

        There are 11 possible classifications:

        * 101 - Basic Materials
        * 102 - Consumer Cyclical
        * 103 - Financial Services
        * 104 - Real Estate
        * 205 - Consumer Defensive
        * 206 - Healthcare
        * 207 - Utilities
        * 308 - Communication Services
        * 309 - Energy
        * 310 - Industrials
        * 311 - Technology

        These values are provided as integer constants on the class.

        For more information on morningstar classification codes, see:
        https://www.quantopian.com/help/fundamentals#industry-sector.
        
Symbolic Constants:
Out[13]:
['BASIC_MATERIALS',
 'COMMUNICATION_SERVICES',
 'CONSUMER_CYCLICAL',
 'CONSUMER_DEFENSIVE',
 'ENERGY',
 'FINANCIAL_SERVICES',
 'HEALTHCARE',
 'INDUSTRIALS',
 'REAL_ESTATE',
 'TECHNOLOGY',
 'UTILITIES']
In [14]:
print "The Basic Materials sector code is %d." % Sector.BASIC_MATERIALS
The Basic Materials sector code is 101.
In [15]:
print (
    "Docs for generic sector (this is the docstring that's generated "
    "for all columns):\n" + sector_generic.__doc__
)
sector_generic
Docs for generic sector (this is the docstring that's generated for all columns):

    A column of data that's been concretely bound to a particular dataset.

    Instances of this class are dynamically created upon access to attributes
    of DataSets (for example, USEquityPricing.close is an instance of this
    class).

    Attributes
    ----------
    dtype : numpy.dtype
        The dtype of data produced when this column is loaded.
    latest : zipline.pipeline.data.Factor or zipline.pipeline.data.Filter
        A Filter, Factor, or Classifier computing the most recently known value
        of this column on each date.

        Produces a Filter if self.dtype == ``np.bool_``.
        Produces a Classifier if self.dtype == ``np.int64``
        Otherwise produces a Factor.
    dataset : zipline.pipeline.data.DataSet
        The dataset to which this column is bound.
    name : str
        The name of this column.
    
Out[15]:
asset_classification.morningstar_sector_code::int64

Example: Compute Earnings Yield, Z-Scored by Sector

In this example, we load the most recently-known earnings yield for each asset and compare the effect of Z-Scoring over the whole universe vs. Z-Scoring by sector group.

In [16]:
from quantopian.pipeline.data.morningstar import valuation_ratios

def grouped_earnings_yield_example():
    sector = Sector()
    earning_yield = valuation_ratios.earning_yield.latest
    
    zscored_naive = earning_yield.zscore()
    zscored_grouped = earning_yield.zscore(groupby=sector)
    return Pipeline(
        columns={
            'sector': sector,
            'yield': earning_yield,
            'yield_zscored': zscored_naive,
            'yield_zscored_grouped': zscored_grouped,
        },
        screen=zscored_grouped.notnull(),
    )
    
yields = run_pipeline(grouped_earnings_yield_example(), '2014', '2014-03')
yields.head()
Out[16]:
sector yield yield_zscored yield_zscored_grouped
2014-01-02 00:00:00+00:00 Equity(2 [AA]) 101 0.0284 0.014357 0.175227
Equity(21 [AAME]) 103 0.1111 0.014372 0.202752
Equity(24 [AAPL]) 311 0.0715 0.014365 0.444843
Equity(31 [ABAX]) 206 0.0233 0.014356 0.398354
Equity(39 [DDC]) 101 -0.0037 0.014352 0.158035
In [17]:
yields.describe()
Out[17]:
sector yield yield_zscored yield_zscored_grouped
count 200500.000000 200500.000000 200500.000000 2.005000e+05
mean 201.897531 -2.128470 0.000538 -6.471077e-17
std 93.195758 895.125754 0.998611 1.000002e+00
min 101.000000 -400797.330300 -70.011683 -2.372282e+01
25% 103.000000 -0.028600 0.015968 3.518253e-02
50% 206.000000 0.032100 0.023992 1.437575e-01
75% 310.000000 0.058000 0.160452 2.398622e-01
max 311.000000 32.180800 6.544085 1.807121e+01

One thing that should look immediately suspicious about the output above is that our non-grouped Z-Scores are all very close to each other. This is usually an indication that large outliers are compressing our results by inflating the standard deviation of the data.
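To see this compression effect in isolation, consider Z-Scoring a small array that contains one enormous outlier (a toy numpy example):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 1000.0])
z = (x - x.mean()) / x.std()
# The outlier inflates the standard deviation so much that the three
# ordinary values all land within ~0.005 of one another, near -0.58.
print(z)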

Plotting the magnitude of the min and max by sector quickly confirms that we do indeed have some large outliers:

In [18]:
# Slice off the first few dates for visualization
yields_initial = (
    yields['2014-01-02':'2014-01-06']
    .reset_index()
    .rename(columns=dict(level_0='date', level_1='asset'))
)

fig, (max_plot, min_plot) = plt.subplots(2, 1)

# Draw the maximum yield by sector on each date.
sns.barplot(
    x='date', 
    y='yield', 
    hue='sector', 
    data=yields_initial,
    ci=None,
    estimator=np.max,
    log=True,
    ax=max_plot,
    palette="Set3",
)
max_plot.set_ylabel('Maximum Yield by Sector')
max_plot.set_xlabel('Date')
max_plot.set_ylim(1e-1, 1e2);
max_plot.legend(ncol=3, title='Sector Code')

# Draw the minimum yield by sector on each date.
sns.barplot(
    x='date', 
    y='yield', 
    hue='sector', 
    data=yields_initial,
    ci=None,
    estimator=lambda arr: abs(np.min(arr)),
    log=True,
    ax=min_plot,  # Draw explicitly on the lower axes.
    palette="Set3",
)
min_plot.set_ylabel('Minimum Yield (Magnitude) by Sector')
min_plot.set_xlabel('Date')
min_plot.set_ylim(1e-1, 1e7);
min_plot.legend(ncol=3, title='Sector Code');

Note that the high bars in the above are actually much more extreme than they appear, since they're plotted on a log scale!

We can better see the flattening effect of Z-Scoring with outliers, as well as the effect of normalizing by sector, by plotting the difference between the max and min values for each sector:

In [19]:
def plot_range_by_sector(data, ycol, axis, ylimits, log_scale):
    """
    Generate a bar-chart of data[ycol], with a bar for each unique value in data['sector'].
    
    The height of each bar is the difference between the sector max and the sector min on each day.
    """
    plot = sns.barplot(
        x='date', 
        y=ycol, 
        hue='sector', 
        data=data, 
        ax=axis,
        ci=None,
        estimator=lambda row: abs(np.max(row) - np.min(row)),
        palette="Set3",
    )
    plot.set_ylim(*ylimits)
    if log_scale:
        plot.set_yscale('log')
    plot.set_ylabel("(Max - Min): " + ycol)
    plot.legend(ncol=3, title='Sector Code')
    return plot
In [20]:
fig, plots = plt.subplots(3, 1, figsize=(14, 20))

plot_range_by_sector(
    yields_initial, 
    ycol='yield', 
    axis=plots[0],
    ylimits=(1e-2, 1e6),
    log_scale=True
)
plot_range_by_sector(
    yields_initial, 
    ycol='yield_zscored',
    ylimits=(0, 80),
    axis=plots[1],
    log_scale=False,
)
plot_range_by_sector(
    yields_initial, 
    ycol='yield_zscored_grouped',
    ylimits=(0, 80),
    axis=plots[2],
    log_scale=False,
    # The trailing ';' suppresses the notebook's textual display of the return value.
);

In the raw yields, we can see that there's an enormous outlier (note that the raw yield chart is log-scale) on day 1 in sector 308 (Communication Services). This outlier single-handedly inflates the standard deviation over the whole distribution enough to compress the min/max values in the ungrouped Z-Scores (the center plot) to almost zero.

In the sector-grouped Z-Scores (the bottom plot), the effects of the large outlier are contained within the sector, which allows the values in other sectors to better reflect the diversity of observed yields.

What the above plot doesn't show, however, is that the distribution of values within sectors with outliers is still compressed considerably. What we'd really like to do is apply our grouped normalization, while also removing extreme values from the distribution.

A crude filter picks out the worst offenders easily:

In [21]:
# Note in particular the (likely erroneous) value of -400797 on day 1 for HMTV.  This is the big
# day 1 outlier in the charts above.
yields[(yields['yield'] < -10) | (yields['yield'] > 10)]
Out[21]:
sector yield yield_zscored yield_zscored_grouped
2014-01-02 00:00:00+00:00 Equity(15988 [EMIT_F]) 104 -12.7151 0.012140 -14.922957
Equity(19177 [LEU]) 101 -25.8171 0.009860 -13.666419
Equity(25784 [CPSL]) 101 -14.0763 0.011903 -7.378594
Equity(26422 [HXM]) 102 -15.8587 0.011593 -19.333553
Equity(38290 [AUMN]) 101 -10.6887 0.012492 -5.564353
Equity(39581 [PGRX]) 101 -12.1691 0.012235 -6.357186
Equity(41845 [HMTV]) 308 -400797.3303 -69.728042 -10.677078
Equity(45252 [WPT]) 309 32.1808 0.019952 18.071212
2014-01-03 00:00:00+00:00 Equity(7317 [TBAC]) 102 -10.6487 -18.402264 -11.788670
Equity(15988 [EMIT_F]) 104 -11.5059 -19.891944 -14.993718
Equity(25784 [CPSL]) 101 -14.3532 -24.840109 -11.952047
Equity(26422 [HXM]) 102 -16.1277 -27.923914 -17.904300
Equity(38290 [AUMN]) 101 -11.8883 -20.556496 -9.867293
2014-01-06 00:00:00+00:00 Equity(7317 [TBAC]) 102 -10.6487 -18.400411 -11.788670
Equity(15988 [EMIT_F]) 104 -11.5059 -19.889944 -14.993718
Equity(25784 [CPSL]) 101 -14.3532 -24.837621 -11.952047
Equity(26422 [HXM]) 102 -16.1277 -27.921122 -17.904300
Equity(38290 [AUMN]) 101 -11.8883 -20.554430 -9.867293
2014-01-07 00:00:00+00:00 Equity(7317 [TBAC]) 102 -10.6487 -18.400411 -11.788670
Equity(15988 [EMIT_F]) 104 -11.5059 -19.889944 -14.993718
Equity(25784 [CPSL]) 101 -14.3532 -24.837621 -11.952047
Equity(26422 [HXM]) 102 -16.1277 -27.921122 -17.904300
Equity(38290 [AUMN]) 101 -11.8883 -20.554430 -9.867293
2014-01-08 00:00:00+00:00 Equity(7317 [TBAC]) 102 -10.6487 -18.398826 -11.788670
Equity(15988 [EMIT_F]) 104 -11.5059 -19.888238 -14.993718
Equity(25784 [CPSL]) 101 -14.3532 -24.835514 -11.952047
Equity(26422 [HXM]) 102 -16.1277 -27.918765 -17.904300
Equity(38290 [AUMN]) 101 -11.8883 -20.552671 -9.867293
2014-01-09 00:00:00+00:00 Equity(7317 [TBAC]) 102 -10.6487 -18.396951 -11.778982
Equity(15988 [EMIT_F]) 104 -11.5059 -19.886215 -14.993718
... ... ... ... ... ...
2014-02-24 00:00:00+00:00 Equity(15988 [EMIT_F]) 104 -15.2362 -1.312981 -0.232225
Equity(25784 [CPSL]) 101 -12.3763 -1.063073 -12.376852
Equity(26422 [HXM]) 102 -11.9884 -1.029177 -17.315477
Equity(46205 [SFR]) 104 -800.8205 -69.960109 -15.552924
Equity(46271 [DRNA]) 206 -17.2560 -1.489479 -21.570494
2014-02-25 00:00:00+00:00 Equity(15988 [EMIT_F]) 104 -15.2362 -1.312852 -0.232225
Equity(25784 [CPSL]) 101 -12.3763 -1.062969 -12.376852
Equity(26422 [HXM]) 102 -11.9884 -1.029077 -17.315474
Equity(46205 [SFR]) 104 -800.8205 -69.952988 -15.552924
Equity(46271 [DRNA]) 206 -17.2560 -1.489331 -21.555570
2014-02-26 00:00:00+00:00 Equity(15988 [EMIT_F]) 104 -15.2362 -1.312992 -0.232225
Equity(25784 [CPSL]) 101 -12.3763 -1.063084 -12.358383
Equity(26422 [HXM]) 102 -11.9884 -1.029188 -17.329437
Equity(46205 [SFR]) 104 -800.8205 -69.960118 -15.552924
Equity(46271 [DRNA]) 206 -17.2560 -1.489489 -21.555570
2014-02-27 00:00:00+00:00 Equity(15988 [EMIT_F]) 104 -15.2362 -1.312993 -0.232225
Equity(25784 [CPSL]) 101 -12.3763 -1.063085 -12.358383
Equity(26422 [HXM]) 102 -11.9884 -1.029189 -17.330370
Equity(46205 [SFR]) 104 -800.8205 -69.960119 -15.552924
Equity(46271 [DRNA]) 206 -17.2560 -1.489490 -21.555570
2014-02-28 00:00:00+00:00 Equity(15988 [EMIT_F]) 104 -15.2362 -1.312581 -0.232225
Equity(25784 [CPSL]) 101 -12.3763 -1.062749 -12.321792
Equity(26422 [HXM]) 102 -11.9884 -1.028864 -17.329224
Equity(46205 [SFR]) 104 -800.8205 -69.938728 -15.552924
Equity(46271 [DRNA]) 206 -17.2560 -1.489024 -21.538254
2014-03-03 00:00:00+00:00 Equity(15988 [EMIT_F]) 104 -15.2362 -1.312444 -0.232225
Equity(25784 [CPSL]) 101 -12.3763 -1.062638 -12.303281
Equity(26422 [HXM]) 102 -11.9884 -1.028755 -17.329171
Equity(46205 [SFR]) 104 -800.8205 -69.931597 -15.552924
Equity(46271 [DRNA]) 206 -17.2560 -1.488869 -21.538254

213 rows × 4 columns

Example: Compute Earnings Yield, Z-Scored by Sector, Excluding Outliers

The mask and groupby parameters can be used together to perform grouped transformations while ignoring undesired values.

In [22]:
def masked_grouped_earnings_yield_example():
    sector = Sector()
    earning_yield = valuation_ratios.earning_yield.latest
    # We could also have done something like earning_yield.percentile_between(1, 99) here.
    without_outliers = (-10 < earning_yield) & (earning_yield < 10)
    
    zscored_masked = earning_yield.zscore(mask=without_outliers)
    zscored_masked_grouped = earning_yield.zscore(mask=without_outliers, groupby=sector)

    return Pipeline(columns={'sector': sector,
                             'yield': earning_yield,
                             'yield_zscored_masked': zscored_masked,
                             'yield_zscored_masked_grouped': zscored_masked_grouped},
                    screen=zscored_masked_grouped.notnull())

masked_yields = run_pipeline(masked_grouped_earnings_yield_example(), '2014', '2014-03')
masked_yields.head()
Out[22]:
sector yield yield_zscored_masked yield_zscored_masked_grouped
2014-01-02 00:00:00+00:00 Equity(2 [AA]) 101 0.0284 0.181894 0.329021
Equity(21 [AAME]) 103 0.1111 0.372054 0.202752
Equity(24 [AAPL]) 311 0.0715 0.280998 0.444843
Equity(31 [ABAX]) 206 0.0233 0.170167 0.398354
Equity(39 [DDC]) 101 -0.0037 0.108083 0.253628
In [23]:
# Note that the min/max on `yield` are now between -10 and 10.
masked_yields.describe()
Out[23]:
sector yield yield_zscored_masked yield_zscored_masked_grouped
count 200287.000000 200287.000000 200287.000000 2.002870e+05
mean 201.991567 -0.040724 0.001117 -1.227580e-16
std 93.193610 0.399704 0.995850 1.000002e+00
min 101.000000 -9.631100 -25.752840 -2.372282e+01
25% 103.000000 -0.028400 0.031100 6.659613e-03
50% 206.000000 0.032100 0.181794 1.620947e-01
75% 310.000000 0.058000 0.249094 2.685622e-01
max 311.000000 6.666400 16.335766 1.477042e+01

Example: Quantile-based Classifiers

The above examples all use built-in classifiers. It's also possible to create a Classifier from any existing Factor via the quantiles method.

In this example, we build a market-cap decile classifier and use it to compute each asset's 30-day returns in excess of its decile's mean.

In [24]:
from quantopian.pipeline.data.morningstar import valuation

def quantiles_example():
    market_cap = valuation.market_cap.latest
    market_cap_decile = market_cap.deciles()  # Equivalent to market_cap.quantiles(10)
    
    returns = Returns(window_length=30)
    
    excess_returns = returns.demean(
        mask=returns.percentile_between(1, 99),
        groupby=market_cap_decile,  # Grouping by a computed classifier works as expected.
    )
    
    return Pipeline(
        columns={
            'market_cap': market_cap,
            'market_cap_decile': market_cap_decile,  # Classifiers can be set as output columns.
            'returns': returns,
            'excess_returns': excess_returns,
        },
        screen=excess_returns.notnull(),
    )
In [25]:
quantiles_results = run_pipeline(quantiles_example(), '2014', '2014-03')
quantiles_results.head()
Out[25]:
excess_returns market_cap market_cap_decile returns
2014-01-02 00:00:00+00:00 Equity(2 [AA]) 0.158390 1.027890e+10 8 0.189038
Equity(21 [AAME]) -0.016481 8.480670e+07 1 0.019851
Equity(24 [AAPL]) 0.050370 5.003170e+11 9 0.081963
Equity(31 [ABAX]) 0.077012 8.013110e+08 5 0.129233
Equity(39 [DDC]) 0.019475 1.144070e+09 5 0.071695

With our quantiles classifier, we can observe the well-known tendency of small-cap stocks to outperform large-cap stocks, at least over this sample period:

In [26]:
returns_by_decile = quantiles_results.groupby(['market_cap_decile'])['returns'].mean()

plot = sns.barplot(
    x=returns_by_decile.index,
    y=returns_by_decile.values,
    color='red',
)
plot.set_title("30-Day Returns by Market Cap Decile", fontsize=16)
sns.despine(plot.figure)

plot.grid(False)
plot.set_xlabel("Market Cap Decile", fontsize=14);
In [29]:
fig, plot = plt.subplots(1, 1, figsize=(14, 8))

colors = sns.color_palette('RdBu_r', 10)
for decile, color in enumerate(colors):
    sns.distplot(
        quantiles_results['excess_returns'][quantiles_results['market_cap_decile'] == decile],
        hist=False,
        color=color,
        label=decile,
        ax=plot,
    )
    
plot.legend(title='Decile', fontsize=14)
plot.set_xlabel('Excess Return over Decile Mean', fontsize=14)
sns.despine(fig=fig)
plot.grid(False)
plot.set_yticklabels([]);
plot.set_title('Distribution of Returns by Market Cap Decile', fontsize=16);

Conclusions and Future Work

Classifiers have been a planned extension to the Pipeline API since the early days of the design. I think they're one of the last missing pieces needed to support many truly sophisticated quant workflows, so I'm excited to finally be able to see what the community builds with these features.

In the coming weeks, we hope to have a slew of additional small improvements to classifiers, including groupby support for Factor.rank(), and support for non-integer columns (especially strings) from the morningstar database.

For more info on working with Classifiers and normalizations, there's a new Help Docs Section on Normalization, and new API Reference entries for Classifier, the new builtins, Factor.demean, and Factor.zscore.