Earlier this week we shipped several new Pipeline API features that combine to make it significantly easier to perform grouped operations on Factors. The most common use for the new functionality will likely be construction of sector-normalized Factors, but the new additions to the Pipeline toolbox have far broader application.
There are two important new additions to the Pipeline API: Factor normalization methods, and a Classifier expression type. It's easiest to understand the value of Classifiers after seeing a simple normalization example.

demean and zscore

Many Factors produce results that are not directly comparable with the results of other Factors. A technical indicator like RSI might produce an output bounded between 1 and 100, whereas a fundamental ratio might produce a value that can be any real number. When we want to incorporate multiple incommensurable factors into a single model, it's often helpful to apply a normalization step to the Factor outputs to make direct comparisons more meaningful.
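As a toy illustration of the problem (this is plain NumPy, not the Pipeline API, and the values are made up), two factors on very different scales become directly comparable once each is normalized:

```python
import numpy as np

# Hypothetical outputs from two factors with incompatible scales.
rsi_like = np.array([25.0, 50.0, 75.0, 90.0])    # bounded indicator, e.g. 1-100
ratio_like = np.array([-0.3, 0.02, 0.15, 1.8])   # unbounded fundamental ratio

def normalize(values):
    """Z-score: subtract the mean, divide by the standard deviation."""
    return (values - values.mean()) / values.std()

# Both normalized arrays have mean 0 and standard deviation 1, so their
# values live on the same scale and can be combined in a single model.
print(normalize(rsi_like))
print(normalize(ratio_like))
```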
The first major feature in this release is the addition of two new methods on the Factor base class: demean and zscore. These methods make it easier to normalize the results of Factor computations.
demean

demean() is the simpler of the two new normalization methods. Calling demean() on a factor produces a new factor that first computes the original factor, and then subtracts each day's mean over all assets from that day's output.
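The de-meaning step itself is simple enough to sketch in plain NumPy (this is an illustration with made-up numbers, not the Pipeline implementation):

```python
import numpy as np

# One hypothetical day of factor outputs across four assets.
day_outputs = np.array([0.05, -0.02, 0.12, 0.01])

# Subtract the daily cross-sectional mean from each asset's output.
demeaned = day_outputs - day_outputs.mean()

# After subtracting the daily mean, the outputs average to zero.
print(demeaned)
```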
In this example, we compare the results of a 30-Day returns factor, before and after de-meaning.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from quantopian.pipeline import Pipeline
from quantopian.pipeline.factors import Returns
from quantopian.research import run_pipeline
def demean_example():
    returns = Returns(window_length=30)
    demeaned_returns = returns.demean()
    # Our pipeline includes the original returns factor, the demeaned version,
    # and the difference between them.
    return Pipeline(
        columns={
            'vanilla': returns,
            'demeaned': demeaned_returns,
            'diff': returns - demeaned_returns,
        }
    )
# Since demean() just subtracts the daily mean from the 'vanilla' column, we expect
# the difference between 'vanilla' and 'demeaned' to be constant within each day.
results0 = run_pipeline(demean_example(), '2014', '2014-03-01')
results0.head()
# The value of that constant difference should be the daily mean.
# Note that 0.36717 on day one here matches up with the difference in the frame above.
results0['vanilla'].groupby(level=0).mean().head()
# A nice property of a de-meaned Factor is that it is centered about 0.
results0['demeaned'].groupby(level=0).mean().head()
zscore

zscore() is only slightly more complex than demean(). In addition to subtracting the daily mean from each output, it also divides by the daily standard deviation. If an asset has a Z-Score of 1 on a given day for some factor, then the value of that factor was one standard deviation above the daily mean. Similarly, an asset with a Z-Score of -1 was one standard deviation below the daily mean.
The API for zscore is identical to that of demean().
def zscore_example():
    returns = Returns(window_length=30)
    zscored_returns = returns.zscore()
    return Pipeline(
        columns={
            'vanilla': returns,
            'zscored': zscored_returns,
        },
        screen=returns.notnull(),
    )
results1 = run_pipeline(zscore_example(), '2014', '2014-03-01')
results1.head()
results1.describe()
Often we only want to consider some portion of the full tradeable universe when computing a normalization. It can be useful, for example, to exclude extreme outliers from mean and standard deviation computations. We often also want to ignore assets that are hard to trade, either because they're illiquid or because they're nonstandard share classes.
Both demean and zscore accept an optional mask argument, which can be passed a Filter. When a Filter is supplied as a mask, we treat all locations where the Filter produced False as though those locations had NaN values in the data being normalized. Since demean and zscore already know how to ignore NaNs, providing a mask has the effect of removing the masked values from our normalization calculation.
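The False-becomes-NaN semantics can be sketched in plain NumPy (a conceptual illustration with made-up values, not the Pipeline internals):

```python
import numpy as np

# Hypothetical factor outputs; 1000.0 is an extreme outlier.
values = np.array([1.0, 2.0, 3.0, 1000.0])
# Hypothetical Filter output: exclude the outlier.
passed = np.array([True, True, True, False])

# Locations where the Filter produced False are treated as NaN.
masked = np.where(passed, values, np.nan)

# NaN-aware statistics then exclude the masked values entirely.
mean_all = np.nanmean(values)     # distorted by the outlier
mean_masked = np.nanmean(masked)  # outlier ignored
print(mean_all, mean_masked)
```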
In this example, we compute the same 30-day returns factor. We then Z-Score those returns, ignoring non-primary share classes, stocks with low dollar volume, and stocks with very high or low returns.
from quantopian.pipeline.factors import AverageDollarVolume
from quantopian.pipeline.filters.morningstar import IsPrimaryShare
def masked_zscore_returns_example():
    returns = Returns(window_length=30)
    is_liquid = AverageDollarVolume(window_length=30).percentile_between(25, 100)
    is_primary = IsPrimaryShare()
    no_returns_outliers = returns.percentile_between(2, 98)
    base_universe = is_liquid & no_returns_outliers & is_primary
    masked_zscored = returns.zscore(mask=base_universe)
    return Pipeline(
        columns={'masked_zscored': masked_zscored, 'returns': returns},
        screen=masked_zscored.notnull(),
    )
results2 = run_pipeline(masked_zscore_returns_example(), '2014', '2014-03-01')
results2.head()
results2.describe()
import seaborn as sns
import matplotlib.pyplot as plt
def zscore_histogram(axis, series, ylabel=None):
    plot = sns.distplot(series, ax=axis, kde=False)
    plot.set_yscale('log')
    plot.grid(False)
    if ylabel:
        plot.set_ylabel(ylabel)
    return plot
fig, plots = plt.subplots(ncols=2, sharey=True)
zscore_histogram(plots[0], results1.zscored, ylabel="# of Assets in Z-Score Range")
zscore_histogram(plots[1], results2.masked_zscored)
sns.despine(fig=fig, top=True, right=True)
When Z-Scoring without masking, the vast majority of assets have Z-Scores near 0 (note that the above plots are log-scale), because a small number of outliers distort the distribution significantly.
After masking, our Z-Scores are much more uniformly distributed.
Another common scenario encountered when working with financial data is the need to transform Factor results based on some method of labelling assets. For example, when comparing assets based on some fundamental ratio, it might make more sense to compare each asset to other assets in the same industry instead of comparing against the full universe of assets. Or we might want to compare across companies of approximately the same size.
The second major feature that's been added in this release is a new core expression type: Classifier. Whereas Factors are expressions producing numerical values, and Filters are expressions producing boolean (True/False) values, Classifiers are expressions that produce labels, which can then be used as grouping keys for another expression.
Both demean() and zscore() accept an optional groupby parameter, which can be passed a Classifier. Providing a groupby causes row-normalizations to be applied on groups of assets that all received the same label from the grouping classifier.
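To make the semantics concrete, here is a rough pandas analogue of grouped z-scoring (an illustration with made-up data, not the actual Pipeline implementation):

```python
import pandas as pd

# Hypothetical one-day cross-section of factor values with sector labels.
data = pd.DataFrame({
    'sector': ['tech', 'tech', 'energy', 'energy'],
    'value':  [10.0, 20.0, 100.0, 300.0],
})

def zscore(group):
    """Z-score a group against its own mean and standard deviation."""
    return (group - group.mean()) / group.std()

# Each sector is normalized independently, so the very different scales
# of the two sectors no longer distort each other's scores.
data['grouped_z'] = data.groupby('sector')['value'].transform(zscore)
print(data)
```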
There are currently three ways to construct a Classifier:

The .latest attribute of any morningstar column of dtype int64 produces a Classifier. There are currently nine such columns:
morningstar.asset_classification.cannaics
morningstar.asset_classification.morningstar_economy_sphere_code
morningstar.asset_classification.morningstar_industry_code
morningstar.asset_classification.morningstar_industry_group_code
morningstar.asset_classification.morningstar_sector_code
morningstar.asset_classification.naics
morningstar.asset_classification.sic
morningstar.asset_classification.stock_type
morningstar.asset_classification.style_box
More information on each of these columns can be found in the Fundamentals API Reference.
There are two new directly-importable Classifier subclasses:

quantopian.pipeline.classifiers.morningstar.Sector
quantopian.pipeline.classifiers.morningstar.SuperSector

These built-in classifiers produce the same output as morningstar_sector_code.latest and morningstar_economy_sphere_code.latest, respectively. However, because we expect these to be the most commonly-used classifier columns, Sector and SuperSector provide a few special facilities beyond what's available from the generic .latest classifiers:
Sector and SuperSector have hand-written docstrings that are accessible via __doc__ or the ? magic. Sector and SuperSector also provide symbolic names for their labels as class-level attributes. For example, Sector.BASIC_MATERIALS is set to 101, the sector code used by Morningstar for companies in the materials space. SuperSector provides similar symbolic names; for example, SuperSector.CYCLICAL is set to 1.

There are several new Factor methods that produce classifiers by ranking and bucketing stocks based on quantiles of a Factor. The most general of these methods is Factor.quantiles(), which takes an integer indicating how many buckets to use. For example, if we wanted to group securities into small, medium, and large cap buckets, we could do MarketCap().quantiles(3).
Factor.quartiles(), Factor.quintiles(), and Factor.deciles() have also been added. These are simple convenience aliases for quantiles(4), quantiles(5), and quantiles(10), respectively.
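The bucketing behavior can be approximated with pandas.qcut (a rough analogue with made-up data; the quantile-family methods only differ in the bucket count):

```python
import pandas as pd

# Hypothetical market caps for six securities.
market_caps = pd.Series([1e6, 5e6, 2e7, 8e7, 3e8, 1e9])

# Bucket into three equal-sized groups labeled 0 through 2,
# analogous to quantiles(3).
tercile = pd.qcut(market_caps, 3, labels=False)
print(tercile.tolist())  # -> [0, 0, 1, 1, 2, 2]
```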
from quantopian.pipeline.data.morningstar import asset_classification, valuation
from quantopian.pipeline.classifiers.morningstar import Sector
# These produce the same data, but Sector has symbolic constants and hand-written docs:
sector_generic = asset_classification.morningstar_sector_code
sector_builtin = Sector()
print (
"Docs for built-in Sector class:\n" + sector_builtin.__doc__
)
print "Symbolic Constants:"
dir(sector_builtin)[1:12]
print "The Basic Materials sector code is %d." % Sector.BASIC_MATERIALS
print (
"Docs for generic sector (this is the docstring that's generated "
"for all columns):\n" + sector_generic.__doc__
)
sector_generic
In this example, we load the most recently-known earnings yield for each asset, and compare the effect of Z-Scoring over the whole universe vs Z-Scoring by sector group.
from quantopian.pipeline.data.morningstar import valuation_ratios
def grouped_earnings_yield_example():
    sector = Sector()
    earning_yield = valuation_ratios.earning_yield.latest
    zscored_naive = earning_yield.zscore()
    zscored_grouped = earning_yield.zscore(groupby=sector)
    return Pipeline(
        columns={
            'sector': sector,
            'yield': earning_yield,
            'yield_zscored': zscored_naive,
            'yield_zscored_grouped': zscored_grouped,
        },
        screen=zscored_grouped.notnull(),
    )
yields = run_pipeline(grouped_earnings_yield_example(), '2014', '2014-03')
yields.head()
yields.describe()
One thing that should look immediately suspicious about the output above is that our non-grouped Z-Scores are all very close to each other. This is usually an indication that large outliers are compressing our results by inflating the standard deviation of the data.
Plotting the magnitude of the min and max by sector quickly confirms that we do indeed have some large outliers:
# Slice off the first few dates for visualization
yields_initial = (
    yields['2014-01-02':'2014-01-06']
    .reset_index()
    .rename(columns=dict(level_0='date', level_1='asset'))
)
fig, (max_plot, min_plot) = plt.subplots(2, 1)
# Draw the maximum yield by sector on each date.
sns.barplot(
    x='date',
    y='yield',
    hue='sector',
    data=yields_initial,
    ci=None,
    estimator=np.max,
    log=True,
    ax=max_plot,
    palette="Set3",
)
max_plot.set_ylabel('Maximum Yield by Sector')
max_plot.set_xlabel('Date')
max_plot.set_ylim(1e-1, 1e2);
max_plot.legend(ncol=3, title='Sector Code')
# Draw the minimum yield by sector on each date.
plot = sns.barplot(
    x='date',
    y='yield',
    hue='sector',
    data=yields_initial,
    ci=None,
    estimator=lambda arr: abs(np.min(arr)),
    log=True,
    ax=min_plot,
    palette="Set3",
)
min_plot.set_ylabel('Minimum Yield (Magnitude) by Sector')
min_plot.set_xlabel('Date')
min_plot.set_ylim(1e-1, 1e7);
min_plot.legend(ncol=3, title='Sector Code');
Note that the high bars in the above are actually much more extreme than they appear, since they're plotted on a log scale!
We can better see the flattening effect of Z-Scoring with outliers, as well as the effect of normalizing by sector, by plotting the difference between the max and min values for each sector:
def plot_range_by_sector(data, ycol, axis, ylimits, log_scale):
    """
    Generate a bar chart of data[ycol], with a bar for each unique value in data['sector'].
    The height of each bar is the difference of the sector max minus the sector min each day.
    """
    plot = sns.barplot(
        x='date',
        y=ycol,
        hue='sector',
        data=data,
        ax=axis,
        ci=None,
        estimator=lambda row: abs(np.max(row) - np.min(row)),
        palette="Set3",
    )
    plot.set_ylim(*ylimits)
    if log_scale:
        plot.set_yscale('log')
    plot.set_ylabel("(Max - Min): " + ycol)
    plot.legend(ncol=3, title='Sector Code')
    return plot
fig, plots = plt.subplots(3, 1, figsize=(14, 20))
plot_range_by_sector(
    yields_initial,
    ycol='yield',
    axis=plots[0],
    ylimits=(1e-2, 1e6),
    log_scale=True,
)
plot_range_by_sector(
    yields_initial,
    ycol='yield_zscored',
    ylimits=(0, 80),
    axis=plots[1],
    log_scale=False,
)
plot_range_by_sector(
    yields_initial,
    ycol='yield_zscored_grouped',
    ylimits=(0, 80),
    axis=plots[2],
    log_scale=False,
    # The trailing ; suppresses the notebook's automatic repr output.
);
In the raw yields, we can see that there's an enormous outlier (note that the raw yield chart is log-scale) on day 1 in sector 308 (Communication Services). This outlier single-handedly inflates the standard deviation over the whole distribution enough to compress the min/max values in the ungrouped Z-Scores (the center plot) to almost zero.
In the sector-grouped Z-Scores (the bottom plot), the effects of the large outlier are contained within the sector, which allows the values in other sectors to better reflect the diversity of observed yields.
What the above plot doesn't show, however, is that the distribution of values within sectors with outliers is still compressed considerably. What we'd really like to do is apply our grouped normalization, while also removing extreme values from the distribution.
A crude filter picks out the worst offenders easily:
# Note in particular the (likely erroneous) value of -400797 on day 1 for HMTV. This is the big
# day 1 outlier in the charts above.
yields[(yields['yield'] < -10) | (yields['yield'] > 10)]
The mask and groupby parameters can be used together to perform grouped transformations while ignoring undesired values.
def masked_grouped_earnings_yield_example():
    sector = Sector()
    earning_yield = valuation_ratios.earning_yield.latest
    # We could also have done something like earning_yield.percentile_between(1, 99) here.
    without_outliers = (-10 < earning_yield) & (earning_yield < 10)
    zscored_masked = earning_yield.zscore(mask=without_outliers)
    zscored_masked_grouped = earning_yield.zscore(mask=without_outliers, groupby=sector)
    return Pipeline(
        columns={
            'sector': sector,
            'yield': earning_yield,
            'yield_zscored_masked': zscored_masked,
            'yield_zscored_masked_grouped': zscored_masked_grouped,
        },
        screen=zscored_masked_grouped.notnull(),
    )
masked_yields = run_pipeline(masked_grouped_earnings_yield_example(), '2014', '2014-03')
masked_yields.head()
# Note that the min/max on `yield` are now between -10 and 10.
masked_yields.describe()
The above examples all use built-in classifiers. It's also possible to create a Classifier from any existing Factor via the quantiles method. In this example, we create a classifier from market-cap deciles and use it to group a demeaned returns factor.
from quantopian.pipeline.data.morningstar import valuation
def quantiles_example():
    market_cap = valuation.market_cap.latest
    market_cap_decile = market_cap.deciles()  # Equivalent to market_cap.quantiles(10)
    returns = Returns(window_length=30)
    excess_returns = returns.demean(
        mask=returns.percentile_between(1, 99),
        groupby=market_cap_decile,  # Grouping by a computed classifier works as expected.
    )
    return Pipeline(
        columns={
            'market_cap': market_cap,
            'market_cap_decile': market_cap_decile,  # Classifiers can be set as output columns.
            'returns': returns,
            'excess_returns': excess_returns,
        },
        screen=excess_returns.notnull(),
    )
quantiles_results = run_pipeline(quantiles_example(), '2014', '2014-03')
quantiles_results.head()
With our quantiles classifier, we can confirm the well-known fact that small cap stocks tend to outperform large cap stocks:
returns_by_decile = quantiles_results.groupby(['market_cap_decile'])['returns'].mean()
plot = sns.barplot(
    x=returns_by_decile.index,
    y=returns_by_decile.values,
    color='red',
)
plot.set_title("30-Day Returns by Market Cap Decile", fontsize=16)
sns.despine(plot.figure)
plot.grid(False)
plot.set_xlabel("Market Cap Decile", fontsize=14);
fig, plot = plt.subplots(1, 1, figsize=(14, 8))
colors = sns.color_palette('RdBu_r', 10)
for decile, color in enumerate(colors):
    sns.distplot(
        quantiles_results['excess_returns'][quantiles_results['market_cap_decile'] == decile],
        hist=False,
        color=color,
        label=decile,
        ax=plot,
    )
plot.legend(title='Decile', fontsize=14)
plot.set_xlabel('Excess Return over Decile Mean', fontsize=14)
sns.despine(fig=fig)
plot.grid(False)
plot.set_yticklabels([]);
plot.set_title('Distribution of Returns by Market Cap Decile', fontsize=16);
Classifiers have been a planned extension to the Pipeline API since the early days of the design. I think they're one of the last missing pieces to supporting many truly sophisticated quant workflows, so I'm excited to finally be able to see what the community builds with these features.
In the coming weeks, we hope to ship a slew of additional small improvements to classifiers, including groupby support for Factor.rank(), and support for non-integer columns (especially strings) from the morningstar database.
For more info on working with Classifiers and normalizations, there's a new Help Docs section on Normalization, and new API Reference entries for Classifier, the new builtins, Factor.demean, and Factor.zscore.