`pearsonr`, `spearmanr`, and `linear_regression` have been added to Factors. These methods allow you to compute correlations or run regressions between the columns of two different terms, allowing for more generic operations than the `RollingPearsonOfReturns`, `RollingSpearmanOfReturns`, and `RollingLinearRegressionOfReturns` built-ins currently support. For more information on these methods, check out the docs: https://www.quantopian.com/help#quantopian_pipeline_factors_Factor_pearsonr

from pandas import DataFrame, date_range
from quantopian.pipeline import CustomFactor, Pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import AverageDollarVolume, Returns, SimpleMovingAverage
from quantopian.research import run_pipeline
# Simple slicing example -- first create a Returns factor, then extract the single column corresponding
# to AAPL. This creates a "Slice" object (called `returns_aapl` here) which is used as an input to a
# custom factor.
returns = Returns(window_length=30)
returns_aapl = returns[symbols(24)]
class UsesReturns(CustomFactor):
    window_length = 5
    inputs = [returns, returns_aapl]

    def compute(self, today, assets, out, returns, returns_aapl):
        # Print the shape of each input. Our AAPL slice `returns_aapl` should have a shape of (5, 1)
        # because our window length is 5, and by definition we only have 1 column.
        print 'Returns {0}:\n{1}'.format(returns.shape, returns)
        print '\n'
        print 'AAPL Returns Slice {0}:\n{1}'.format(returns_aapl.shape, returns_aapl)
pipe = Pipeline(columns={'uses_returns' : UsesReturns()})
run_pipeline(pipe, '2016-06-01', '2016-06-01');
# Slices are unaffected by masking -- when passing a mask to a custom factor, slice inputs are unaffected,
# even if the asset with which the slice is associated would be filtered out.
adv = AverageDollarVolume(window_length=30)
no_aapl = adv.percentile_between(90, 95)
pipe = Pipeline(columns={'uses_returns' : UsesReturns(mask=no_aapl)})
run_pipeline(pipe, '2016-06-01', '2016-06-01');
This is not true when masking the actual factor from which the Slice is taken. That is:

returns = Returns(window_length=30, mask=no_aapl)
returns_aapl = returns[symbols(24)]

will result in `returns_aapl` being all NaNs.
One might have noticed that `Returns` is being used as an input to a custom factor. Previously this was not allowed for any factors, but it is now allowed for a select few factors deemed safe for use as inputs. This includes `Returns` and any factors created from `rank` or `zscore`. The main reason these factors can be used as inputs is that they are comparable across splits: `Returns`, `rank`, and `zscore` produce normalized values, meaning that they can be meaningfully compared in any context.
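As a plain-NumPy sketch (not the Quantopian implementation), here is why normalized values like returns are "window safe" while a moving average is not: a split adjustment rescales the entire prior price history by a constant factor, which cancels out of a returns ratio but changes the level of an average. The numbers below are illustrative.

```python
import numpy as np

# Unadjusted prices, and the same history after a hypothetical 2-for-1 split
# adjustment (every pre-split price is halved).
prices = np.array([100.0, 102.0, 104.0, 103.0, 105.0])
adjusted = prices * 0.5

# Daily returns are unchanged: the adjustment factor cancels in the ratio.
rets = prices[1:] / prices[:-1] - 1
rets_adj = adjusted[1:] / adjusted[:-1] - 1
assert np.allclose(rets, rets_adj)

# A simple moving average is not invariant: its level halves with the prices.
sma = prices.mean()
sma_adj = adjusted.mean()
assert abs(sma_adj - sma / 2.0) < 1e-9
```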
Something like `SimpleMovingAverage` could not be used as an input because of the possibility that a lookback window contains a split. If `SimpleMovingAverage` were used as an input, its computed output would not be adjusted, unlike `BoundColumn` terms such as `USEquityPricing.close`, which are adjusted for splits. Inputs need to be comparable across splits because if, for example, you wanted to compute correlations using AAPL's simple moving average in June of 2014 (the month of its 7-for-1 split), all calculations overlapping with the split would be distorted and meaningless. Because of this, most factors still cannot be used as inputs, including any custom factors.
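To make the distortion concrete, here is a hypothetical unadjusted price series (plain NumPy, illustrative numbers only) with a 7-for-1-style split in the middle. Any SMA window that straddles the split mixes the two price scales and yields values that correspond to no real price:

```python
import numpy as np

# Hypothetical unadjusted price series: 30 days around 630, then a 7-for-1
# split drops the quoted price to around 90 (same economic value).
prices = np.concatenate([np.full(30, 630.0), np.full(30, 90.0)])

# A 10-day SMA on the unadjusted series: windows straddling the split mix
# the two price scales, producing distorted in-between levels.
window = 10
sma = np.convolve(prices, np.ones(window) / window, mode='valid')

mixed = sma[(sma > 90.0) & (sma < 630.0)]  # the distorted values
```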
# Here is AAPL's 30-day SMA in June, 2014. One can imagine that attempting to compute the correlation
# between this timeseries and another would produce nonsensical results.
sma = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30)
pipe = Pipeline(columns={'sma' : sma})
results = run_pipeline(pipe, '2014-05-20', '2014-06-25')
results.sma.unstack()[symbols(24)]
# Attempting to add a SimpleMovingAverage factor as an input will fail with a `NonWindowSafeInput` error.
class UsesInvalidInput(CustomFactor):
    window_length = 5
    inputs = [sma]

    def compute(self, today, assets, out, sma):
        pass
# This will fail.
UsesInvalidInput()
Slices inherit this "window safe" property, so only Slices of `Returns`, `rank`, and `zscore` can currently be used as inputs. This means that:

sma_slice = sma[symbols(24)]
UsesInvalidInput(inputs=[sma_slice])

will also fail.
# Furthermore, Slices cannot be added to a pipeline. Attempts will fail with an
# `UnsupportedPipelineOutput` error.
sma_slice = sma[symbols(24)]
pipe = Pipeline(columns={'aapl_sma' : sma_slice}) # This will fail.
The output of Slices would not fit the multi-index format of a normal pipeline output. A Slice outputs a single value per day, corresponding to the asset with which it is associated, but the current infrastructure requires a value for every asset on every day. While a potential solution would be to fill all other assets with missing values (e.g. NaN), this would detract from the benefits and ease of use of Slices.
Macroeconomic datasets suffer a similar dilemma in that they output a single value per day which is not associated with any particular asset (see here for details: https://www.quantopian.com/posts/upcoming-changes-to-quandl-datasets-in-pipeline-vix-vxv-etc-dot). Ideally we would like to add support for single-value-output terms, such as Slices and VIX, but as of now there is unfortunately no good way to output these terms as pipeline columns.
# This is how a Slice of Returns of an arbitrary asset might look if they could be added as a pipeline
# column. Instead of a multi-index DataFrame with dates and assets, we only need an index of dates
# with a single value per day.
DataFrame(['some value'] * 10, columns=['Returns'], index=date_range('2016-06-01','2016-06-10'))
At the moment, there are quite a few restrictions on how Slices can be used. However, one highly valuable benefit of Slices is their use in computing correlations and regressions between factors. If a Slice is window safe (input safe), we can easily compute the correlation between it and the columns of another factor. For this purpose we have introduced three new Factor methods: `pearsonr`, `spearmanr`, and `linear_regression`.

The `pearsonr` method takes a `target`, a `correlation_length`, and an optional `mask`. The `target` parameter can be another factor (i.e. an ordinary 2D factor), a Slice of another factor, or a `BoundColumn` term. In any case, both the factor calling `pearsonr` and the `target` parameter must be "input safe" (which, again, for factors is currently limited to `Returns`, `zscore`, and `rank`). However, `target` may also be a 1D dataset (including VIX and other macroeconomic indicators) or an ordinary 2D dataset such as sentiment values.

The `spearmanr` method takes the same arguments as `pearsonr`, and the `linear_regression` method also takes the same arguments, except that it uses `regression_length` instead of `correlation_length`. These new methods are designed to be more flexible than the aforementioned built-ins (`RollingPearsonOfReturns`, `RollingSpearmanOfReturns`, and `RollingLinearRegressionOfReturns`), which were released a few months ago.
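Conceptually, each of these methods slides a trailing lookback window down the columns of the base factor and correlates each column with the target series, producing one coefficient per asset per day. A rough NumPy sketch of a single day's computation (random synthetic data and illustrative names, not the Quantopian implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 days of returns for 4 assets, plus a "target" series built as a noisy
# combination of them (all synthetic, for illustration only).
returns = rng.normal(0.0, 0.01, size=(100, 4))
target = returns @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(0.0, 0.001, size=100)

# One trailing lookback window, as `correlation_length=50` would use.
corr_len = 50
window_rets = returns[-corr_len:]
window_tgt = target[-corr_len:]

# One coefficient per asset: correlate each column against the target series.
corrs = np.array([
    np.corrcoef(window_rets[:, i], window_tgt)[0, 1]
    for i in range(window_rets.shape[1])
])
```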
# `Factor.pearsonr` takes a target term, in this case `returns_aapl`, and uses it to compute rolling
# pearson correlation coefficients with the columns of another factor, in this case `returns`. This
# example computes the correlation between each stock and SPY (the market) over 100-day look backs of
# Returns. It is recommended that a mask be used when calling this method, as computations over every
# asset are expensive.
# NOTE: This is equivalent to doing:
# returns_corr = RollingPearsonOfReturns(
# target=symbols(8554),
# returns_length=30,
# correlation_length=100,
# mask=adv.top(500),
# )
returns = Returns(window_length=30)
returns_spy = returns[symbols(8554)] # Creates `Slice` object of SPY's returns.
returns_corr = returns.pearsonr(
target=returns_spy,
correlation_length=100,
mask=adv.top(500),
)
pipe = Pipeline(columns={'returns_corr' : returns_corr})
results = run_pipeline(pipe, '2016-06-01', '2016-06-15')
results.returns_corr.unstack().dropna(axis=1)
# Similarly, we can compute linear regressions of the returns of every asset against the returns of a
# single asset. Notice that this method returns a multi-output factor, so we have to access each output
# as its own individual factor.
# NOTE: This is equivalent to doing:
# returns_regr = RollingLinearRegressionOfReturns(
# target=symbols(8554),
# returns_length=30,
# regression_length=100,
# mask=adv.top(500),
# )
returns_regr = returns.linear_regression(
target=returns_spy,
regression_length=100,
mask=adv.top(500),
)
alpha = returns_regr.alpha
beta = returns_regr.beta
corr = returns_regr.r_value
pipe = Pipeline(columns={'alpha' : alpha, 'beta': beta, 'correlation': corr})
results = run_pipeline(pipe, '2016-06-01', '2016-06-15')
# Print the results of the `beta` factor, dropping any assets with NaNs (assets that were masked out on
# each date are filled with NaNs).
results.beta.unstack().dropna(axis=1)
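For a single asset, the `alpha`, `beta`, and `r_value` outputs correspond to an ordinary least-squares regression of the asset's returns on the target's returns over the lookback window. A minimal NumPy sketch with synthetic data (the true beta of 1.5 and the noise levels are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: market returns, and one asset whose returns have a true
# beta of 1.5 and a small positive alpha.
market = rng.normal(0.0005, 0.01, size=100)
asset = 0.0002 + 1.5 * market + rng.normal(0.0, 0.002, size=100)

# Ordinary least squares of the asset's returns on the market's returns,
# i.e. what the regression produces for a single row of the output.
beta, alpha = np.polyfit(market, asset, 1)
r_value = np.corrcoef(market, asset)[0, 1]
```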
# Here is another example, this time using two different terms. This is computing, on each day, the
# spearman rank correlation between each stock's 30 day returns over the previous 100 days with the
# previous 100 days of VIX. Note that VIX behaves like a Slice in that each lookback window of VIX is
# a single column of data.
from quantopian.pipeline.data.quandl import yahoo_index_vix as vix
returns = Returns(window_length=30)
returns_vix_corr = returns.spearmanr(
target=vix.close,
correlation_length=100,
mask=adv.top(500),
)
pipe = Pipeline(columns={'returns_vix_correlation' : returns_vix_corr})
results = run_pipeline(pipe, '2016-06-01', '2016-06-15')
results.returns_vix_correlation.unstack().dropna(axis=1)
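Spearman rank correlation is the Pearson correlation of the ranked series, which makes it robust to monotone but non-linear relationships like the one simulated below. This is synthetic data, and the `rankdata` helper here is a simplified illustration that ignores ties (a real implementation, such as scipy.stats.rankdata, averages tied ranks):

```python
import numpy as np

def rankdata(x):
    # Simplified ranking helper: rank 1 = smallest value (no tie handling).
    ranks = np.empty(len(x))
    ranks[np.argsort(x)] = np.arange(1, len(x) + 1)
    return ranks

rng = np.random.default_rng(2)
a = rng.normal(size=100)
b = -a ** 3 + rng.normal(0.0, 0.1, size=100)  # noisy, monotone-decreasing relation

# Spearman correlation = Pearson correlation of the ranks, so it captures
# the strong negative monotone association despite the non-linearity.
rho = np.corrcoef(rankdata(a), rankdata(b))[0, 1]
```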
# Finally, here is an example of passing another factor, instead of a Slice, as the target to the
# `pearsonr` method. In this case, correlations are computed asset-wise. That is, if our base factor
# is `Returns` and our target term is sentiment data, then for each asset we are calculating the
# correlation between that asset's returns over the past `correlation_length` days and that asset's
# sentiment data over the past `correlation_length` days.
from quantopian.pipeline.data.sentdex import sentiment
returns = Returns(window_length=30)
returns_sent_corr = returns.pearsonr(
target=sentiment.sentiment_signal,
correlation_length=100,
mask=adv.top(500),
)
pipe = Pipeline(columns={'returns_sentiment_correlation' : returns_sent_corr})
results = run_pipeline(pipe, '2016-06-01', '2016-06-15')
results.returns_sentiment_correlation.unstack().dropna(axis=1)
In the future we would like to consider supporting a number of additions, including: