talib's function signatures are a bit awkward to use in Pipeline right now. There are three major issues with them that I'm aware of:
TA-Lib functions expect to be passed 1D arrays of data on which to perform rolling computations. Its authors expect a usage pattern like this:
prices = get_pricing('MSFT', fields=['high', 'low', 'close_price'], start_date='2014', end_date='2014-03')
prices.head()
from talib import WILLR
msft_willr = WILLR(prices.high.values, prices.low.values, prices.close_price.values, timeperiod=14)
msft_willr
The result is a 1-D array with 13 leading NaNs.
from matplotlib.pyplot import subplots
# Create a figure with a 2 x 1 stack of subplots.
figure, (top, bottom) = subplots(2, 1, sharex=True)
# Write our computed WILLR values into the top plot.
top.plot(prices.index, msft_willr, color='purple')
top.set_ylabel("Williams' %R")
top.set_title('MSFT Momentum')
# Tell pandas to write our DataFrame values into the bottom plot.
prices.plot(ax=bottom).set_ylabel('US Dollars')
If we want to use this in Pipeline, we have to jump through a few hoops: for each asset, we have to call the TA-Lib function on a full window of length timeperiod and then extract just the last entry. (I haven't looked much into the underlying TA-Lib implementations to see whether this is a significant performance hit. At the very least, we're allocating a larger output buffer than we need.)
from numpy import nan, isnan
from quantopian.pipeline import Pipeline, CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing as USEP
from quantopian.research import run_pipeline

def columnwise_anynan(array2d):
    # isnan will be broadcast over the array to produce a 2D array of bools.
    # array.any(axis=0) gives us a 1D array whose length is equal to the
    # number of columns in the array.
    return isnan(array2d).any(axis=0)

class WILLRFactor(CustomFactor):
    inputs = [USEP.high, USEP.low, USEP.close]
    window_length = 14

    def compute(self, today, assets, out, high, low, close):
        """
        Compute WILLR on each column of high, low, and close.
        """
        # Assume that a NaN in high implies a NaN in low or close.
        # If we had datasets from different sources, we'd probably
        # want to do something like:
        # columnwise_anynan(high) | columnwise_anynan(low) | columnwise_anynan(close).
        anynan = columnwise_anynan(high)
        # In general, it's bad practice to iterate over numpy arrays like this
        # in pure Python. Unfortunately, TA-Lib doesn't provide us with an API
        # to vectorize operations over 2D arrays, so we're stuck with doing this.
        # A nice improvement to Zipline would be to provide a module that does
        # this efficiently in Cython.
        for col_ix, have_nans in enumerate(anynan):
            # If we have NaNs in the input (e.g., because an asset didn't trade
            # for a full day, or because the asset hasn't existed for 14 days),
            # just forward the NaN.
            if have_nans:
                out[col_ix] = nan
                continue
            # Compute our actual WILLR value.
            # The [:, col_ix] syntax here tells NumPy to slice along the second
            # dimension. Just doing array[col_ix] would give us a row instead
            # of a column.
            results = WILLR(
                high[:, col_ix],
                low[:, col_ix],
                close[:, col_ix],
                timeperiod=self.window_length,
            )
            # results is a length-14 array containing 13 leading NaNs and then
            # the actual value we care about. Needless to say, this is less
            # efficient than it could be.
            out[col_ix] = results[-1]

willr = WILLRFactor()
p = Pipeline(
    columns={
        'willr': willr,
        'latest_close': USEP.close.latest,
        'latest_high': USEP.high.latest,
        'latest_low': USEP.low.latest,
    },
    screen=willr.notnan(),
)
result = run_pipeline(p, '2014', '2014-03')
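As a standalone illustration of the NaN-screening step above, here's a minimal sketch (using a made-up 2 x 3 array in place of a real pricing window) of how columnwise_anynan flags columns:

```python
import numpy as np

def columnwise_anynan(array2d):
    # np.isnan broadcasts over the whole array; .any(axis=0) collapses
    # each column to a single bool: True if the column contains any NaN.
    return np.isnan(array2d).any(axis=0)

# Made-up 2 x 3 window: the middle column has a missing value.
window = np.array([[1.0, np.nan, 3.0],
                   [4.0, 5.0,    6.0]])
print(columnwise_anynan(window))  # -> [False  True False]
```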
run_pipeline gives us a hierarchically-indexed DataFrame:
result
Note: The values output here will be shifted forward one day from the values produced via the get_pricing
method. This is because the values in Pipeline are date-labelled based on the best-known value as of the morning of the date. Thus, on day N, the best known open/high/close values are the values for day N - 1.
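To make that one-day shift concrete, here's a small sketch (with made-up values, using plain pandas rather than Quantopian's API) of the alignment Pipeline applies:

```python
import pandas as pd

# Made-up daily closes, labelled the way get_pricing would label them.
closes = pd.Series([40.0, 41.0, 42.0],
                   index=pd.date_range('2014-01-02', periods=3))

# Pipeline labels day N with the best-known value as of that morning,
# i.e. day N - 1's value, so shifting forward by one day reproduces
# the alignment of Pipeline output.
aligned = closes.shift(1)
print(aligned)
```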
Note: The values produced here are still off from what's produced by the alternative method. In most cases, the difference is small, but in some cases it's as much as 20%. I think this is happening because the formula for Williams %R is
(Highest High - Close)/(Highest High - Lowest Low) * -100
In the case that the numerator and the denominator are both small, this becomes very sensitive to small differences in floating-point rounding behavior. (Though even accounting for that, the differences seen below seem greater than I'd expect.)
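To see why a tight trading range amplifies rounding differences, here's a quick sketch (with made-up prices; willr_last is a hypothetical helper, not TA-Lib's implementation):

```python
def willr_last(highs, lows, close):
    # Williams %R: (Highest High - Close) / (Highest High - Lowest Low) * -100
    hh = max(highs)
    ll = min(lows)
    return (hh - close) / (hh - ll) * -100

# Made-up prices trading in a 3-cent range: numerator and denominator
# are both tiny, so a sub-cent perturbation of the close moves the
# result by several percentage points.
print(willr_last([10.02, 10.03], [10.00, 10.01], 10.02))
print(willr_last([10.02, 10.03], [10.00, 10.01], 10.021))
```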
MSFT = symbols('MSFT')
msft_result = result.xs(MSFT, level=1)
msft_result