By Nathan Wolfe
This research is published in partnership with Quantpedia, an online resource for discovering new trading ideas.
You can view the full Quantpedia series in the library along with other research and strategies.
Whitepaper authors: Eric C. So (ESo@mit.edu), Sean Wang
Whitepaper source: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2275982
From Quantpedia:
> In general, reversal in price of an asset occurs due to investors' overreaction to asset-related news and the subsequent price correction. In this case, the most probable reason for the phenomenon, according to the authors, is the market makers‘ aversion to inventory risks that tend to increase dramatically in the pre-announcement period. Consequently, the market makers demand higher compensation for providing liquidity due to higher risk and therefore raise prices, which are expected to reverse after the earnings announcement.
As the paper does, I find evidence of returns reversal during earnings announcements; while the paper tested using data from 1996 to 2011, I used data from 2007 to 2016. The average reversal among all stocks in my data is 0.449%, compared to a result of 1.448% in the paper. I found that we can reasonably increase the reversal to 0.6% by selecting firms based on a minimum average dollar volume percentile, or based on a minimum market cap.
>Several market frictions have the potential to significantly impact the efficiency and information content of market prices. This study focuses on the friction that arises from the need to locate a counterparty in order to complete a trade. Market makers typically mitigate this friction by matching would-be sellers with would-be buyers. When there is an imbalance between the quantities sought by buyers and sellers at a given price, market makers may absorb the order imbalance into their own account by serving as the trade counterparty. This practice is commonly known as liquidity provision.
Upon anticipation of a good (bad) earnings report, investors want to go long (short) in the stock; they often don't consider whether the stock is currently over- or under-priced, simply assuming that it will get a positive (negative) bump upon the release of the earnings report. Meanwhile, market makers see an imbalance in demand for long (short) positions; they thus raise (lower) the price to compensate for providing the liquidity necessary to fill the imbalance. Investors usually don't care about this price change, assuming the bump will come from the report.
After the earnings announcement, the demand imbalance should disappear, and with it the market makers' need for the price adjustment. Thus the market makers reverse the recent price change which causes a short-term reversal.
Our strategy will be to assume that new information provided by the release of the earnings report will be neutral on average. All that remains is to take advantage of the market makers' adjustment of the stock price, by taking an opposite position and waiting for them to reverse the change.
>We show that a long (short) position in firms whose returns strongly underperform (outperform) the market in the three days prior to earnings announcements yields an average return of 145 basis points (bps) during the announcement window. By comparison, the average return to a comparable portfolio during non-announcement periods is 22 bps, indicating that return reversals increase more than six-fold during earnings announcements.
While I didn't compare to non-announcement conditions, I did find clear low/high reversals for the best/worst performing securities prior to earnings. I also examined how various universe selection conditions would affect the returns of this strategy; while the authors selected based on market cap, I explored using average dollar volume as well as our Q500US and Q1500US liquid stock universes finding that a minimum average dollar volume filter and Q500US universe both performed well as universe selectors.
You can navigate the rest of this notebook as follows:
# Basic Variables
END = pd.Timestamp.now('US/Eastern').replace(tzinfo=utc) - pd.Timedelta('7d')
# For the free set, comment the above line and uncomment the below:
#END = pd.Timestamp.now('US/Eastern').replace(tzinfo=utc) - pd.Timedelta('731d')
START = pd.Timestamp('2007-01-01', tz=utc)
BENCHMARK = symbols('SPY')
RETURNS_QUANTILES = 5
PRICE_WINDOW = 8
MAX_DAYS_TO_HOLD = 20
PRICE_DAY_OFFSETS = range(-1, MAX_DAYS_TO_HOLD + 1)
# plot colors
color = '#0062AE'
The methodology behind the study is based on the idea that a stock's short-term returns will reverse during an earnings announcement as an inbalance in demand dissipates. The data used in this Research Notebook is sourced from EventVestor's Earnings Calendar Dataset. The sample version is available from 2007 up till 2 years ago while the premium version is available up till present day.
Below, a Pipeline pulls all earnings announcements of which prior knowledge of the date existed, along with data on the day before each about the company's average dollar volume, market cap, sector, and short-term returns.
I eliminate firms with prices below $5 to mitigate the influence of bid-ask bounce on our calculation of return reversals as noted in the paper. Our data starts on 2007-01-01 and spans up through the present day (2 years ago for the free set).
Definition: "PAR" stands for "Pre-Announcement Returns," referring to the stocks' short-term returns prior to their earnings announcements.
# Basic universe filters
tradable_filter = IsPrimaryShare() & mstar.valuation.market_cap.latest.notnull()
days_filter = BusinessDaysUntilNextEarnings().eq(1)
price_filter = USEquityPricing.close.latest > 5
# Factors for returns and liquidity
adv = AverageDollarVolume(mask=USEquityPricing.volume.latest > 0, window_length=30)
vlr = variable_lookback_returns(10, mask=adv.notnan())
market_cap = mstar.valuation.market_cap.latest
# Pipeline itself to calculate PAR
pipe = Pipeline(columns={'adv_percentile': adv.quantiles(100),
'sector': mstar.asset_classification.morningstar_sector_code.latest,
'q500us': Q500US(),
'q1500us': Q1500US(),
'market_cap': market_cap,
'market_cap_percentile': market_cap.quantiles(100),
# -3 to -2
'par_2': vlr.par_2.quantiles(RETURNS_QUANTILES),
# Used by the authors in original study: -4 to -2
'par_3': vlr.par_3.quantiles(RETURNS_QUANTILES),
# -5 to - 2
'par_4': vlr.par_4.quantiles(RETURNS_QUANTILES),
# -6 to -2
'par_5': vlr.par_5.quantiles(RETURNS_QUANTILES),
# -7 to -2
'par_6': vlr.par_6.quantiles(RETURNS_QUANTILES),
# -8 to -2
'par_7': vlr.par_7.quantiles(RETURNS_QUANTILES),
# -9 to -2
'par_8': vlr.par_8.quantiles(RETURNS_QUANTILES),
# -10 to -2
'par_9': vlr.par_9.quantiles(RETURNS_QUANTILES),
# -11 to -2
'par_10': vlr.par_10.quantiles(RETURNS_QUANTILES),
},
screen=days_filter & tradable_filter & price_filter & (vlr.par_10.notnan()))
data = split_run_pipeline(pipe, START, END, 16)
# Get before/ after pricing data for all announcement events using `get_pricing`.
cal = tradingcalendar.get_trading_days(START - pd.Timedelta('20d'),
END + pd.Timedelta(days=MAX_DAYS_TO_HOLD * 2))
price_data = data.groupby(level=0).apply(fill_prices)
price_data_matched = price_data.reindex(data.index)
print 'Done fetching data.'
print "This sample consists of %s earnings announcements spanning %s to %s." % (len(data),
START.year, END.year)
> The extreme quintiles of PAR (i.e., quintiles Q5 and Q1) consist of firms that are generally smaller, possess lower book-to-market ratios and share prices, and have higher volatility and relative spreads.
Similar to the original whitepaper, I find that the lowest and highest quantiles are generally lower in market cap.
raw_data = data.reset_index()
raw_data.groupby("par_3").mean()[['market_cap', 'adv_percentile', 'market_cap_percentile']].iloc[1:]
This section documents the reversal effect during earnings by examining the spreads between the worst and best performing stocks by PAR returns quantile (0 vs 4 respectively). I find a final spread of 0.449% compared to the authors' final value of 1.448%.
The expected returns will be the first quantile minus the returns on the last quantile. This difference I will call the "spread," and will use "spread" and "returns" interchangeably.
Below is a comparison average of t-1 to t+1 returns when all equities are divided into quantiles based on their returns in the 3 days prior, as suggested by the paper. It looks like we do have some spread between the first and last quantiles.
baskets = np.zeros(RETURNS_QUANTILES)
for i in range(RETURNS_QUANTILES):
baskets[i] = returns(2, data['par_3'] == i).mean()
plt.bar(range(RETURNS_QUANTILES), baskets, color=color, alpha=.8)
plt.xlabel('t-4 to t-2 returns quantile')
plt.ylabel('average t-1 to t+1 returns')
plt.title('Average returns during earnings announcement by PAR quantile')
print 'Spread: %f' % calc_spread(3, 2)
The strategy that the paper suggests is to go long in the first quantile and short in the last quantile for a market-neutral strategy. Given the spread here, it looks like that approach may be fruitful. Compare our spread of 0.449% with the paper's final value of 1.448%. Perhaps the alpha has decayed since 2011, but there are still optimizations which might improve our returns.
The main point here is that the firms with the highest short-term returns prior to the announcement have the lowest short-term returns following their announcement, and vice-versa. This is illustrated by the following plot; note how the returns curves seem to bounce backward from their trend from right before the announcement.
means = pd.DataFrame(columns=range(RETURNS_QUANTILES), dtype=float)
for q in means.columns:
subset = price_data_matched.loc[data['par_3'] == q]
means[q] = (subset.transpose() / subset[0]).loc[-1:].mean(axis=1)
means.plot()
plt.xlabel('days since t-1')
plt.ylabel('mean price, normalized at t-1')
plt.title('Average cumulative returns over time by PAR quantile');
Up till now, my research supports that there is evidence of reversals in the best and worst performing stocks prior to an earnings announcement. This section is dedicated to examining the consistency of these results through a number of dimensions: Liquidity, Sector, and Liquidity + Sector.
years = range(START.year, END.year + 1)
baskets = pd.Series(index=years)
for y in years:
baskets[y] = calc_spread(5, 2, data.index.get_level_values(0).year == y)
plt.bar(years, baskets, alpha=.9, color=color)
plt.xlabel('year')
plt.ylabel('average spread')
plt.title('Returns by year');
The consistency is variable as you'd expect from an events based strategy. Additionally, as noted in the main study section, there appears to be some alpha decay in the returns over time. Consistency is quantified by dividing the mean spread (returns) by the standard deviation of the yearly spreads as above which gives a primitive Shapre ratio.
years = range(START.year, END.year + 1)
adv_sharpes = pd.Series(index=range(100))
for adv in adv_sharpes.index:
baskets = pd.Series(index=years)
for y in years:
try:
baskets[y] = calc_spread(5, 2, (data['adv_percentile'] >= adv)
& (data.index.get_level_values(0).year == y))
except KeyError:
baskets[y] = np.nan
adv_sharpes[adv] = calc_spread(5, 2, data['adv_percentile'] >= adv) / baskets.std()
adv_sharpes.head()
Plotting the values below, it seems that low ADV floors give us higher consistency (possibly because more events are available), but consistent returns also increase for higher ADV floors because of the strategy's better performance for higher ADV values.
plt.bar(range(100), adv_sharpes, color=color)
plt.xlabel('minimum ADV percentile')
plt.ylabel('"Sharpe" ratio')
plt.title('Consistent returns by ADV percentile floor');
# Event Counts by Sector
data['sector'].value_counts()
"-1" for the sector means uncategorized. Since there are so few such events, we'll exclude them from this analysis.
SECTORS = [101, 102, 103, 104, 205, 206, 207, 308, 309, 310, 311]
SECTOR_NAMES = pd.Series({
101: 'Basic Materials',
102: 'Consumer Cyclical',
103: 'Financial Services',
104: 'Real Estate',
205: 'Consumer Defensive',
206: 'Healthcare',
207: 'Utilities',
308: 'Communication Services',
309: 'Energy',
310: 'Industrials',
311: 'Technology' ,
})
baskets = pd.Series(index=range(len(SECTORS)))
for i in range(len(SECTORS)):
baskets[i] = calc_spread(5, 3, data['sector'] == SECTORS[i])
plt.bar(baskets.index, baskets, color=color)
plt.xticks(range(len(SECTORS)), SECTOR_NAMES, rotation=25)
plt.title('Average returns by sector');
The strategy doesn't look too reliant on any particular sector. Communications Services is a little high, but that sector also has relatively few events.
Here I make a final robustness test by examining how the strategy does by sector and also by liquidity filter. I use both ADV percentile thresholds and the Q1500US and Q500US liquid universes.
liquidity_names = ['all', 'adv_50', 'adv_70', 'adv_90', 'q1500us', 'q500us']
liquidity_locs = {
'all': True,
'adv_50': data['adv_percentile'] >= 50,
'adv_70': data['adv_percentile'] >= 70,
'adv_90': data['adv_percentile'] >= 90,
'q1500us': data['q1500us'],
'q500us': data['q500us'],
}
spread_matrix = pd.DataFrame(columns=liquidity_names, index=SECTOR_NAMES, dtype=float)
freq_matrix = pd.DataFrame(columns=liquidity_names, index=SECTOR_NAMES, dtype=int)
for liquidity_filter in liquidity_names:
for sector in SECTORS:
loc = liquidity_locs[liquidity_filter] & (data['sector'] == sector)
spread_matrix.loc[SECTOR_NAMES[sector], liquidity_filter] = calc_spread(5, 3, loc)
freq_matrix.loc[SECTOR_NAMES[sector], liquidity_filter] = len(data[loc])
freq_matrix.loc['all'] = freq_matrix.sum()
spread_matrix.loc['all'] = (spread_matrix * freq_matrix).sum() / freq_matrix.loc['all']
ax = sns.heatmap(spread_matrix, annot=True)
ax.set(xlabel='liquidity filter', ylabel='sector', title='Returns by sector and liquidity filter')
print 'Event frequency:'
freq_matrix
The liquidity filters that perform best are ADV percentile >= 90 and Q500US, and they look fairly consistent across sectors. Sector 308 (communications) is very profitable on average, but it also has relatively few events, so that may be pure chance.
With results from both the empirical and robustness tests suggesting a possibility for a viable trading strategy, I take a look at a few of the parameters and attempt to roughly optimize for returns. It's important to note here that the following optimizations have a risk of overfitting the strategy parameters.
The core framework of the strategy is provided by Quantpedia:
>The investment universe consists of stocks listed at NYSE, AMEX, and NASDAQ, whose daily price data are available at CRSP database. Earnings-announcement dates are collected from Compustat. Firstly, the investor sorts stocks into quintiles based on firm size. Then he further sorts the stocks in the top quintile (the biggest) into quintiles based on their average returns in the 3-day window between t-4 and t-2, where t is the day of earnings announcement. The investor goes long on the bottom quintile (past losers) and short on the top quintile (past winners) and holds the stocks during the 3-day window between t-1, t, and t+1. Stocks in the portfolios are weighted equally.
The original paper uses a lookback window of 3 days (t-4 to t-2) and holding for 2 days (t-1 to t+1), but I explore a few other parameters as possible optimizations.
ax = sns.heatmap(create_spread_matrix(), annot=True)
ax.set(title='Cumulative returns by PAR lookback and days since t-1',
xlabel='PAR lookback',
ylabel='days since t-1');
The above heatmap suggests that a PAR lookback window of 5 days is strong compared to other values. For the rest of this section, I'll stick to using the PAR lookback of 5 for the best results.
The heatmap above also suggests there may be some returns to be squeezed out by holding for longer periods of time, much longer than 2 days as the paper recommends. The below heatmap shows how marginal returns look for each day when holding for a long period.
ax = sns.heatmap(create_spread_matrix(lambda n: n - 1), annot=True)
ax.set(title='Marginal returns by PAR lookback and days since t-1',
xlabel='PAR lookback',
ylabel='days since t-1');
Marginal returns are clearly strong for the first 2 days that the stocks are held; marginal returns afterward are small but positive on average. This suggests that a good strategy should prioritize holding stocks for the crucial first 2 days, but could hold on longer if it has excess cash for mild returns.
The paper specifies that for the strategy to perform better, I should use only those firms with a high market cap. However, Average Dollar Volume is also a commonly used proxy for liquidity and the choice of either market cap or ADV constraints can have significantly different results for a trading strategy.
baskets = np.zeros(100)
for i in range(100):
baskets[i] = calc_spread(5, 2, data['market_cap_percentile'] >= i)
plt.bar(range(100), baskets, color=color)
plt.xlabel('minimum market cap percentile')
plt.ylabel('t-1 to t+1 returns spread')
plt.title('Returns increase as market cap floor increases')
plt.ylim([0, 0.012]);
It seems that the paper was accurate in that this strategy works significantly better with high market cap stocks.
But another important factor in the feasibility of the strategy is trading in liquid stocks, and average dollar volume can be used as a proxy to firm size. The same plot above is shown below by ADV percentile rather than market cap.
baskets = np.zeros(100)
for i in range(100):
baskets[i] = calc_spread(5, 2, data['adv_percentile'] >= i)
plt.bar(range(100), baskets, color=color)
plt.xlabel('minimum ADV percentile')
plt.ylabel('t-1 to t+1 returns spread')
plt.title('Returns increase as ADV floor increases')
plt.ylim([0, 0.012]);
As you can see, our spread (and thus the long-short strategy's returns) increase as we increase the minimum ADV. But obviously this decreases the number of events we have to work with, so there's a trade-off.
For the rest of our study we'll set a minimum ADV threshold of the 95th percentile to improve our results. Let's try charting the earnings announcement reversal with our optimized PAR lookback of 5 days and this ADV filter.
means = pd.DataFrame(columns=range(RETURNS_QUANTILES), dtype=float)
for q in means.columns:
subset = price_data_matched.loc[(data['par_5'] == q) & (data['adv_percentile'] >= 95)]
means[q] = (subset.transpose() / subset[0]).loc[-1:].mean(axis=1)
means.plot()
plt.xlabel('days since t-1')
plt.ylabel('mean price, normalized at t-1')
plt.title('Average cumulative returns over time by PAR quantile, ADV percentile >= 95');
The above spread looks a lot more impressive with our improved parameters. Let's do a couple quick event studies on the events that meet our criteria: firms with an ADV percentile of 95 or greater, and the first or last quantile in 5-day PAR.
earnings_announcements = data[data['adv_percentile'] >= 95].reset_index().rename(
columns={'level_0': 'date', "level_1": "sid"}
)
q0 = earnings_announcements[earnings_announcements['par_5'] == 0]
results = custom_event_study(q0, date_column='date',
start_date=START,
end_date=END,
benchmark=BENCHMARK, days_before=1, days_after=10, top_liquid=500,
use_liquid_stocks=False, plot_error=False, price_data=price_data)
q4 = earnings_announcements[earnings_announcements['par_5'] == RETURNS_QUANTILES - 1]
results = custom_event_study(q4, date_column='date',
start_date=START,
end_date=END,
benchmark=BENCHMARK, days_before=1, days_after=10, top_liquid=500,
use_liquid_stocks=False, plot_error=False, price_data=price_data)
Compared to the event study in the empirical tests section, the magnitude of price reversals are much more clear here as well as the spreads between the best and worst performing PAR quantiles.
Up till now, the research has given enough confidence that there is possibility for a viable trading strategy. You can find those results in the backtest.
So and Wang's paper identifies an interesting possibility: investors may be betting too heavily on their expectations, not adjusting properly when market makers bend a stock's price accordingly. There's room for a strategy to take advantage of the market makers' temporary price adjustment by betting against investors' expectations.
Based on 1996-2011 data, the authors found an average return rate of 1.448% per transaction with a 3-day PAR lookback, 2 day hold strategy. Using 2007-2016 data, we were able to find an average return rate of 0.449%, which can be improved to just over 0.6% with reasonable firm size/liquidity filters and a PAR lookback of 5.
As this attemps an OOS validation of the original strategy, there appears to be some alpha decay for this strategy. However, it raises important questions about the accuracy of investor expectations. Earnings announcements are just one kind of event, and perhaps this strategy can be expanded to other anticipated events such as economic reports. This kind of market inefficiency could be widespread, and thus very profitable for those willing to harness it (See Predicting Earnings with Corporate Actions).
References:
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta
from pytz import utc
from odo import odo
import scipy
import math
from zipline.utils import tradingcalendar
from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
from quantopian.pipeline.factors import Returns, AverageDollarVolume, CustomFactor, SimpleMovingAverage
from quantopian.pipeline.factors.eventvestor import BusinessDaysUntilNextEarnings
from quantopian.pipeline.filters.morningstar import IsPrimaryShare, Q500US, Q1500US
from quantopian.pipeline.data import morningstar as mstar
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.classifiers.morningstar import Sector
def create_spread_matrix(start_func=lambda n: 0):
# Iterate through the PAR and hold length values; calculate the spread for each one.
par_range = range(2, 11)
hold_range = range(1, MAX_DAYS_TO_HOLD + 1)
spread_matrix = pd.DataFrame(columns=par_range, index=hold_range, dtype=float)
for par_lookback in par_range:
for days_to_hold in hold_range:
spread_matrix.loc[days_to_hold, par_lookback] = calc_spread(par_lookback, days_to_hold,
start_func=start_func)
return spread_matrix
def returns(sell_day, loc, start=0):
# Here's a function that calculates average returns for a
# specific hold length on any subset of the events.
# Calculate returns for holding specified stocks (`loc`) until a specified day.
price_data_subset = price_data_matched.loc[loc]
by_stock = (price_data_subset[sell_day] / price_data_subset[start]) - 1
return by_stock[(by_stock - by_stock.mean()).abs() <= 3 * by_stock.std()]
def calc_spread(par_lookback, sell_day, loc=True, start_func=lambda n: 0):
# Calculate returns spread between first and last PAR quantiles.
par_label = 'par_%d' % par_lookback
return (returns(sell_day, (data[par_label] == 0) & loc, start_func(sell_day)).mean()
- returns(sell_day, (data[par_label] == RETURNS_QUANTILES - 1) & loc, start_func(sell_day)).mean())
def fill_prices(df):
# Get the desired pricing data for events that took place on the same day.
day_loc = cal.get_loc(df.index.get_level_values(0)[0])
prices = get_pricing(set(df.index.get_level_values(1)) | {BENCHMARK},
cal[day_loc + PRICE_DAY_OFFSETS[0]],
cal[day_loc + PRICE_DAY_OFFSETS[-1]],
fields='open_price').transpose()
prices.columns = PRICE_DAY_OFFSETS
return prices
def variable_lookback_returns(max_lookback, **kwargs):
# Create a multiple-output returns custom factor, with an output for each lookback length.
class VariableLookbackReturns(CustomFactor):
inputs = [USEquityPricing.close]
window_length = max_lookback
outputs = map(lambda n: 'par_%d' % n, range(2, max_lookback + 1))
def compute(self, today, assets, out, closes):
for i in range(2, max_lookback + 1):
out['par_%d' % i][:] = (closes[-1] / closes[-i]) - 1
return VariableLookbackReturns(**kwargs)
def offset_returns_factor(days, **kwargs):
# Creates a custom factor for returns n days ago.
kwargs['window_length'] += days
class OffsetReturns(CustomFactor):
inputs = [USEquityPricing.close]
def compute(self, today, assets, out, closes):
out[:] = (closes[-days - 1] / closes[0]) - 1
return OffsetReturns(**kwargs)
def split_run_pipeline(pipeline, start, end, segments):
dts = np.arange(start, end, (end - start) / segments)
if len(dts) == segments:
dts = np.append(dts, end)
return pd.concat(map(
lambda i: run_pipeline(pipeline, dts[i], dts[i + 1] - pd.Timedelta('1d')),
range(segments)
))
def filter_universe(min_price = 0., min_volume = 0.):
"""
Computes a security universe based on nine different filters:
1. The security is common stock
2 & 3. It is not limited partnership - name and database check
4. The database has fundamental data on this stock
5. Not over the counter
6. Not when issued
7. Not depository receipts
8. Is Primary share
9. Has high dollar volume
Returns
-------
high_volume_tradable - zipline.pipeline.factor.Rank
A ranked AverageDollarVolume factor that's filtered on the nine criteria
"""
common_stock = mstar.share_class_reference.security_type.latest.eq('ST00000001')
not_lp_name = ~mstar.company_reference.standard_name.latest.matches('.* L[\\. ]?P\.?$')
not_lp_balance_sheet = mstar.balance_sheet.limited_partnership.latest.isnull()
have_data = mstar.valuation.market_cap.latest.notnull()
not_otc = ~mstar.share_class_reference.exchange_id.latest.startswith('OTC')
not_wi = ~mstar.share_class_reference.symbol.latest.endswith('.WI')
not_depository = ~mstar.share_class_reference.is_depositary_receipt.latest
primary_share = IsPrimaryShare()
# Combine the above filters.
tradable_filter = (common_stock & not_lp_name & not_lp_balance_sheet &
have_data & not_otc & not_wi & not_depository & primary_share)
price = SimpleMovingAverage(inputs=[USEquityPricing.close],
window_length=252, mask=tradable_filter)
volume = SimpleMovingAverage(inputs=[USEquityPricing.volume],
window_length=252, mask=tradable_filter)
full_filter = tradable_filter & (price >= min_price) & (volume >= min_volume)
high_volume_tradable = AverageDollarVolume(
window_length=252,
mask=full_filter
).rank(ascending=False)
return high_volume_tradable
class SidFactor(CustomFactor):
"""
Workaround to screen by sids in pipeline
Credit: Luca
"""
inputs = []
window_length = 1
sids = []
def compute(self, today, assets, out):
out[:] = np.in1d(assets, self.sids)
def get_liquid_universe_of_stocks(start_date, end_date, top_liquid=500):
"""
Gets the top X number of securities based on the criteria defined in
`filter_universe`
Parameters
----------
start_date : string or pd.datetime
Starting date for universe computation.
end_date : string or pd.datetime
End date for universe computation.
top_liquid : int, optional
Limit universe to the top N most liquid names in time period.
Based on 21 day AverageDollarVolume
Returns
-------
security_universe : list
List of securities that match the universe criteria
"""
pipe = Pipeline()
pipe.add(AverageDollarVolume(window_length=1), 'liquidity')
pipe.set_screen((filter_universe() < top_liquid))
data = run_pipeline(pipe, start_date=start_date, end_date=end_date)
security_universe = data.index.levels[1].unique().tolist()
return security_universe
def get_cum_returns(prices, sid, date, days_before, days_after, benchmark_sid):
"""
Calculates cumulative and abnormal returns for the sid & benchmark
Parameters
----------
prices : pd.DataFrame
Pricing history DataFrame obtained from `get_pricing`. Index should
be the datetime index and sids should be columns.
sid : int or zipline.assets._assets.Equity object
Security that returns are being calculated for.
date : datetime object
Date that will be used as t=0 for cumulative return calcuations. All
returns will be calculated around this date.
days_before, days_after : int
Days before/after to be used to calculate returns for.
benchmark : int or zipline.assets._assets.Equity object
Returns
-------
sid_returns : pd.Series
Cumulative returns time series from days_before ~ days_after from date
for sid
benchmark_returns : pd.Series
Cumulative returns time series for benchmark sid
abnormal_returns : pd.Series
Abnomral cumulative returns time series for sid compared against benchmark
"""
day_zero_index = prices.index.searchsorted(date)
starting_index = max(day_zero_index - days_before, 0)
ending_index = min(day_zero_index + days_after + 1, len(prices.index) - 1)
if starting_index < 0 or ending_index >= len(prices.index):
return None
if sid == benchmark_sid:
temp_price = prices.iloc[starting_index:ending_index,:].loc[:,[sid]]
else:
temp_price = prices.iloc[starting_index:ending_index,:].loc[:,[sid, benchmark_sid]]
beta = calc_beta(sid, benchmark_sid, temp_price)
if beta is None:
return
daily_ret = temp_price.pct_change().fillna(0)
daily_ret['abnormal_returns'] = daily_ret[sid] - beta*daily_ret[benchmark_sid]
cum_returns = (daily_ret + 1).cumprod() - 1
try:
# If there's not enough data for event study,
# return None
cum_returns.index = range(starting_index - day_zero_index,
ending_index - day_zero_index)
except:
return None
sid_returns = cum_returns[sid] - cum_returns[sid].ix[0]
bench_returns = cum_returns[benchmark_sid] - cum_returns[benchmark_sid].ix[0]
abnormal_returns = cum_returns['abnormal_returns'] - cum_returns['abnormal_returns'].ix[0]
return sid_returns, bench_returns, abnormal_returns
def calc_beta(sid, benchmark, price_history):
"""
Calculate beta amounts for each security
Parameters
----------
sid : int or zipline.assets._assets.Equity object
Security that beta is being calculated for.
benchmark : int or zipline.assets._assets.Equity object
Benchmark that will be used to determine beta against
price_history: pd.DataFrame
DataFrame that contains pricing history for benchmark and
sid. Index is a datetimeindex and columns are sids. Should
already be truncated for date_window used to calculate beta.
Returns
-------
beta : float
Beta of security against benchmark calculated over the time
window contained in price_history
"""
if sid == benchmark:
return 1.0
stock_prices = price_history[sid].pct_change().dropna()
bench_prices = price_history[benchmark].pct_change().dropna()
aligned_prices = bench_prices.align(stock_prices,join='inner')
bench_prices = aligned_prices[0]
stock_prices = aligned_prices[1]
bench_prices = np.array( bench_prices.values )
stock_prices = np.array( stock_prices.values )
bench_prices = np.reshape(bench_prices,len(bench_prices))
stock_prices = np.reshape(stock_prices,len(stock_prices))
if len(stock_prices) == 0:
return None
regr_results = scipy.stats.linregress(y=stock_prices, x=bench_prices)
beta = regr_results[0]
p_value = regr_results[3]
if p_value > 0.05:
beta = 0.
return beta
def define_xticks(days_before, days_after):
"""
Defines a neat xtick label axis on multipes of 2 using X days before
and X days after.
Parameters
----------
days_before : int
Positive integer detailing the numbers of days before event date
days_after : int
Postiive integer detailing the number of days after an event date
Returns
-------
list : List of integers on multiples of 2 from [-days_before ~ days_after)
"""
day_numbers = [i for i in range(-days_before+1, days_after)]
xticks = [d for d in day_numbers if d%2 == 0]
return xticks
def plot_distribution_of_events(event_data, date_column, start_date, end_date):
"""
Plots the distribution of events
Parameters
----------
event_data : pd.DataFrame
DataFrame that contains the events data with date and sid columns as
a minimum. See interactive tutorials on quantopian.com/data
date_column : String
String that labels the date column to be used for the event. e.g. `asof_date`
start_date, end_date : Datetime
Start and end date to be used for the cutoff for the distribution plots
"""
event_data = event_data[(event_data[date_column] > start_date) &
(event_data[date_column] < end_date)]
s = pd.Series(event_data[date_column])
sns.set_palette('coolwarm')
distribution = s.groupby([s.dt.year, s.dt.month]).count()
distribution.plot(kind="bar", grid=False,
color=sns.color_palette())
plt.title("Distribution of events in time")
plt.ylabel("Number of event")
plt.xlabel("Date")
return distribution
def plot_cumulative_returns(cumulative_returns, days_before, days_after):
"""
Plots a cumulative return chart
Parameters
----------
cumulative_returns : pd.series
Series that contains the cumulative returns time series from
days_before ~ days_after from date for sid. See `get_cum_returns
days_before, days_after : Datetime
Positive integer detailing the numbers of days before/after event date
"""
xticks = define_xticks(days_before, days_after)
cumulative_returns.plot(xticks=xticks)
plt.grid(b=None, which=u'major', axis=u'y')
plt.title("Cumulative Return before and after event")
plt.xlabel("Window Length (t)")
plt.ylabel("Cumulative Return (r)")
plt.legend(["N=%s" % cumulative_returns.name], loc='best')
def plot_cumulative_returns_with_error_bars(cumulative_returns, returns_with_error,
days_before, days_after, abnormal=False):
"""
Plots a cumulative return chart with error bars. Can choose between abnormal returns
and simple returns
Parameters
----------
cumulative_returns : pd.Series
Series that contains the cumulative returns time series from
days_before ~ days_after from date for sid. See `get_cum_returns
returns_with_error: pd.Series
Series that contains the standard deviation of returns passed in through
`cumulative_returns`. See `get_returns`
days_before, days_after : Datetime
Positive integer detailing the numbers of days before/after event date
abnormal : Boolean, optional
If True, will plot labels indicating an abnormal returns chart
"""
xticks = define_xticks(days_before, days_after)
returns_with_error.ix[:-1] = 0
plt.errorbar(cumulative_returns.index, cumulative_returns, xerr=0, yerr=returns_with_error)
plt.grid(b=None, which=u'major', axis=u'y')
if abnormal:
plt.title("Cumulative Abnormal Return before and after event with error")
else:
plt.title("Cumulative Return before and after event with error")
plt.xlabel("Window Length (t)")
plt.ylabel("Cumulative Return (r)")
plt.legend()
def plot_cumulative_returns_against_benchmark(cumulative_returns,
benchmark_returns,
days_before, days_after):
"""
Plots a cumulative return chart against the benchmark returns
Parameters
----------
cumulative_returns, benchmark_returns : pd.series
Series that contains the cumulative returns time series from
days_before ~ days_after from date for sid/benchmark. See `get_cum_returns`
days_before, days_after : Datetime
Positive integer detailing the numbers of days before/after event date
"""
xticks = define_xticks(days_before, days_after)
cumulative_returns.plot(xticks=xticks, label="Event")
benchmark_returns.plot(xticks=xticks, label='Benchmark')
plt.title("Benchmark cum returns versus Event")
plt.ylabel("% Cumulative Return")
plt.xlabel("Time Window")
plt.legend(["Event", 'Benchmark'], loc='best')
plt.grid(b=None, which=u'major', axis=u'y')
def plot_cumulative_abnormal_returns(cumulative_returns,
abnormal_returns,
days_before, days_after):
"""
Plots a cumulative return chart against the abnormal returns
Parameters
----------
cumulative_returns, abnormal_returns : pd.series
Series that contains the cumulative returns time series against abnormal returns
from days_before ~ days_after from date for sid. See `get_cum_returns`
days_before, days_after : Datetime
Positive integer detailing the numbers of days before/after event date
"""
xticks = define_xticks(days_before, days_after)
abnormal_returns.plot(xticks=xticks, label="Abnormal Average Cumulative")
cumulative_returns.plot(xticks=xticks, label="Simple Average Cumulative")
plt.axhline(y=abnormal_returns.ix[0], linestyle='--', color='black', alpha=.3)
plt.axhline(y=abnormal_returns.max(), linestyle='--', color='black', alpha=.3)
plt.title("Cumulative Abnormal Returns versus Cumulative Returns")
plt.ylabel("% Cumulative Return")
plt.xlabel("Time Window")
plt.grid(b=None, which=u'major', axis=u'y')
plt.legend(["Abnormal Average Cumulative","Simple Average Cumulative"], loc='best')
def get_returns(event_data, benchmark, date_column, days_before, days_after,
use_liquid_stocks=False, top_liquid=1000, price_data=None):
"""
Calculates cumulative returns, benchmark returns, abnormal returns, and
volatility for cumulative and abnomral returns
Parameters
----------
event_data : pd.DataFrame
DataFrame that contains the events data with date and sid columns as
a minimum. See interactive tutorials on quantopian.com/data
benchmark : string, int, zipline.assets._assets.Equity object
Security to be used as benchmark for returns calculations. See `get_returns`
date_column : String
String that labels the date column to be used for the event. e.g. `asof_date`
days_before, days_after : Datetime
Positive integer detailing the numbers of days before/after event date
use_liquid_stocks : Boolean
If set to True, it will filter out any securities found in `event_data`
according to the filters found in `filter_universe`
top_liquid : Int
If use_liquid_stocks is True, top_liquid determines the top X amount of stocks
to return ranked on liquidity
Returns
-------
cumulative_returns, benchmark_returns, abnormal_returns
returns_volatiliy, abnormal_returns_volatility : pd.Series
valid_sids: list
Used to graph distribution of events (in case of use_liquid_stocks flag)
"""
cumulative_returns = []
benchmark_returns = []
abnormal_returns = []
valid_sids = []
liquid_stocks = None
print "Running Event Study"
for i, row in event_data[['sid', date_column]].iterrows():
sid, date = row
# Getting 10 extra days of data just to be sure
extra_days_before = math.ceil(days_before * 365.0/252.0) + 10
start_date = date - timedelta(days=extra_days_before)
extra_days_after = math.ceil(days_after * 365.0/252.0) + 10
end_date = date + timedelta(days=extra_days_after)
if use_liquid_stocks:
if liquid_stocks is None:
liquid_stocks = get_liquid_universe_of_stocks(date, date, top_liquid=top_liquid)
if sid not in liquid_stocks:
continue
valid_sids.append(sid)
# duplicated columns would break get_cum_returns
pr_sids = set([sid, benchmark])
if price_data is None:
prices = get_pricing(pr_sids, start_date=start_date,
end_date=end_date, fields='open_price')
prices = prices.shift(-1)
else:
prices = price_data.loc[date].loc[pr_sids].transpose()
date = 0
if date in prices.index:
results = get_cum_returns(prices, sid, date, days_before, days_after, benchmark)
if results is None:
print "Discarding event for %s on %s" % (symbols(sid),date)
continue
sid_returns, b_returns, ab_returns = results
cumulative_returns.append(sid_returns)
benchmark_returns.append(b_returns)
abnormal_returns.append(ab_returns)
sample_size = len(cumulative_returns)
returns_volatility = pd.concat(cumulative_returns, axis=1).std(axis=1)
abnormal_returns_volatility = pd.concat(abnormal_returns, axis=1).std(axis=1)
benchmark_returns = pd.concat(benchmark_returns, axis=1).mean(axis=1)
abnormal_returns = pd.concat(abnormal_returns, axis=1).mean(axis=1)
cumulative_returns = pd.concat(cumulative_returns, axis=1).mean(axis=1)
cumulative_returns.name = sample_size
return (cumulative_returns, benchmark_returns, abnormal_returns,
returns_volatility, abnormal_returns_volatility, valid_sids)
def custom_event_study(event_data, date_column='asof_date',
start_date='2007-01-01', end_date='2014-01-01',
benchmark=None, days_before=10, days_after=10, top_liquid=500,
use_liquid_stocks=True, plot_error=False, price_data=None):
"""
Calculates simple & cumulative returns for events and plots stock price movement
before and after the event date.
Parameters
----------
event_data : pd.DataFrame
DataFrame that contains the events data with date and sid columns as
a minimum. See interactive tutorials on quantopian.com/data
date_column : String
String that labels the date column to be used for the event. e.g. `asof_date`
start_date, end_date : Datetime
Start and end date to be used for the cutoff for the evenet study
benchmark : int or zipline.assets._assets.Equity object
Security to be used as benchmark for returns calculations. See `get_returns`
days_before, days_after : int
Days before/after to be used to calculate returns for.
top_liquid : Int
If use_liquid_stocks is True, top_liquid determines the top X amount of stocks
to return ranked on liquidity
use_liquid_stocks : Boolean
If set to True, it will filter out any securities found in `event_data`
according to the filters found in `filter_universe`
"""
if date_column not in event_data or not isinstance(event_data, pd.DataFrame) or 'sid' not in event_data:
raise KeyError("event_data not properly formatted for event study. Please make sure " \
"date_column and 'sid' are both present in the DataFrame")
if isinstance(benchmark, str):
raise TypeError("Benchmark must be an equity object. Please use symbols('ticker') to" \
"set your benchmark")
if benchmark is None:
benchmark = symbols('SPY')
print "Formatting Data"
start_date = pd.to_datetime(start_date)
end_date = pd.to_datetime(end_date)
event_data = event_data[(event_data[date_column] > start_date) &
(event_data[date_column] < end_date)]
event_data.sid = event_data.sid.apply(lambda x: int(x))
print "Getting Plots"
cumulative_returns, benchmark_returns, abnormal_returns, returns_volatility, \
abnormal_returns_volatility, valid_sids = get_returns(event_data, benchmark, date_column,
days_before, days_after,
use_liquid_stocks=use_liquid_stocks,
top_liquid=top_liquid,
price_data=price_data)
event_data = event_data[event_data.sid.isin(valid_sids)]
plt.subplot(5, 2, 1)
distribution = plot_distribution_of_events(event_data, date_column, start_date, end_date)
plt.subplot(5, 2, 2)
plot_cumulative_returns(cumulative_returns, days_before, days_after)
plt.subplot(5, 2, 5)
plot_cumulative_returns_against_benchmark(cumulative_returns, benchmark_returns,
days_before, days_after)
plt.subplot(5, 2, 6)
plot_cumulative_abnormal_returns(cumulative_returns, abnormal_returns,
days_before, days_after)
if plot_error:
plt.subplot(5, 2, 9)
plot_cumulative_returns_with_error_bars(cumulative_returns, returns_volatility,
days_before, days_after)
plt.subplot(5, 2, 10)
plot_cumulative_returns_with_error_bars(cumulative_returns, abnormal_returns_volatility,
days_before, days_after, abnormal=True)
return cumulative_returns, benchmark_returns, abnormal_returns, returns_volatility, \
abnormal_returns_volatility, valid_sids, distribution