
Quantpedia Series: Reversal during Earnings Announcements

By Nathan Wolfe

This research is published in partnership with Quantpedia, an online resource for discovering new trading ideas.

You can view the full Quantpedia series in the library along with other research and strategies.



Before proceeding: run the imports and helper-function cell at the end of this notebook.


Whitepaper authors: Eric C. So (ESo@mit.edu), Sean Wang

Whitepaper source: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2275982

Abstract

From Quantpedia:

> In general, reversal in price of an asset occurs due to investors' overreaction to asset-related news and the subsequent price correction. In this case, the most probable reason for the phenomenon, according to the authors, is the market makers' aversion to inventory risks that tend to increase dramatically in the pre-announcement period. Consequently, the market makers demand higher compensation for providing liquidity due to higher risk and therefore raise prices, which are expected to reverse after the earnings announcement.

As the paper does, I find evidence of returns reversal during earnings announcements; while the paper tested using data from 1996 to 2011, I used data from 2007 to 2016. The average reversal among all stocks in my data is 0.449%, compared to a result of 1.448% in the paper. I found that we can reasonably increase the reversal to 0.6% by selecting firms based on a minimum average dollar volume percentile, or based on a minimum market cap.

Introduction

>Several market frictions have the potential to significantly impact the efficiency and information content of market prices. This study focuses on the friction that arises from the need to locate a counterparty in order to complete a trade. Market makers typically mitigate this friction by matching would-be sellers with would-be buyers. When there is an imbalance between the quantities sought by buyers and sellers at a given price, market makers may absorb the order imbalance into their own account by serving as the trade counterparty. This practice is commonly known as liquidity provision.

Upon anticipation of a good (bad) earnings report, investors want to go long (short) in the stock; they often don't consider whether the stock is currently over- or under-priced, simply assuming that it will get a positive (negative) bump upon the release of the earnings report. Meanwhile, market makers see an imbalance in demand for long (short) positions; they thus raise (lower) the price to compensate for providing the liquidity necessary to fill the imbalance. Investors usually don't care about this price change, assuming the bump will come from the report.

After the earnings announcement, the demand imbalance should disappear, and with it the market makers' need for the price adjustment. Thus the market makers undo the recent price change, which produces a short-term reversal.

Our strategy will be to assume that new information provided by the release of the earnings report will be neutral on average. All that remains is to take advantage of the market makers' adjustment of the stock price, by taking an opposite position and waiting for them to reverse the change.
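
To make the logic concrete, here is a minimal, hypothetical sketch of the trading rule this implies (it is not the notebook's actual implementation): just before an announcement, rank stocks by their pre-announcement returns, buy the worst performers, short the best performers, and unwind after the report.

import pandas as pd

def reversal_positions(par, n_quantiles=5):
    # Hypothetical sketch: equal-weight long the worst pre-announcement-return
    # quantile and short the best quantile just before the announcement.
    quantile = pd.qcut(par, n_quantiles, labels=False)
    weights = pd.Series(0.0, index=par.index)
    longs = quantile == 0                  # biggest pre-announcement losers
    shorts = quantile == n_quantiles - 1   # biggest pre-announcement winners
    weights[longs] = 0.5 / max(longs.sum(), 1)
    weights[shorts] = -0.5 / max(shorts.sum(), 1)
    return weights

Here par would be a Series of pre-announcement returns (e.g., t-4 to t-2 returns) for the stocks announcing on a given day.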

>We show that a long (short) position in firms whose returns strongly underperform (outperform) the market in the three days prior to earnings announcements yields an average return of 145 basis points (bps) during the announcement window. By comparison, the average return to a comparable portfolio during non-announcement periods is 22 bps, indicating that return reversals increase more than six-fold during earnings announcements.

While I didn't compare against non-announcement conditions, I did find clear low/high reversals for the best/worst performing securities prior to earnings. I also examined how various universe selection conditions affect the returns of this strategy; while the authors selected based on market cap, I explored using average dollar volume as well as our Q500US and Q1500US liquid stock universes, finding that a minimum average dollar volume filter and the Q500US universe both performed well as universe selectors.

Table of Contents

You can navigate the rest of this notebook as follows:

  1. Methodology and Empirical Tests
  2. Robustness Tests
  3. Strategy Creation
  4. Conclusion
In [3]:
# Basic Variables
END = pd.Timestamp.now('US/Eastern').replace(tzinfo=utc) - pd.Timedelta('7d')

# For the free set, comment the above line and uncomment the below:
#END = pd.Timestamp.now('US/Eastern').replace(tzinfo=utc) - pd.Timedelta('731d')

START = pd.Timestamp('2007-01-01', tz=utc)
BENCHMARK = symbols('SPY')
RETURNS_QUANTILES = 5
PRICE_WINDOW = 8
MAX_DAYS_TO_HOLD = 20
PRICE_DAY_OFFSETS = range(-1, MAX_DAYS_TO_HOLD + 1)

# plot colors
color = '#0062AE'


1. Methodology and Empirical Tests

The methodology behind the study is based on the idea that a stock's short-term returns will reverse during an earnings announcement as an imbalance in demand dissipates. The data used in this research notebook is sourced from EventVestor's Earnings Calendar Dataset. The sample version is available from 2007 up to two years ago, while the premium version runs up to the present day.

Data Aggregation & Sample Selection

Below, a Pipeline pulls all earnings announcements for which the announcement date was known in advance, along with data from the day before each announcement on the company's average dollar volume, market cap, sector, and short-term returns.

I eliminate firms with prices below $5 to mitigate the influence of bid-ask bounce on our calculation of return reversals as noted in the paper. Our data starts on 2007-01-01 and spans up through the present day (2 years ago for the free set).

Definition: "PAR" stands for "Pre-Announcement Returns," referring to the stocks' short-term returns prior to their earnings announcements.

In [4]:
# Basic universe filters
tradable_filter = IsPrimaryShare() & mstar.valuation.market_cap.latest.notnull()
days_filter = BusinessDaysUntilNextEarnings().eq(1)
price_filter = USEquityPricing.close.latest > 5

# Factors for returns and liquidity
adv = AverageDollarVolume(mask=USEquityPricing.volume.latest > 0, window_length=30)
vlr = variable_lookback_returns(10, mask=adv.notnan())
market_cap = mstar.valuation.market_cap.latest

# Pipeline itself to calculate PAR
pipe = Pipeline(columns={'adv_percentile': adv.quantiles(100),
                         'sector': mstar.asset_classification.morningstar_sector_code.latest,
                         'q500us': Q500US(),
                         'q1500us': Q1500US(),
                         'market_cap': market_cap,
                         'market_cap_percentile': market_cap.quantiles(100),
                         # -3 to -2
                         'par_2': vlr.par_2.quantiles(RETURNS_QUANTILES),
                         # Used by the authors in original study: -4 to -2 
                         'par_3': vlr.par_3.quantiles(RETURNS_QUANTILES),
                         # -5 to -2
                         'par_4': vlr.par_4.quantiles(RETURNS_QUANTILES),
                         # -6 to -2
                         'par_5': vlr.par_5.quantiles(RETURNS_QUANTILES),
                         # -7 to -2
                         'par_6': vlr.par_6.quantiles(RETURNS_QUANTILES),
                         # -8 to -2
                         'par_7': vlr.par_7.quantiles(RETURNS_QUANTILES),
                         # -9 to -2
                         'par_8': vlr.par_8.quantiles(RETURNS_QUANTILES),
                         # -10 to -2
                         'par_9': vlr.par_9.quantiles(RETURNS_QUANTILES),
                         # -11 to -2
                         'par_10': vlr.par_10.quantiles(RETURNS_QUANTILES),
                         },
                screen=days_filter & tradable_filter & price_filter & (vlr.par_10.notnan()))

data = split_run_pipeline(pipe, START, END, 16)

# Get before/after pricing data for all announcement events using `get_pricing`.
cal = tradingcalendar.get_trading_days(START - pd.Timedelta('20d'),
                                       END + pd.Timedelta(days=MAX_DAYS_TO_HOLD * 2))
price_data = data.groupby(level=0).apply(fill_prices)
price_data_matched = price_data.reindex(data.index)
print 'Done fetching data.'
Done fetching data.
In [5]:
print "This sample consists of %s earnings announcements spanning %s to %s." % (len(data),
                                                                                START.year, END.year)
This sample consists of 83005 earnings announcements spanning 2007 to 2016.

Sample Statistics

> The extreme quintiles of PAR (i.e., quintiles Q5 and Q1) consist of firms that are generally smaller, possess lower book-to-market ratios and share prices, and have higher volatility and relative spreads.

Similar to the original whitepaper, I find that the lowest and highest quantiles are generally lower in market cap.

In [6]:
raw_data = data.reset_index()

raw_data.groupby("par_3").mean()[['market_cap', 'adv_percentile', 'market_cap_percentile']].iloc[1:]
Out[6]:
market_cap adv_percentile market_cap_percentile
par_3
0 4.417921e+09 67.095305 58.029011
1 7.338278e+09 71.215104 64.950549
2 8.365436e+09 72.470233 66.930130
3 7.918619e+09 72.070200 66.201380
4 5.741773e+09 68.866222 60.510481


Empirical Tests

This section documents the reversal effect during earnings by examining the spreads between the worst and best performing stocks by PAR returns quantile (0 vs 4 respectively). I find a final spread of 0.449% compared to the authors' final value of 1.448%.

The expected return is the return on the first quantile minus the return on the last quantile. I will call this difference the "spread," and use "spread" and "returns" interchangeably.
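
In terms of the helper functions loaded at the end of this notebook, calc_spread boils down to the following, where 'par_3' is the chosen PAR column and sell_day is the number of days held since t-1:

spread = (returns(sell_day, data['par_3'] == 0).mean()
          - returns(sell_day, data['par_3'] == RETURNS_QUANTILES - 1).mean())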

Return spreads by PAR quantile

Below is a comparison of average t-1 to t+1 returns when all equities are divided into quantiles based on their returns in the three days prior, as suggested by the paper. It looks like we do have some spread between the first and last quantiles.

In [7]:
baskets = np.zeros(RETURNS_QUANTILES)
for i in range(RETURNS_QUANTILES):
    baskets[i] = returns(2, data['par_3'] == i).mean()
plt.bar(range(RETURNS_QUANTILES), baskets, color=color, alpha=.8)
plt.xlabel('t-4 to t-2 returns quantile')
plt.ylabel('average t-1 to t+1 returns')
plt.title('Average returns during earnings announcement by PAR quantile')
print 'Spread: %f' % calc_spread(3, 2)
Spread: 0.004516

The strategy that the paper suggests is to go long in the first quantile and short in the last quantile for a market-neutral strategy. Given the spread here, it looks like that approach may be fruitful. Compare our spread of 0.449% with the paper's final value of 1.448%. Perhaps the alpha has decayed since 2011, but there are still optimizations which might improve our returns.

The main point here is that the firms with the highest short-term returns prior to the announcement have the lowest short-term returns following their announcement, and vice-versa. This is illustrated by the following plot; note how the returns curves seem to bounce backward from their trend from right before the announcement.

In [8]:
means = pd.DataFrame(columns=range(RETURNS_QUANTILES), dtype=float)

for q in means.columns:
    subset = price_data_matched.loc[data['par_3'] == q]
    means[q] = (subset.transpose() / subset[0]).loc[-1:].mean(axis=1)

means.plot()
plt.xlabel('days since t-1')
plt.ylabel('mean price, normalized at t-1')
plt.title('Average cumulative returns over time by PAR quantile');


2. Robustness Tests

So far, my research supports the existence of reversals in the best and worst performing stocks prior to an earnings announcement. This section examines the consistency of these results along a number of dimensions: liquidity, sector, and liquidity combined with sector.

In [9]:
years = range(START.year, END.year + 1)
baskets = pd.Series(index=years)
for y in years:
    baskets[y] = calc_spread(5, 2, data.index.get_level_values(0).year == y)
plt.bar(years, baskets, alpha=.9, color=color)
plt.xlabel('year')
plt.ylabel('average spread')
plt.title('Returns by year');

The consistency is variable, as you'd expect from an events-based strategy. Additionally, as noted in the main study section, there appears to be some alpha decay in the returns over time. Consistency is quantified by dividing the spread by the standard deviation of the yearly spreads shown above, which gives a primitive Sharpe-like ratio.

Consistency by Liquidity

In [10]:
years = range(START.year, END.year + 1)
adv_sharpes = pd.Series(index=range(100))
for adv in adv_sharpes.index:
    baskets = pd.Series(index=years)
    for y in years:
        try:
            baskets[y] = calc_spread(5, 2, (data['adv_percentile'] >= adv)
                                     & (data.index.get_level_values(0).year == y))
        except KeyError:
            baskets[y] = np.nan
    adv_sharpes[adv] = calc_spread(5, 2, data['adv_percentile'] >= adv) / baskets.std()
adv_sharpes.head()
Out[10]:
0    0.592761
1    0.592761
2    0.592718
3    0.590417
4    0.591118
dtype: float64

Plotting the values below, it seems that low ADV floors give higher consistency (possibly because more events are available), but the ratio also rises for high ADV floors because the strategy performs better on high-ADV stocks.

In [11]:
plt.bar(range(100), adv_sharpes, color=color)
plt.xlabel('minimum ADV percentile')
plt.ylabel('"Sharpe" ratio')
plt.title('Consistent returns by ADV percentile floor');

Consistency by Sector

In [12]:
# Event Counts by Sector
data['sector'].value_counts()
Out[12]:
 310    13047
 311    12994
 102    12333
 103    12264
 206     9258
 309     5608
 104     4987
 101     4448
 205     3889
 207     2719
 308     1299
-1        159
dtype: int64

"-1" for the sector means uncategorized. Since there are so few such events, we'll exclude them from this analysis.

In [13]:
SECTORS = [101, 102, 103, 104, 205, 206, 207, 308, 309, 310, 311]
SECTOR_NAMES = pd.Series({
 101: 'Basic Materials',
 102: 'Consumer Cyclical',
 103: 'Financial Services',
 104: 'Real Estate',
 205: 'Consumer Defensive',
 206: 'Healthcare',
 207: 'Utilities',
 308: 'Communication Services',
 309: 'Energy',
 310: 'Industrials',
 311: 'Technology' ,
})
baskets = pd.Series(index=range(len(SECTORS)))
for i in range(len(SECTORS)):
    baskets[i] = calc_spread(5, 3, data['sector'] == SECTORS[i])
plt.bar(baskets.index, baskets, color=color)
plt.xticks(range(len(SECTORS)), SECTOR_NAMES, rotation=25)
plt.title('Average returns by sector');

The strategy doesn't look too reliant on any particular sector. Communication Services is a little high, but that sector also has relatively few events.

Consistency by Sector & Liquidity

Here I make a final robustness test by examining how the strategy does by sector and also by liquidity filter. I use both ADV percentile thresholds and the Q1500US and Q500US liquid universes.

In [14]:
liquidity_names = ['all', 'adv_50', 'adv_70', 'adv_90', 'q1500us', 'q500us']
liquidity_locs = {
    'all': True,
    'adv_50': data['adv_percentile'] >= 50,
    'adv_70': data['adv_percentile'] >= 70,
    'adv_90': data['adv_percentile'] >= 90,
    'q1500us': data['q1500us'],
    'q500us': data['q500us'],
}
spread_matrix = pd.DataFrame(columns=liquidity_names, index=SECTOR_NAMES, dtype=float)
freq_matrix = pd.DataFrame(columns=liquidity_names, index=SECTOR_NAMES, dtype=int)
for liquidity_filter in liquidity_names:
    for sector in SECTORS:
        loc = liquidity_locs[liquidity_filter] & (data['sector'] == sector)
        spread_matrix.loc[SECTOR_NAMES[sector], liquidity_filter] = calc_spread(5, 3, loc)
        freq_matrix.loc[SECTOR_NAMES[sector], liquidity_filter] = len(data[loc])
freq_matrix.loc['all'] = freq_matrix.sum()
spread_matrix.loc['all'] = (spread_matrix * freq_matrix).sum() / freq_matrix.loc['all']
ax = sns.heatmap(spread_matrix, annot=True)
ax.set(xlabel='liquidity filter', ylabel='sector', title='Returns by sector and liquidity filter')
print 'Event frequency:'
freq_matrix
Event frequency:
Out[14]:
all adv_50 adv_70 adv_90 q1500us q500us
Basic Materials 4448 3904 2652 978 2493 910
Consumer Cyclical 12333 10302 7461 2628 7189 2503
Financial Services 12264 9004 5879 2081 5244 1858
Real Estate 4987 4445 3243 760 2949 659
Consumer Defensive 3889 3172 2382 1299 2283 1260
Healthcare 9258 7714 5217 1723 4807 1593
Utilities 2719 2526 2071 708 1920 642
Communication Services 1299 1115 739 396 702 365
Energy 5608 5150 4073 1829 3299 1652
Industrials 13047 10940 7142 2007 6789 1864
Technology 12994 10615 7069 2517 6566 2339
all 82846 68887 47928 16926 44241 15645

The liquidity filters that perform best are ADV percentile >= 90 and Q500US, and they look fairly consistent across sectors. Sector 308 (communications) is very profitable on average, but it also has relatively few events, so that may be pure chance.


3. Strategy Creation

With results from both the empirical and robustness tests suggesting the possibility of a viable trading strategy, I take a look at a few of the parameters and attempt to roughly optimize for returns. It's important to note that the following optimizations carry a risk of overfitting the strategy parameters.

The core framework of the strategy is provided by Quantpedia:

>The investment universe consists of stocks listed at NYSE, AMEX, and NASDAQ, whose daily price data are available at CRSP database. Earnings-announcement dates are collected from Compustat. Firstly, the investor sorts stocks into quintiles based on firm size. Then he further sorts the stocks in the top quintile (the biggest) into quintiles based on their average returns in the 3-day window between t-4 and t-2, where t is the day of earnings announcement. The investor goes long on the bottom quintile (past losers) and short on the top quintile (past winners) and holds the stocks during the 3-day window between t-1, t, and t+1. Stocks in the portfolios are weighted equally.
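
As a rough, hypothetical pandas sketch of those rules (assuming a DataFrame announcements indexed by stock, with market_cap and par_3 columns for a single announcement date; this is not the notebook's own pipeline code):

import pandas as pd

def quantpedia_baskets(announcements, n_quantiles=5):
    # Keep only the largest size quintile, then sort it into PAR quintiles.
    size_q = pd.qcut(announcements['market_cap'], n_quantiles, labels=False)
    biggest = announcements[size_q == n_quantiles - 1]
    par_q = pd.qcut(biggest['par_3'], n_quantiles, labels=False)
    longs = biggest[par_q == 0].index                 # past losers
    shorts = biggest[par_q == n_quantiles - 1].index  # past winners
    # Equal-weight each basket and hold from t-1 through t+1.
    return longs, shorts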

Optimal PAR lookback window

The original paper uses a lookback window of 3 days (t-4 to t-2) and a holding period of 2 days (t-1 to t+1), but I explore a few other parameter values as possible optimizations.
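
In terms of the helpers loaded at the end of the notebook, each cell of the heatmap below is roughly

calc_spread(par_lookback, days_since_t_minus_1)

i.e., the quantile-0 minus quantile-4 spread from buying at t-1 and selling the given number of days later, so reading down a column traces the cumulative spread as the holding period extends.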

In [15]:
ax = sns.heatmap(create_spread_matrix(), annot=True)
ax.set(title='Cumulative returns by PAR lookback and days since t-1',
       xlabel='PAR lookback',
       ylabel='days since t-1');

The above heatmap suggests that a PAR lookback window of 5 days is strong compared to other values. For the rest of this section, I'll stick to using the PAR lookback of 5 for the best results.

The heatmap above also suggests there may be some returns to be squeezed out by holding for longer periods of time, much longer than the 2 days the paper recommends. The heatmap below shows the marginal return for each additional holding day, i.e., the spread earned from day n-1 to day n.

Optimal holding period

In [16]:
ax = sns.heatmap(create_spread_matrix(lambda n: n - 1), annot=True)
ax.set(title='Marginal returns by PAR lookback and days since t-1',
       xlabel='PAR lookback',
       ylabel='days since t-1');

Marginal returns are clearly strong for the first 2 days that the stocks are held; marginal returns afterward are small but positive on average. This suggests that a good strategy should prioritize holding stocks for the crucial first 2 days, but could hold on longer if it has excess cash for mild returns.

Market Cap or ADV?

The paper specifies that the strategy performs better when restricted to firms with high market caps. However, average dollar volume is also a commonly used proxy for liquidity, and the choice between market cap and ADV constraints can produce significantly different results for a trading strategy.

In [17]:
baskets = np.zeros(100)
for i in range(100):
    baskets[i] = calc_spread(5, 2, data['market_cap_percentile'] >= i)
plt.bar(range(100), baskets, color=color)
plt.xlabel('minimum market cap percentile')
plt.ylabel('t-1 to t+1 returns spread')
plt.title('Returns increase as market cap floor increases')
plt.ylim([0, 0.012]);

It seems that the paper was accurate: this strategy works significantly better with high market cap stocks.

But another important factor in the feasibility of the strategy is trading in liquid stocks, and average dollar volume can be used as a proxy for firm size. The same plot as above is shown below using ADV percentile rather than market cap.

In [18]:
baskets = np.zeros(100)
for i in range(100):
    baskets[i] = calc_spread(5, 2, data['adv_percentile'] >= i)
plt.bar(range(100), baskets, color=color)
plt.xlabel('minimum ADV percentile')
plt.ylabel('t-1 to t+1 returns spread')
plt.title('Returns increase as ADV floor increases')
plt.ylim([0, 0.012]);

As you can see, our spread (and thus the long-short strategy's returns) increases as we raise the minimum ADV. But this obviously decreases the number of events we have to work with, so there's a trade-off.

Cumulative returns with ADV & PAR lookback parameters

For the rest of our study we'll set a minimum ADV threshold of the 95th percentile to improve our results. Let's try charting the earnings announcement reversal with our optimized PAR lookback of 5 days and this ADV filter.

In [ ]:
means = pd.DataFrame(columns=range(RETURNS_QUANTILES), dtype=float)

for q in means.columns:
    subset = price_data_matched.loc[(data['par_5'] == q) & (data['adv_percentile'] >= 95)]
    means[q] = (subset.transpose() / subset[0]).loc[-1:].mean(axis=1)

means.plot()
plt.xlabel('days since t-1')
plt.ylabel('mean price, normalized at t-1')
plt.title('Average cumulative returns over time by PAR quantile, ADV percentile >= 95');

The above spread looks a lot more impressive with our improved parameters. Let's run a couple of quick event studies on the events that meet our criteria: firms with an ADV percentile of 95 or greater, in the first or last quantile of 5-day PAR.

Worst performing PAR quantile (0)

In [ ]:
earnings_announcements = data[data['adv_percentile'] >= 95].reset_index().rename(
    columns={'level_0': 'date', "level_1": "sid"}
)

q0 = earnings_announcements[earnings_announcements['par_5'] == 0]

results = custom_event_study(q0, date_column='date',
                               start_date=START,
                               end_date=END,
                               benchmark=BENCHMARK, days_before=1, days_after=10, top_liquid=500,
                               use_liquid_stocks=False, plot_error=False, price_data=price_data)

Best performing PAR quantile (4)

In [ ]:
q4 = earnings_announcements[earnings_announcements['par_5'] == RETURNS_QUANTILES - 1]

results = custom_event_study(q4, date_column='date',
                               start_date=START,
                               end_date=END,
                               benchmark=BENCHMARK, days_before=1, days_after=10, top_liquid=500,
                               use_liquid_stocks=False, plot_error=False, price_data=price_data)

Compared to the event study in the empirical tests section, the magnitude of the price reversals is much clearer here, as are the spreads between the best and worst performing PAR quantiles.

At this point, the research gives enough confidence that a viable trading strategy is possible. You can find those results in the backtest.


Conclusion

So and Wang's paper identifies an interesting possibility: investors may be betting too heavily on their expectations, not adjusting properly when market makers bend a stock's price accordingly. There's room for a strategy to take advantage of the market makers' temporary price adjustment by betting against investors' expectations.

Based on 1996-2011 data, the authors found an average return rate of 1.448% per transaction with a 3-day PAR lookback, 2-day hold strategy. Using 2007-2016 data, I found an average return rate of 0.449%, which can be improved to just over 0.6% with reasonable firm size/liquidity filters and a PAR lookback of 5.

As this is an attempt at an out-of-sample (OOS) validation of the original strategy, there appears to be some alpha decay. However, the result still raises important questions about the accuracy of investor expectations. Earnings announcements are just one kind of event, and perhaps this strategy can be expanded to other anticipated events such as economic reports. This kind of market inefficiency could be widespread, and thus very profitable for those willing to harness it (see Predicting Earnings with Corporate Actions).

Imports and helper functions

In [ ]:
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta
from pytz import utc
from odo import odo
import scipy
import math

from zipline.utils import tradingcalendar

from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
from quantopian.pipeline.factors import Returns, AverageDollarVolume, CustomFactor, SimpleMovingAverage
from quantopian.pipeline.factors.eventvestor import BusinessDaysUntilNextEarnings
from quantopian.pipeline.filters.morningstar import IsPrimaryShare, Q500US, Q1500US
from quantopian.pipeline.data import morningstar as mstar
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.classifiers.morningstar import Sector

def create_spread_matrix(start_func=lambda n: 0):
    # Iterate through the PAR and hold length values; calculate the spread for each one.
    par_range = range(2, 11)
    hold_range = range(1, MAX_DAYS_TO_HOLD + 1)
    spread_matrix = pd.DataFrame(columns=par_range, index=hold_range, dtype=float)
    for par_lookback in par_range:
        for days_to_hold in hold_range:
            spread_matrix.loc[days_to_hold, par_lookback] = calc_spread(par_lookback, days_to_hold,
                                                                        start_func=start_func)
    return spread_matrix

def returns(sell_day, loc, start=0):
    # Here's a function that calculates average returns for a
    # specific hold length on any subset of the events.
    # Calculate returns for holding specified stocks (`loc`) until a specified day.
    price_data_subset = price_data_matched.loc[loc]
    by_stock = (price_data_subset[sell_day] / price_data_subset[start]) - 1
    return by_stock[(by_stock - by_stock.mean()).abs() <= 3 * by_stock.std()]

def calc_spread(par_lookback, sell_day, loc=True, start_func=lambda n: 0):
    # Calculate returns spread between first and last PAR quantiles.
    par_label = 'par_%d' % par_lookback
    return (returns(sell_day, (data[par_label] == 0) & loc, start_func(sell_day)).mean()
            - returns(sell_day, (data[par_label] == RETURNS_QUANTILES - 1) & loc, start_func(sell_day)).mean())

def fill_prices(df):
    # Get the desired pricing data for events that took place on the same day.
    day_loc = cal.get_loc(df.index.get_level_values(0)[0])
    prices = get_pricing(set(df.index.get_level_values(1)) | {BENCHMARK},
                         cal[day_loc + PRICE_DAY_OFFSETS[0]],
                         cal[day_loc + PRICE_DAY_OFFSETS[-1]],
                         fields='open_price').transpose()
    prices.columns = PRICE_DAY_OFFSETS
    return prices

def variable_lookback_returns(max_lookback, **kwargs):
    # Create a multiple-output returns custom factor, with an output for each lookback length.
    class VariableLookbackReturns(CustomFactor):
        inputs = [USEquityPricing.close]
        window_length = max_lookback
        outputs = map(lambda n: 'par_%d' % n, range(2, max_lookback + 1))
        def compute(self, today, assets, out, closes):
            for i in range(2, max_lookback + 1):
                out['par_%d' % i][:] = (closes[-1] / closes[-i]) - 1
    return VariableLookbackReturns(**kwargs)

def offset_returns_factor(days, **kwargs):
    # Creates a custom factor for returns n days ago.
    
    kwargs['window_length'] += days
    
    class OffsetReturns(CustomFactor):
        inputs = [USEquityPricing.close]
        def compute(self, today, assets, out, closes):
            out[:] = (closes[-days - 1] / closes[0]) - 1
    return OffsetReturns(**kwargs)

def split_run_pipeline(pipeline, start, end, segments):
    dts = np.arange(start, end, (end - start) / segments)
    if len(dts) == segments:
        dts = np.append(dts, end)
    return pd.concat(map(
        lambda i: run_pipeline(pipeline, dts[i], dts[i + 1] - pd.Timedelta('1d')),
        range(segments)
    ))

def filter_universe(min_price = 0., min_volume = 0.):  
    """
    Computes a security universe based on nine different filters:

    1. The security is common stock
    2 & 3. It is not a limited partnership (name and database check)
    4. The database has fundamental data on this stock
    5. Not over the counter
    6. Not when issued
    7. Not depository receipts
    8. Is Primary share
    9. Has high dollar volume
    
    Returns
    -------
    high_volume_tradable - zipline.pipeline.factor.Rank
        A ranked AverageDollarVolume factor that's filtered on the nine criteria
    """
    common_stock = mstar.share_class_reference.security_type.latest.eq('ST00000001')
    not_lp_name = ~mstar.company_reference.standard_name.latest.matches('.* L[\\. ]?P\.?$')
    not_lp_balance_sheet = mstar.balance_sheet.limited_partnership.latest.isnull()
    have_data = mstar.valuation.market_cap.latest.notnull()
    not_otc = ~mstar.share_class_reference.exchange_id.latest.startswith('OTC')
    not_wi = ~mstar.share_class_reference.symbol.latest.endswith('.WI')
    not_depository = ~mstar.share_class_reference.is_depositary_receipt.latest
    primary_share = IsPrimaryShare()
    
    # Combine the above filters.
    tradable_filter = (common_stock & not_lp_name & not_lp_balance_sheet &
                       have_data & not_otc & not_wi & not_depository & primary_share)

    price = SimpleMovingAverage(inputs=[USEquityPricing.close],
                                window_length=252, mask=tradable_filter)
    volume = SimpleMovingAverage(inputs=[USEquityPricing.volume],
                                 window_length=252, mask=tradable_filter)
    full_filter = tradable_filter & (price >= min_price) & (volume >= min_volume)
    
    high_volume_tradable = AverageDollarVolume(
            window_length=252,
            mask=full_filter
        ).rank(ascending=False)
    return high_volume_tradable

class SidFactor(CustomFactor):
    """
    Workaround to screen by sids in pipeline
    
    Credit: Luca
    """
    inputs = []  
    window_length = 1  
    sids = []

    def compute(self, today, assets, out):
        out[:] = np.in1d(assets, self.sids)

def get_liquid_universe_of_stocks(start_date, end_date, top_liquid=500):    
    """
    Gets the top X number of securities based on the criteria defined in
    `filter_universe`
    
    Parameters
    ----------
    start_date : string or pd.datetime
        Starting date for universe computation.
    end_date : string or pd.datetime
        End date for universe computation.
    top_liquid : int, optional
        Limit universe to the top N most liquid names in time period.
        Based on 21 day AverageDollarVolume
        
    Returns
    -------
    security_universe : list
        List of securities that match the universe criteria
    """
    pipe = Pipeline()
    pipe.add(AverageDollarVolume(window_length=1), 'liquidity')
    pipe.set_screen((filter_universe() < top_liquid))
    data = run_pipeline(pipe, start_date=start_date, end_date=end_date)

    security_universe = data.index.levels[1].unique().tolist()
    return security_universe

def get_cum_returns(prices, sid, date, days_before, days_after, benchmark_sid):
    """
    Calculates cumulative and abnormal returns for the sid & benchmark
    
    Parameters
    ----------
    prices : pd.DataFrame
        Pricing history DataFrame obtained from `get_pricing`. Index should
        be the datetime index and sids should be columns.
    sid : int or zipline.assets._assets.Equity object
        Security that returns are being calculated for.
    date : datetime object
        Date that will be used as t=0 for cumulative return calculations. All
        returns will be calculated around this date.
    days_before, days_after : int
        Days before/after to be used to calculate returns for.
    benchmark :  int or zipline.assets._assets.Equity object
    
    Returns
    -------
    sid_returns : pd.Series
        Cumulative returns time series from days_before ~ days_after from date
        for sid
    benchmark_returns : pd.Series
        Cumulative returns time series for benchmark sid
    abnormal_returns : pd.Series
        Abnormal cumulative returns time series for sid compared against benchmark
    """

    day_zero_index = prices.index.searchsorted(date)
    starting_index = max(day_zero_index - days_before, 0)
    ending_index   = min(day_zero_index + days_after + 1, len(prices.index) - 1)

    if starting_index < 0 or ending_index >= len(prices.index):
        return None
    
    if sid == benchmark_sid:
        temp_price = prices.iloc[starting_index:ending_index,:].loc[:,[sid]]
    else:
        temp_price = prices.iloc[starting_index:ending_index,:].loc[:,[sid, benchmark_sid]]
            
    beta = calc_beta(sid, benchmark_sid, temp_price)
    if beta is None:
        return
    
    daily_ret = temp_price.pct_change().fillna(0)
    
    daily_ret['abnormal_returns'] = daily_ret[sid] - beta*daily_ret[benchmark_sid]
    
    cum_returns = (daily_ret + 1).cumprod() - 1
    
    try:
        # If there's not enough data for event study,
        # return None
        cum_returns.index = range(starting_index - day_zero_index,
                                  ending_index - day_zero_index)
    except:
        return None
    
    sid_returns      = cum_returns[sid] - cum_returns[sid].ix[0]
    bench_returns    = cum_returns[benchmark_sid] - cum_returns[benchmark_sid].ix[0]
    abnormal_returns = cum_returns['abnormal_returns'] - cum_returns['abnormal_returns'].ix[0]
    
    return sid_returns, bench_returns, abnormal_returns

def calc_beta(sid, benchmark, price_history):
    """
    Calculate beta amounts for each security
    
    Parameters
    ----------
    sid : int or zipline.assets._assets.Equity object
        Security that beta is being calculated for.
    benchmark : int or zipline.assets._assets.Equity object
        Benchmark that will be used to determine beta against
    price_history: pd.DataFrame
        DataFrame that contains pricing history for benchmark and
        sid. Index is a datetimeindex and columns are sids. Should 
        already be truncated for date_window used to calculate beta.
        
    Returns
    -------
    beta : float
        Beta of security against benchmark calculated over the time
        window contained in price_history
    """
    if sid == benchmark:
        return 1.0
    
    stock_prices = price_history[sid].pct_change().dropna()
    bench_prices = price_history[benchmark].pct_change().dropna()
    aligned_prices = bench_prices.align(stock_prices,join='inner')
    bench_prices = aligned_prices[0]
    stock_prices = aligned_prices[1]
    bench_prices = np.array( bench_prices.values )
    stock_prices = np.array( stock_prices.values )
    bench_prices = np.reshape(bench_prices,len(bench_prices))
    stock_prices = np.reshape(stock_prices,len(stock_prices))
    if len(stock_prices) == 0:
        return None
    regr_results = scipy.stats.linregress(y=stock_prices, x=bench_prices)  
    beta = regr_results[0]  
    p_value = regr_results[3]
    if p_value > 0.05:
        beta = 0.
    return beta  

def define_xticks(days_before, days_after):
    """
    Defines a neat xtick label axis on multiples of 2 using X days before
    and X days after.
    
    Parameters
    ----------
    days_before : int
        Positive integer detailing the numbers of days before event date
    days_after : int
        Positive integer detailing the number of days after an event date
        
    Returns
    -------
    list : List of integers on multiples of 2 from [-days_before ~ days_after)
    """
    day_numbers = [i for i in range(-days_before+1, days_after)]
    xticks = [d for d in day_numbers if d%2 == 0]
    return xticks

def plot_distribution_of_events(event_data, date_column, start_date, end_date):
    """
    Plots the distribution of events
    
    Parameters
    ----------
    event_data : pd.DataFrame
        DataFrame that contains the events data with date and sid columns as
        a minimum. See interactive tutorials on quantopian.com/data
    date_column : String
        String that labels the date column to be used for the event. e.g. `asof_date`
    start_date, end_date : Datetime
        Start and end date to be used for the cutoff for the distribution plots
    """
    event_data = event_data[(event_data[date_column] > start_date) &
                            (event_data[date_column] < end_date)]
    s = pd.Series(event_data[date_column])
    
    sns.set_palette('coolwarm')
    distribution = s.groupby([s.dt.year, s.dt.month]).count()
    distribution.plot(kind="bar", grid=False,
                      color=sns.color_palette())
    plt.title("Distribution of events in time")
    plt.ylabel("Number of event")
    plt.xlabel("Date")
    return distribution
    
    
def plot_cumulative_returns(cumulative_returns, days_before, days_after):
    """
    Plots a cumulative return chart
    
    Parameters
    ----------
    cumulative_returns : pd.series
        Series that contains the cumulative returns time series from
        days_before ~ days_after from date for sid. See `get_cum_returns`
    days_before, days_after : Datetime
        Positive integer detailing the numbers of days before/after event date
    """
    xticks = define_xticks(days_before, days_after)
    cumulative_returns.plot(xticks=xticks)
        
    plt.grid(b=None, which=u'major', axis=u'y')
    plt.title("Cumulative Return before and after event")
    plt.xlabel("Window Length (t)")
    plt.ylabel("Cumulative Return (r)")
    plt.legend(["N=%s" % cumulative_returns.name], loc='best')

def plot_cumulative_returns_with_error_bars(cumulative_returns, returns_with_error,
                                            days_before, days_after, abnormal=False):
    """
    Plots a cumulative return chart with error bars. Can choose between abnormal returns
    and simple returns
    
    Parameters
    ----------
    cumulative_returns : pd.Series
        Series that contains the cumulative returns time series from
        days_before ~ days_after from date for sid. See `get_cum_returns`
    returns_with_error: pd.Series
        Series that contains the standard deviation of returns passed in through
        `cumulative_returns`. See `get_returns`
    days_before, days_after : Datetime
        Positive integer detailing the numbers of days before/after event date
    abnormal : Boolean, optional
        If True, will plot labels indicating an abnormal returns chart
    """
    xticks = define_xticks(days_before, days_after)
    returns_with_error.ix[:-1] = 0
    plt.errorbar(cumulative_returns.index, cumulative_returns, xerr=0, yerr=returns_with_error)
    plt.grid(b=None, which=u'major', axis=u'y')
    if abnormal:
        plt.title("Cumulative Abnormal Return before and after event with error")
    else:
        plt.title("Cumulative Return before and after event with error")
    plt.xlabel("Window Length (t)")
    plt.ylabel("Cumulative Return (r)")
    plt.legend()
    
def plot_cumulative_returns_against_benchmark(cumulative_returns,
                                              benchmark_returns,
                                              days_before, days_after):
    """
    Plots a cumulative return chart against the benchmark returns
    
    Parameters
    ----------
    cumulative_returns, benchmark_returns : pd.series
        Series that contains the cumulative returns time series from
        days_before ~ days_after from date for sid/benchmark. See `get_cum_returns`
    days_before, days_after : Datetime
        Positive integer detailing the numbers of days before/after event date
    """
    xticks = define_xticks(days_before, days_after)
    cumulative_returns.plot(xticks=xticks, label="Event")
    benchmark_returns.plot(xticks=xticks, label='Benchmark')
    
    plt.title("Benchmark cum returns versus Event")
    plt.ylabel("% Cumulative Return")
    plt.xlabel("Time Window")
    plt.legend(["Event", 'Benchmark'], loc='best')
    plt.grid(b=None, which=u'major', axis=u'y')
    
def plot_cumulative_abnormal_returns(cumulative_returns,
                                     abnormal_returns,
                                     days_before, days_after):
    """
    Plots a cumulative return chart against the abnormal returns
    
    Parameters
    ----------
    cumulative_returns, abnormal_returns : pd.series
        Series that contains the cumulative returns time series against abnormal returns
        from days_before ~ days_after from date for sid. See `get_cum_returns`
    days_before, days_after : Datetime
        Positive integer detailing the numbers of days before/after event date
    """
    xticks = define_xticks(days_before, days_after)
    abnormal_returns.plot(xticks=xticks, label="Abnormal Average Cumulative")
    cumulative_returns.plot(xticks=xticks, label="Simple Average Cumulative")
    
    plt.axhline(y=abnormal_returns.ix[0], linestyle='--', color='black', alpha=.3)
    plt.axhline(y=abnormal_returns.max(), linestyle='--', color='black', alpha=.3)
    plt.title("Cumulative Abnormal Returns versus Cumulative Returns")
    plt.ylabel("% Cumulative Return")
    plt.xlabel("Time Window")
    plt.grid(b=None, which=u'major', axis=u'y')
    plt.legend(["Abnormal Average Cumulative","Simple Average Cumulative"], loc='best')
    
def get_returns(event_data, benchmark, date_column, days_before, days_after,
                use_liquid_stocks=False, top_liquid=1000, price_data=None):
    """
    Calculates cumulative returns, benchmark returns, abnormal returns, and
    volatility for cumulative and abnormal returns
    
    Parameters
    ----------
    event_data : pd.DataFrame
        DataFrame that contains the events data with date and sid columns as
        a minimum. See interactive tutorials on quantopian.com/data
    benchmark : string, int, zipline.assets._assets.Equity object
        Security to be used as benchmark for returns calculations. See `get_returns`
    date_column : String
        String that labels the date column to be used for the event. e.g. `asof_date`
    days_before, days_after : Datetime
        Positive integer detailing the numbers of days before/after event date
    use_liquid_stocks : Boolean
        If set to True, it will filter out any securities found in `event_data`
        according to the filters found in `filter_universe`
    top_liquid : Int
        If use_liquid_stocks is True, top_liquid determines the top X amount of stocks
        to return ranked on liquidity
        
        
    Returns
    -------
    cumulative_returns, benchmark_returns, abnormal_returns
    returns_volatility, abnormal_returns_volatility : pd.Series
    valid_sids: list
        Used to graph distribution of events (in case of use_liquid_stocks flag)
    """
    cumulative_returns = []
    benchmark_returns = []
    abnormal_returns = []
    valid_sids = []
    liquid_stocks = None
    
    print "Running Event Study"
    for i, row in event_data[['sid', date_column]].iterrows():
        sid, date = row
        
        # Getting 10 extra days of data just to be sure
        extra_days_before = math.ceil(days_before * 365.0/252.0) + 10
        start_date = date - timedelta(days=extra_days_before)
        extra_days_after = math.ceil(days_after * 365.0/252.0) + 10
        end_date   = date + timedelta(days=extra_days_after)

        if use_liquid_stocks:
            if liquid_stocks is None:
                liquid_stocks = get_liquid_universe_of_stocks(date, date, top_liquid=top_liquid)
            if sid not in liquid_stocks:
                continue
                
        valid_sids.append(sid)

        # duplicated columns would break get_cum_returns
        pr_sids = set([sid, benchmark])
        if price_data is None:
            prices = get_pricing(pr_sids, start_date=start_date,
                                 end_date=end_date, fields='open_price')
            prices = prices.shift(-1)
        else:
            prices = price_data.loc[date].loc[pr_sids].transpose()
            date = 0
        if date in prices.index:
            results = get_cum_returns(prices, sid, date, days_before, days_after, benchmark)
            if results is None:
                print "Discarding event for %s on %s" % (symbols(sid),date)
                continue
            sid_returns, b_returns, ab_returns = results
            cumulative_returns.append(sid_returns)
            benchmark_returns.append(b_returns)
            abnormal_returns.append(ab_returns)
            
    sample_size = len(cumulative_returns)
    returns_volatility          = pd.concat(cumulative_returns, axis=1).std(axis=1)
    abnormal_returns_volatility = pd.concat(abnormal_returns,   axis=1).std(axis=1)
    benchmark_returns           = pd.concat(benchmark_returns,  axis=1).mean(axis=1)
    abnormal_returns            = pd.concat(abnormal_returns,   axis=1).mean(axis=1)
    cumulative_returns          = pd.concat(cumulative_returns, axis=1).mean(axis=1)
    cumulative_returns.name = sample_size
        
    return (cumulative_returns, benchmark_returns, abnormal_returns,
            returns_volatility, abnormal_returns_volatility, valid_sids)
    
def custom_event_study(event_data, date_column='asof_date',
                       start_date='2007-01-01', end_date='2014-01-01',
                       benchmark=None, days_before=10, days_after=10, top_liquid=500,
                       use_liquid_stocks=True, plot_error=False, price_data=None):
    """
    Calculates simple & cumulative returns for events and plots stock price movement
    before and after the event date.
    
    Parameters
    ----------
    event_data : pd.DataFrame
        DataFrame that contains the events data with date and sid columns as
        a minimum. See interactive tutorials on quantopian.com/data
    date_column : String
        String that labels the date column to be used for the event. e.g. `asof_date`
    start_date, end_date : Datetime
        Start and end date to be used for the cutoff for the event study
    benchmark : int or zipline.assets._assets.Equity object
        Security to be used as benchmark for returns calculations. See `get_returns`
    days_before, days_after : int
        Days before/after to be used to calculate returns for.
    top_liquid : Int
        If use_liquid_stocks is True, top_liquid determines the top X amount of stocks
        to return ranked on liquidity
    use_liquid_stocks : Boolean
        If set to True, it will filter out any securities found in `event_data`
        according to the filters found in `filter_universe`
    """
    if date_column not in event_data or not isinstance(event_data, pd.DataFrame) or 'sid' not in event_data:
        raise KeyError("event_data not properly formatted for event study. Please make sure " \
                       "date_column and 'sid' are both present in the DataFrame")

    if isinstance(benchmark, str):
        raise TypeError("Benchmark must be an equity object. Please use symbols('ticker') to" \
                        "set your benchmark")
        
    if benchmark is None:
        benchmark = symbols('SPY')
        
    print "Formatting Data"

    start_date = pd.to_datetime(start_date)
    end_date = pd.to_datetime(end_date)
    event_data = event_data[(event_data[date_column] > start_date) &
                            (event_data[date_column] < end_date)]

    event_data.sid = event_data.sid.apply(lambda x: int(x))
    
    print "Getting Plots"
    cumulative_returns, benchmark_returns, abnormal_returns, returns_volatility, \
        abnormal_returns_volatility, valid_sids = get_returns(event_data, benchmark, date_column,
                                                              days_before, days_after,
                                                              use_liquid_stocks=use_liquid_stocks,
                                                              top_liquid=top_liquid,
                                                              price_data=price_data)
    event_data = event_data[event_data.sid.isin(valid_sids)]

    plt.subplot(5, 2, 1)
    distribution = plot_distribution_of_events(event_data, date_column, start_date, end_date)

    plt.subplot(5, 2, 2)
    plot_cumulative_returns(cumulative_returns, days_before, days_after)
    
    plt.subplot(5, 2, 5)
    plot_cumulative_returns_against_benchmark(cumulative_returns, benchmark_returns,
                                              days_before, days_after)
    
    plt.subplot(5, 2, 6)
    plot_cumulative_abnormal_returns(cumulative_returns, abnormal_returns,
                                     days_before, days_after)
    
    if plot_error:
        plt.subplot(5, 2, 9)
        plot_cumulative_returns_with_error_bars(cumulative_returns, returns_volatility,
                                                days_before, days_after)

        plt.subplot(5, 2, 10)
        plot_cumulative_returns_with_error_bars(cumulative_returns, abnormal_returns_volatility,
                                                days_before, days_after, abnormal=True)

    return cumulative_returns, benchmark_returns, abnormal_returns, returns_volatility, \
        abnormal_returns_volatility, valid_sids, distribution