
Building the Foundations for Hypothesis Testing

By Matt Lee

Introduction

> This tutorial shows you step-by-step how I collected all the data that I'll be using to create a research piece based on the author's original white paper. Specifically, this article focuses on generating data for evaluating a strategy proposed in a paper by Milian (2013).

In many cases, gathering data is the hardest part of running an analysis. Understanding what sample selection, universe, and variables you're working with will set the foundation for the rest of your research.

Whether you've been following the Quantpedia Series or want to learn how to conduct your own research, this tutorial will show you how to use Pipeline to extract the data you need. The tutorial here is based on a research piece that we're analyzing for the Quantpedia Trading Strategy Series.

Hypothesis Overview

The hypothesis I am testing is investigated in a research paper by Milian (2013). In his paper, Milian examines the well-known Post-Earnings-Announcement Drift effect (PEAD for short).

The PEAD is the tendency for a stock's returns to drift in the direction of an earnings surprise, even months after the news. This means that a positive earnings surprise tends to be followed by a positive trend in the company's stock over the next 60-90 days. This is somewhat counterintuitive - one would expect earnings announcement news to be priced directly into the value of the security on the day of the announcement.

As a result, many arbitrageurs have built trading strategies which seek to exploit the PEAD. Investors who trade on the PEAD typically go long the firms with the biggest positive earnings surprises and short the firms with the biggest negative earnings surprises, holding for a period of 2-3 months following the news.
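
To make the mechanics concrete, here is a minimal sketch of how such a decile-based long/short book could be formed. The tickers and surprise values are made up, and pd.qcut is just a stand-in for the ranking step - later in this notebook the deciles are computed from per-quarter cutoffs instead.

import pandas as pd

# Hypothetical earnings surprises (%), one per stock
surprises = pd.Series(
    [12.0, 8.3, 5.5, 3.1, 1.2, 0.3, -0.8, -2.1, -4.6, -9.8],
    index=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])

# Rank the surprises into deciles: 0 = most negative, 9 = most positive
decile = pd.qcut(surprises, 10, labels=False)

# Classic PEAD book: long the biggest positive surprises, short the biggest negative ones
longs = surprises[decile == 9].index
shorts = surprises[decile == 0].index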

However, recent research speculates that the PEAD has reversed in recent years, due to the overcrowding of arbitrageurs invested in PEAD strategies. In other words, firms that delivered the biggest positive earnings surprises now see significantly negative returns shortly after their subsequent earnings announcement.

There are a couple possible explanations for this reversal - which I'll save for the analysis article.

So! After reading some papers, I've come to a specific hypothesis I want to test - whether the PEAD effect has indeed reversed in recent years. If it has, we can potentially leverage this anomaly into a successful trading strategy. This hypothesis also opens up questions about other well-known cross-sectional anomalies - if the PEAD has reversed, perhaps others have as well!

Data

In order to evaluate this strategy, I need three key pieces of data:

1. Each day, I need to collect relevant information on stocks that are 1 business day away from an earnings announcement
2. For each stock in (1), I need to compute its forward returns over the following 3 days (see the timing sketch after this list)
3. For each calendar quarter in our sample period, I must calculate the decile cutoffs for earnings surprises.
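
To keep the timing of items (1) and (2) straight, here's a small illustration of how the pieces line up around an announcement (the dates are hypothetical and simply mirror the MON example that appears later in the notebook):

import pandas as pd
from pandas.tseries.offsets import BDay

announcement = pd.Timestamp('2013-01-08')   # hypothetical earnings announcement date
snapshot_day = announcement - BDay(1)       # 2013-01-07: when the first pipeline records the stock's info
returns_day  = announcement + BDay(2)       # 2013-01-10: when the second pipeline picks up its post-announcement returns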

Let's get started! First things first - we need to import all the necessary functions and libraries. We'll be using data from the Zacks Earnings Surprises and EventVestor Earnings Calendar datasets.

Currently, Pipeline doesn't support batching. In order to ensure we can generate all the data we need without going over the research environment's memory limits, we need to run our pipelines in chunks. Let's make a function for doing that.

In [15]:
# Split so we don't go over memory limits
def split_run_pipeline(pipeline, start, end, segments):
    # Evenly spaced chunk boundaries between start and end
    dts = np.arange(start, end, (end - start) / segments)
    if len(dts) == segments:
        dts = np.append(dts, end)
    # Each chunk ends one day before the next boundary, so no day is queried twice
    return pd.concat(map(
        lambda i: run_pipeline(pipeline, dts[i], dts[i + 1] - pd.Timedelta('1d')),
        range(segments)
    ))

Now, we need to create a mask which screens out stocks that don't fit our universe criteria. In the original paper, Milian found the strongest relationship between earnings surprises and subsequent returns among large market-cap stocks with actively traded options. Since Quantopian doesn't yet have options data, we can use the Q500US universe as a substitute (large-cap stocks like the ones found in the Q500US will most likely have actively traded exchange-listed options).

In [16]:
base_universe = Q500US()

Using this mask, we can gather the data for the first item on our list. On each day, we want the stocks which have earnings announcements the next day, along with their previous earnings surprise values and other potentially relevant information such as 30-day average dollar volume, sector, and price. We can easily do this with Pipeline. Note that LagEaSurp corresponds to the earnings surprise of the stock in the previous quarter.

In [17]:
def create_positions_pipeline():
    # filters
    has_earnings_announcement = BusinessDaysUntilNextEarnings().eq(1)
    
    # factors
    price = USEquityPricing.open.latest
    lag_ea_surp = EarningsSurprises.eps_pct_diff_surp.latest
    dollar_volume = AverageDollarVolume(window_length=30)
    sector = Sector()

    return Pipeline(columns= {
        'LagEaSurp': lag_ea_surp,
        'Current Price': price, 
        'Earnings Announcement': EarningsCalendar.next_announcement.latest,
        'Average Dollar Volume': dollar_volume,
        'Sector': sector}, 
                    screen=has_earnings_announcement & base_universe)

Cool. Now let's set the start and end dates for our sample period, along with the number of chunks we want the pipeline to run in. In the interest of time, let's just run it over 1 month with 1 split initially.

In [18]:
START = pd.Timestamp('01-01-2013')
END = pd.Timestamp('02-01-2013')
SPLITS = 1

Now we can instantiate the pipeline and run it over our sample period! When you run a pipeline, it returns a pandas DataFrame. We use dropna() on the DataFrame to drop any null values (from gaps in our dataset).

In [19]:
positions_pipeline = create_positions_pipeline()
positions_data = split_run_pipeline(positions_pipeline, START, END, SPLITS)
positions_data.dropna(inplace=True)

Let's take a peek at what the data in our pandas DataFrame looks like.

In [20]:
positions_data.head(3)
Out[20]:
Average Dollar Volume Current Price Earnings Announcement LagEaSurp Sector
2013-01-07 00:00:00+00:00 Equity(22140 [MON]) 1.839306e+08 95.56 2013-01-08 0.00 101
Equity(24829 [APOL]) 3.839033e+07 21.80 2013-01-08 6.12 205
2013-01-08 00:00:00+00:00 Equity(24873 [STZ]) 4.164929e+07 36.52 2013-01-09 31.48 205

As you can see, we have a multi-indexed DataFrame. On each day, we can see the equities with an Earnings Announcement the next day, along with their LagEaSurp.
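
For instance (purely illustrative, using one of the dates from the sample above), we can pull out every stock flagged on a given day by slicing the first level of the MultiIndex:

# All stocks flagged on 2013-01-07, i.e. those announcing earnings on 2013-01-08
positions_data.xs(pd.Timestamp('2013-01-07', tz='UTC'))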

Now we can move on to the next item in our data collection list - generating the returns 2 days after the earnings announcement for each stock in our table.

You might be wondering why we couldn't generate the returns data in the previous Pipeline. The reason is that a Pipeline can only look backward in time. On each day, our Pipeline captures a point-in-time snapshot of these stocks. It would be impossible for it to see the returns of those same stocks 3 days in the future in the context of the same pipeline!

So, let's create another pipeline, which captures stocks in our base universe 2 business days after their latest earnings announcement. In this pipeline, we'll keep track of the Earnings Announcement date so we can align our results with the previous DataFrame. We'll also capture each stock's returns over the past 3 days, which track the movement of the stock price during our hold period.

In [21]:
def create_returns_pipeline():
    three_day_returns = Returns(mask=Q500US(), inputs=[USEquityPricing.open], window_length=3)
    had_earnings = BusinessDaysSincePreviousEarnings().eq(2)
    return Pipeline(columns={
        'Returns 2 days after': three_day_returns,
        'Earnings Announcement': EarningsCalendar.previous_announcement.latest
                    },
                    screen=had_earnings & base_universe)

Let's generate the data and have a peek.

In [22]:
returns_pipeline =  create_returns_pipeline()
returns_data = split_run_pipeline(returns_pipeline, START, END, SPLITS)
returns_data.head(3)
Out[22]:
Earnings Announcement Returns 2 days after
2013-01-07 00:00:00+00:00 Equity(2760 [FDO]) 2013-01-03 -0.123567
2013-01-08 00:00:00+00:00 Equity(41462 [MOS]) 2013-01-04 0.010936
2013-01-10 00:00:00+00:00 Equity(2 [AA]) 2013-01-08 0.005388

On each day, we have information about stocks that are 2 business days past their latest earnings announcement.

Now, we need to add the Returns 2 days after column to our first DataFrame. The problem, however, is that we cannot use the default indexes to combine the DataFrames. In our initial DataFrame, we want the 3-day return information in the same row as the stock information. In other words, we actually want to combine the frames based on Earnings Announcement and Equity, instead of the default DateTime + Equity.

We can do this by reindexing both dataframes before we combine them.

First, let's do the positions_data frame. We start by resetting the current index. After the reset, the index levels tracking the date and the equity turn into columns named 'level_0' and 'level_1' respectively, so we rename them.

In [23]:
positions_data.reset_index(inplace=True)
positions_data.rename(columns= {
                 'level_0': 'Current Day',
                 'level_1': 'Equity'
                 },
                 inplace=True)
positions_data.head(2)
Out[23]:
Current Day Equity Average Dollar Volume Current Price Earnings Announcement LagEaSurp Sector
0 2013-01-07 00:00:00+00:00 Equity(22140 [MON]) 1.839306e+08 95.56 2013-01-08 0.00 101
1 2013-01-07 00:00:00+00:00 Equity(24829 [APOL]) 3.839033e+07 21.80 2013-01-08 6.12 205

Now, we can set a MultiIndex using Earnings Announcement first and Equity second.

In [24]:
positions_data.set_index(['Earnings Announcement', 'Equity'], inplace=True)
positions_data.head(2)
Out[24]:
Current Day Average Dollar Volume Current Price LagEaSurp Sector
Earnings Announcement Equity
2013-01-08 Equity(22140 [MON]) 2013-01-07 00:00:00+00:00 1.839306e+08 95.56 0.00 101
Equity(24829 [APOL]) 2013-01-07 00:00:00+00:00 3.839033e+07 21.80 6.12 205

We can repeat the process on the returns data.

In [25]:
returns_data.reset_index(inplace=True)

# We don't care about the date used as the previous index
del returns_data['level_0']
returns_data.rename(columns ={'level_1': 'Equity'}, inplace=True)
returns_data.set_index(['Earnings Announcement', 'Equity'], inplace=True)
returns_data.head(2)
Out[25]:
Returns 2 days after
Earnings Announcement Equity
2013-01-03 Equity(2760 [FDO]) -0.123567
2013-01-04 Equity(41462 [MOS]) 0.010936

Our data is nicely formatted and ready to be joined! We'll combine positions_data and returns_data into a DataFrame called stock_data.

In [26]:
stock_data = returns_data.join(positions_data)
stock_data.dropna(inplace=True)
stock_data.head(2)
Out[26]:
Returns 2 days after Current Day Average Dollar Volume Current Price LagEaSurp Sector
Earnings Announcement Equity
2013-01-08 Equity(22140 [MON]) 0.024255 2013-01-07 00:00:00+00:00 1.839306e+08 95.56 0.00 101.0
Equity(24829 [APOL]) -0.130672 2013-01-07 00:00:00+00:00 3.839033e+07 21.80 6.12 205.0

Great. The final missing piece of data is the decile each LagEaSurp value falls in. Normally, we would've been able to calculate the deciles in our original pipeline through a classifier. However, if we calculated the deciles in the original pipeline, they would be recomputed every day. This means that, unlike the original study, our deciles would not be constant within each calendar quarter. So, we have to make an extra pipeline for calculating the deciles on a per-quarter basis. Let's start with the pipeline.

In [27]:
def create_deciles_pipeline():    
    mask = Q500US()
    lag_e_surp = EarningsSurprises.eps_pct_diff_surp.latest
    decile = lag_e_surp.deciles(mask=mask)
    date = EarningsSurprises.asof_date.latest
    
    return Pipeline(columns= {
        'LagESurp Decile': decile,
        'LagESurp': lag_e_surp,
        'Earnings Announcement Date' : date
                    },
                    screen=mask)

Now, we need to make some functions which will help us populate the deciles per quarter. First, we need a function which runs the pipeline on a quarterly basis. This function will run our pipeline for one day at the start of every quarter between START and END.

In [28]:
def run_pipeline_freq(start, end, pipeline):
    '''
    Runs the pipeline for a single day at the start of each
    calendar quarter between start and end.
    '''
    quarters = pd.date_range(start, end, freq='QS')
    quarters = quarters.tolist()
    # If start isn't itself a quarter start, prepend the start of its quarter
    if start not in quarters:
        start = start - pd.tseries.offsets.QuarterBegin(startingMonth=1)
        quarters.insert(0, start)

    return pd.concat(map(
        lambda i: run_pipeline(pipeline, quarters[i], quarters[i]),
        range(len(quarters))
    ))
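
As a quick illustration (not part of the original run), freq='QS' generates quarter-start dates, so over a full year the deciles pipeline would only ever be run on four days:

import pandas as pd

pd.date_range('2013-01-01', '2013-12-31', freq='QS')
# -> 2013-01-01, 2013-04-01, 2013-07-01, 2013-10-01
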
In [29]:
decile_data = run_pipeline_freq(START, END, create_deciles_pipeline())
decile_data = decile_data.dropna()
decile_data.head(2)
Out[29]:
Earnings Announcement Date LagESurp LagESurp Decile
2013-01-02 00:00:00+00:00 Equity(24 [AAPL]) 2012-10-26 -2.03 2
Equity(62 [ABT]) 2012-10-18 1.56 4

Next, we need to compute the decile cutoffs for each quarter in our dataframe. For this, we'll use two functions: one which returns a string representation of a quarter, and another which builds a dict of the form {Quarter: decile_cutoffs}.

In [30]:
def get_quarter(date):
    date = pd.Timestamp(date)
    return "Q{}{}".format(str(date.quarter), str(date.year))

def compute_deciles(decile_data):
    decile_data = decile_data.reset_index()
    # 'level_0' holds the dates the pipeline was run on (one per quarter)
    quarters = decile_data['level_0'].unique()
    quarterly_deciles = {}
    
    for quarter in quarters:
        quarter_str = get_quarter(quarter)
        quarter_data = decile_data[decile_data['level_0'] == quarter]
        deciles = []
        # Upper cutoff for deciles 0-8; the top decile needs no upper bound
        for decile in range(0, 9):
            deciles.append(quarter_data[quarter_data['LagESurp Decile'] == decile]['LagESurp'].max())

        quarterly_deciles[quarter_str] = deciles
        
    return quarterly_deciles

So for any given quarter in our sample period, we have the array of decile cutoffs.

In [31]:
deciles = compute_deciles(decile_data)
print deciles['Q12013']
[-10.17, -2.6299999999999999, 0.0, 1.1599999999999999, 2.3300000000000001, 4.0, 6.8200000000000003, 11.94, 20.0]

Now, let's walk through the original stock_data DataFrame and, for each row, classify the stock's earnings surprise into the decile given by the quarterly cutoffs we've just generated. Before we do that, we'll add a Quarter column to our stock_data DataFrame to make things simpler.

In [32]:
# add quarter column
stock_data['Quarter'] = [get_quarter(x) for x in stock_data['Current Day']]

# add decile for each stock in our dataframe
for idx, row in stock_data.iterrows():
    cutoffs = deciles[row['Quarter']]
    lag_ea_surp = row['LagEaSurp']
    dec = bisect.bisect_left(cutoffs, lag_ea_surp)
    stock_data.set_value(idx, 'Decile', dec)
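
As a quick sanity check, bisect_left returns the index of the first cutoff that is greater than or equal to the surprise (or 9 if the surprise is above every cutoff), which is exactly the decile we want. Using the Q12013 cutoffs printed earlier:

import bisect

cutoffs = deciles['Q12013']
bisect.bisect_left(cutoffs, 6.12)   # -> 6, matching APOL's Decile in the table below
bisect.bisect_left(cutoffs, 0.00)   # -> 2, matching MON
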
In [33]:
stock_data = stock_data.reset_index()
stock_data.head(2)
Out[33]:
Earnings Announcement Equity Returns 2 days after Current Day Average Dollar Volume Current Price LagEaSurp Sector Quarter Decile
0 2013-01-08 Equity(22140 [MON]) 0.024255 2013-01-07 00:00:00+00:00 1.839306e+08 95.56 0.00 101.0 Q12013 2.0
1 2013-01-08 Equity(24829 [APOL]) -0.130672 2013-01-07 00:00:00+00:00 3.839033e+07 21.80 6.12 205.0 Q12013 6.0
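
Just to show that the table is ready for the kinds of cuts the analysis will need (the analysis itself is saved for the follow-up notebook), a one-liner like this would give the average post-announcement return within each surprise decile:

stock_data.groupby('Decile')['Returns 2 days after'].mean()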

And there it is! That is all the data we set out to collect. As you can see, sometimes it isn't possible to generate all the data required for an analysis in one pipeline in the research environment. However, you can get creative with how and when you harvest the data from the available datasets. My analysis of this strategy, using the data generated in this notebook, will be covered in an upcoming notebook.

This research is published in partnership with Quantpedia, an online resource for discovering new trading ideas.

You can view the full Quantpedia series in the library along with other research and strategies.

Whitepaper author: Jonathan A. Milian

Whitepaper source: https://business.fiu.edu/academic-departments/accounting/pdf/Milian-Overreacting-to-History-March-2013.pdf

In [34]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
import scipy
import alphalens

from zipline.utils import tradingcalendar

from quantopian.pipeline import CustomFactor, Pipeline
from quantopian.research import run_pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import AverageDollarVolume
from quantopian.pipeline.factors import Returns
from quantopian.pipeline.filters.morningstar import default_us_equity_universe_mask
from quantopian.pipeline.filters.morningstar import Q500US, Q1500US
from quantopian.pipeline.classifiers.morningstar import Sector

from quantopian.pipeline.data.zacks import EarningsSurprises
from quantopian.pipeline.data.accern import alphaone
from quantopian.pipeline.factors.eventvestor import (
    BusinessDaysUntilNextEarnings,
    BusinessDaysSincePreviousEarnings
)
from quantopian.pipeline.data.eventvestor import EarningsCalendar

from odo import odo
import blaze as bz
from datetime import timedelta
from pytz import utc

import bisect