
PsychSignal Series: Research Design

Research design is a fundamental and often overlooked part of the algorithm creation process. It is the point where any grounded quant asks, "What is my universe?" and "What are my training/testing datasets?" These two simple questions lay a solid foundation from which to frame and validate results from factor research and backtesting.

Here's an overview of what you'll learn from this notebook:

  • How to break down a list of securities by liquidity baskets
  • How to set guidelines for your universe of securities based on capital base and data coverage
  • How to validate your universe constraints with your in-sample datasets.

And in terms of pacing, the Research Design section is where you are now:

  • Introduction - Examining the data. My goal here is to simply look at the dataset and understand what it looks like. I’ll be answering simple questions like, “How many stocks are covered?”; “Which sectors have the most coverage?”; and “What’s the distribution of sentiment scores?”. These are very basic but fundamentally important questions that lay the groundwork for all further development.
  • Research Design - Here, I’ll be setting up my environment for hypothesis testing and defining my in and out-of-sample datasets, both cross-sectionally and through liquidity thresholds.
  • Hypothesis Testing - This is where I’ll be setting up a number of different hypotheses for my data and testing them through event studies and cross-sectional studies. The Factor Tearsheet and Event Study notebooks will be used heavily. The goal is to develop an alpha factor to use for strategy creation.
  • Strategy Creation - After I’ve developed a hypothesis and seen that it holds up consistently over different liquidity and sector partitions in my in-sample dataset, I’ll finally begin the process of developing my trading strategy. I’ll be asking questions like “Is my factor strong enough by itself?” and “What is its correlation with other factors?”. Once these questions have been answered, the trading strategy will be constructed and I’ll move on to the next section.
  • Out-Of-Sample Test - Here, my main goal is to verify the work of steps 1~4 with my out-of-sample dataset. It will involve repeating many of the steps in 2~4 as well as the use of the backtester (notice how only step 5 involves the backtester).
In [2]:
# Importing Data
from __future__ import division
from random import randint
import numpy as np
import pandas as pd
import scipy as sp
import pyfolio as pf
import matplotlib.pyplot as plt
import seaborn as sns
from quantopian.pipeline import Pipeline
from quantopian.pipeline import CustomFactor
from quantopian.research import run_pipeline
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import Latest
from itertools import chain
from datetime import date
from blaze import by, merge
from odo import odo

SECTOR_NAMES = {
 101: 'Basic Materials',
 102: 'Consumer Cyclical',
 103: 'Financial Services',
 104: 'Real Estate',
 205: 'Consumer Defensive',
 206: 'Healthcare',
 207: 'Utilities',
 308: 'Communication Services',
 309: 'Energy',
 310: 'Industrials',
 311: 'Technology',
}

# Plotting colors
c = "#38BB86"

# Fundamentals for sector data
fundamentals = init_fundamentals()

# Importing our Data/Sample Version 
# PsychSignal's Twitter & StockTwits with Retweets sample is available from 24 Aug 2009 - 09 Jan 2016
from quantopian.interactive.data.psychsignal import aggregated_twitter_withretweets_stocktwits_free as dataset
# from quantopian.interactive.data.psychsignal import aggregated_twitter_withretweets_stocktwits as dataset

Analyzing Liquidity Baskets

In the previous notebook, I looked at PsychSignal's data through a number of different dimensions. One thing I didn't do was break the universe down by liquidity baskets. This is an important distinction, especially for any sort of social media or news sentiment analysis, because stock popularity and liquidity often go hand-in-hand: AAPL is one of the most liquid securities in the world, but you've probably never heard of EXAR.

So that's what I'm going to do now and use that information, along with everything from part 1, to guide my universe choices.

In [3]:
"""
Breaking down my datasets by year
"""
# Defining my yearly range
years = range(2011, 2016)
def get_dataset_for_year(dataset, year):
    start_date = pd.to_datetime(date(year, 1, 1))
    end_date = pd.to_datetime(date(year, 12, 31))
    dataset = dataset[dataset.asof_date >= start_date]
    dataset = dataset[dataset.asof_date <= end_date]
    return dataset

# Getting my `timed_datasets` which I'll use later on for different plots
timed_datasets = {year: get_dataset_for_year(dataset, year) for year in years}
In [5]:
"""
Using Pipeline to breakdown securities by liquidity baskets
"""

class Liquidity(CustomFactor):   
    """
    Get the average 252 day ADV
    """
    inputs = [USEquityPricing.volume, USEquityPricing.close] 
    window_length = 252

    def compute(self, today, assets, out, volume, close): 
        out[:] = (volume * close).mean(axis=0)

def get_liquidity_for_year(top_breakdown, bot_breakdown,
                           year):
    """
    Returns a DataFrame of stocks for the year belonging to
    a liquidity ranking of [bot_breakdown, top_breakdown)
    """
    liquidity = Liquidity()
    liquidity_rank = liquidity.rank(ascending=False)

    # Using a screen to only get the breakdowns we need
    ok_universe = (top_breakdown > liquidity_rank) & (bot_breakdown <= liquidity_rank)
    pipe = Pipeline()
    pipe.add(liquidity, 'liquidity')
    pipe.set_screen(ok_universe)
    
    # Only use the last day of the year since 252 trading days are already used
    get_date = '%s-12-31' % year
    factor = run_pipeline(pipe, start_date=get_date, end_date=get_date)
    
    # Rename for plotting purposes
    factor = factor.reset_index().drop(['level_0'], axis=1).rename(columns={'level_1': 'symbols'})
    factor['symbols'] = factor['symbols'].apply(lambda x: x.symbol)
    factor['basket'] = top_breakdown
    factor['year'] = year
    return factor

liquidity_baskets = {} # This will become a dict of {year: DataFrame}
liquidity_breakdowns = range(500, 8500, 500)
for year in years:
    # Collect each basket's DataFrame and concatenate per year, so a failed
    # pipeline run can't leave a stale `df` from a previous year behind
    year_baskets = []
    for breakdown in liquidity_breakdowns:
        try:
            basket = get_liquidity_for_year(breakdown, breakdown - 500,
                                            year)
        except Exception:
            continue
        year_baskets.append(basket)
    liquidity_baskets[year] = pd.concat(year_baskets, ignore_index=True)

Now that I have my liquidity baskets per year, let's see what they look like.

In [6]:
liquidity_baskets[2011]
Out[6]:
symbols liquidity basket year
0 AA 3.683450e+08 500 2011
1 AAPL 6.000194e+09 500 2011
2 ABT 3.677848e+08 500 2011
3 ABX 4.052596e+08 500 2011
4 ADSK 1.114378e+08 500 2011
5 ACI 1.358914e+08 500 2011
6 ADBE 1.711786e+08 500 2011
7 ADI 1.221025e+08 500 2011
8 ADM 1.716069e+08 500 2011
9 AEM 1.337565e+08 500 2011
10 AEP 1.204772e+08 500 2011
11 AET 1.571651e+08 500 2011
12 AFL 1.769785e+08 500 2011
13 AGN 1.409660e+08 500 2011
14 HES 2.502724e+08 500 2011
15 AIG 2.247664e+08 500 2011
16 ALU 1.143015e+08 500 2011
17 ALTR 1.958704e+08 500 2011
18 AMAT 2.129287e+08 500 2011
19 AMD 1.507803e+08 500 2011
20 TWX 2.546795e+08 500 2011
21 AMGN 3.342394e+08 500 2011
22 AON 1.040390e+08 500 2011
23 APA 3.365420e+08 500 2011
24 APC 2.936566e+08 500 2011
25 APD 1.053284e+08 500 2011
26 ATML 1.017627e+08 500 2011
27 ADP 1.315779e+08 500 2011
28 AVP 9.186228e+07 500 2011
29 AXP 3.311160e+08 500 2011
... ... ... ... ...
5423 WRB_PRACL 2.450992e+05 5500 2011
5424 RGSE 6.157818e+04 5500 2011
5425 HPJ 9.043201e+04 5500 2011
5426 PWND 1.126050e+05 5500 2011
5427 NNA 2.045247e+05 5500 2011
5428 CMM 3.809206e+04 5500 2011
5429 GPRC 6.179802e+04 5500 2011
5430 GAT_CL 2.394969e+05 5500 2011
5431 EDT_CL 2.430380e+05 5500 2011
5432 WRD_CL 1.308036e+05 5500 2011
5433 ALN 1.274535e+05 5500 2011
5434 ADUS 1.301136e+05 5500 2011
5435 OIBR_C 1.170264e+05 5500 2011
5436 SCU_CL 1.628167e+05 5500 2011
5437 EQCN_CL 2.143016e+05 5500 2011
5438 ABCD 1.106759e+05 5500 2011
5439 CCM 1.890817e+05 5500 2011
5440 AMCF 9.668279e+04 5500 2011
5441 TNGN 2.337293e+05 5500 2011
5442 PSCI 1.373911e+05 5500 2011
5443 PSCF 2.165876e+05 5500 2011
5444 PSCU 2.118109e+05 5500 2011
5445 EFM 1.514954e+05 5500 2011
5446 MITL 2.064919e+05 5500 2011
5447 LMNR 2.355890e+05 5500 2011
5448 GNMK 1.871564e+05 5500 2011
5449 EROCW 1.241631e+05 5500 2011
5450 ABAC 1.665183e+05 5500 2011
5451 FRF 2.321658e+05 5500 2011
5452 QADA 1.738989e+05 5500 2011

5453 rows × 4 columns

So now that I have code to assign liquidity baskets, I want to look at the number of observations per liquidity basket. This is an important step because, while the data has already been broken down by sector, there are different liquidity baskets within each sector. Looking at both the liquidity and sector breakdowns will help me make clear, well-grounded choices about the universe.

In [7]:
def get_avg_observations_per_security(dataset):
    """
    Count the number of PsychSignal observations (rows) per symbol
    """
    avg_per_security = by(dataset.symbol,
                          observations=dataset.asof_date.count())
    avg_per_security = odo(avg_per_security, pd.DataFrame)
    return avg_per_security

# Get number of observations per liquidity basket
obs_per_sec = {}
for year, data in timed_datasets.iteritems():
    df = get_avg_observations_per_security(data)
    df = pd.merge(df, liquidity_baskets[year], left_on=['symbol'], right_on=['symbols'])
    df = df.reset_index()
    sec_df = pd.pivot_table(df,
                       index='basket',
                       values='observations',
                       aggfunc=np.sum)
    obs_per_sec[year] = sec_df
In [9]:
pd.DataFrame(obs_per_sec).T.plot(kind='bar', figsize=(16, 12))
plt.title("Number of Observations Covered Per Basket")
plt.legend(loc='upper left')
plt.ylabel("Number of Observations")
plt.xlabel("Year")
Out[9]:
<matplotlib.text.Text at 0x7f1fb2361910>

The graph shows that the highest liquidity baskets get the most coverage from PsychSignal. This is not surprising. The relationship between popularity and stock liquidity makes intuitive sense. Popular stocks get covered more often in news/social media, people end up trading the stocks they know about, and the cycle continues.

What's important to take away from this is that the lower liquidity baskets end up getting less coverage from PsychSignal. Given that my end goal is a viable, practical trading strategy whose orders are filled consistently, I want stable PsychSignal coverage for the securities that I include in my universe. What good is a stock that has PsychSignal data for one day of the year and nothing for the rest? Having a universe of securities that are well covered by the signal I'm trying to evaluate (PsychSignal's data) is important in designing my algorithm.
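
To make "stable coverage" a little more concrete, here is a rough sketch (my own addition, reusing the helpers defined above) of how coverage could be quantified: for each symbol in 2014, count how many days of PsychSignal data it has relative to the ~252 trading days in a year. The choice of 2014 and the 50% cutoff are purely illustrative and aren't thresholds used later in this notebook.

# Rough coverage check (illustrative; 2014 and the 50% cutoff are arbitrary).
# Note: PsychSignal rows can fall on non-trading days, so 252 is only a proxy
# for how many days a symbol could have been covered.
obs_2014 = get_avg_observations_per_security(timed_datasets[2014])
obs_2014['coverage_ratio'] = obs_2014['observations'] / 252.0
well_covered = obs_2014[obs_2014['coverage_ratio'] >= 0.5]
print("%d of %d symbols have PsychSignal data on at least half of the year" % (
    len(well_covered), len(obs_2014)))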

With that in mind, it's time to define my universe.

Defining The Universe

The first step is to know my capital base. As a good starting point, I'm going to use $1,000,000. Defining a capital base early lets me set a minimum liquidity threshold. In trading terms, one million dollars isn't a lot of money, but sending an order for $500,000 of an illiquid security is only going to end in disaster. So to prevent order fulfillment issues, I'm going to set a minimum ADV (average daily dollar volume) for my universe.
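
As a back-of-the-envelope illustration of where a minimum ADV floor can come from (the 20-position book and the 0.5% participation cap below are my own assumptions, not constraints used elsewhere in this series; they happen to land exactly at the $10,000,000 threshold chosen below):

# Hypothetical sizing arithmetic behind a minimum ADV threshold.
capital_base = 1e6          # starting capital base ($1,000,000)
n_positions = 20            # assumed number of concurrent positions
participation_cap = 0.005   # assumed max order size as a fraction of ADV
min_adv = (capital_base / n_positions) / participation_cap
print(min_adv)              # 10000000.0 -> each $50,000 order stays at 0.5% of ADV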

Combining that liquidity requirement with the objective of clear and consistent PsychSignal coverage across securities, I'm going to create a set of filters for my universe that achieve these two goals:

  • Exclude all securities that belong to a liquidity basket greater than 3501.
  • Exclude all securities belonging to the Real Estate, Consumer Defensive, Communication Services, and Utilities Sectors (104, 205, 207, 308)
  • Exclude all securities with an ADV (average daily dollar volume) < $10,000,000

Just so I know that these filters won't narrow my universe too much, let's look at the number of securities that I'll be working with year-over-year.

In [10]:
def filter_liquidity(year):
    """
    ADV > 10,000,000
    Liquidity basket < 3501
    """
    liquidity = Liquidity()
    liquidity_rank = liquidity.rank(ascending=False)
    ok_universe = (liquidity_rank < 3501)

    pipe = Pipeline()
    pipe.add(liquidity, 'liquidity')
    pipe.add(liquidity_rank, 'liquidity_rank')
    pipe.set_screen(ok_universe)
    get_date = '%s-12-31' % year
    factor = run_pipeline(pipe, start_date=get_date, end_date=get_date)
    factor = factor.reset_index().drop(['level_0'], axis=1).rename(columns={'level_1': 'symbols'})
    factor['symbols'] = factor['symbols'].apply(lambda x: x.symbol)
    factor['year'] = year
    factor = factor[factor['liquidity'] > 1e7]  # ADV > $10,000,000
    return factor

def filter_sector(year):
    """
    Remove Real Estate, Consumer Defensive, Communication Services,
    and Utilities Sectors (104, 205, 207, 308)
    """
    sectors = get_fundamentals(query(fundamentals.asset_classification.morningstar_sector_code),
                               pd.to_datetime("%s-12-31" % year))
    sectors = sectors.T
    sectors.index = [s.symbol for s in sectors.index]
    sectors = sectors.reset_index().rename(columns={"index": "symbols"})
    sectors['year'] = year
    sectors = sectors[~sectors['morningstar_sector_code'].isin([104, 205, 207, 308])]
    return sectors


def get_security_count_after_filter(dataset, year):
    liquidity_universe = filter_liquidity(year)
    sector_universe = filter_sector(year)
    liq_and_sec = pd.merge(liquidity_universe,
                           sector_universe,
                           left_on=['symbols'],
                           right_on=['symbols'])
    dataset = dataset[dataset.symbol.isin(liq_and_sec['symbols'].unique().tolist())]
    dataset = dataset[dataset.asof_date <= pd.to_datetime("%s-12-31" % year)]
    dataset = dataset[dataset.asof_date >= pd.to_datetime("%s-01-01" % year)]
    return len(dataset.symbol.distinct())

In [12]:
yearly_securities = pd.Series(
    {year: get_security_count_after_filter(dataset, year) for year in years})
yearly_securities.name = "Securities Covered"

yearly_securities.plot(kind='bar',
                       color=c, legend=True, grid=False)
plt.title("Number of Securities Covered Per Year In Universe")
plt.xlabel("Year")
plt.ylabel("Number of Securities")
plt.legend()
Out[12]:
<matplotlib.legend.Legend at 0x7f1fb1f0ad50>

The number of securities is actually pretty constant throughout 2011 ~ 2015. In many ways, this is a good sign: if the coverage had changed dramatically from one year to the next, it would've been a lot harder to define my in and out-of-sample datasets - which I'll do now:

Defining In/Out-Of-Sample Datasets

Training and test datasets can be split in a number of different ways. For the sake of simplicity, I'm going to use time as my main criterion.

  • in-sample time periods: 2011 ~ 2014.
  • out-of-sample time periods: 2015 ~ 2016.

From now until the last part of this series, 2015 will be used only for out-of-sample backtesting.
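
For bookkeeping, that split can be expressed directly against the timed_datasets dictionary built earlier. This is just a naming convenience (the in_sample / out_of_sample names below are my own):

# Time-based split of the yearly PsychSignal slices.
in_sample_years = range(2011, 2015)   # 2011 ~ 2014
out_of_sample_years = [2015]          # held out until the final backtest

in_sample = {year: timed_datasets[year] for year in in_sample_years}
out_of_sample = {year: timed_datasets[year] for year in out_of_sample_years}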

At this point, I want to make sure that there's nothing drastically wrong with my universe constraints so I'm going to plot the number of observations per sector and the number of securities per sector as a simple check.

Remember that my universe is now filtered down by liquidity and sectors for the purpose of having good, consistent coverage over all securities. This means that my two plots should look similar to each other. If they don't, then I will need to decide if my universe constraints require additional tweaking.

In [15]:
obs_per_sec = {}
securities_per_sec = {}

# From [2011, 2015)
for year in range(2011, 2015):
    data = timed_datasets[year]
    
    # Get our filters
    liquidity_universe = filter_liquidity(year)
    sector_universe = filter_sector(year)
    liq_and_sec = pd.merge(liquidity_universe,
                           sector_universe,
                           left_on=['symbols'],
                           right_on=['symbols'])

    # Get the total number of observations
    df = get_avg_observations_per_security(data)
    df = pd.merge(df, liq_and_sec, left_on=['symbol'], right_on=['symbols'])
    df = df.dropna(subset=['morningstar_sector_code'])
    df['morningstar_sector_code'] = df['morningstar_sector_code'].apply(lambda x: SECTOR_NAMES[x])
    df = df.reset_index()

    # Create one pivot table summing up all observations
    sec_df = pd.pivot_table(df,
                       index='morningstar_sector_code',
                       values='observations',
                       aggfunc=np.sum)
    obs_per_sec[year] = sec_df
    
    # Also get number of securities
    sec_df = pd.pivot_table(df,
                       index='morningstar_sector_code',
                       values='index',
                       aggfunc=len)  # count rows (count_nonzero would skip a row whose index is 0)
    securities_per_sec[year] = sec_df
In [16]:
pd.DataFrame(obs_per_sec).T.plot(kind='bar', figsize=(16, 12))
plt.title("Number of Observations Post-Filter for In-Sample")
plt.legend(loc='best')
plt.ylabel("Number of Observations")
plt.xlabel("Year")
Out[16]:
<matplotlib.text.Text at 0x7f1fb1aba490>
In [17]:
pd.DataFrame(securities_per_sec).T.plot(kind='bar', figsize=(16, 12))
plt.title("Number of Securities Covered Post-Filter for In-Sample")
plt.legend(loc='best')
plt.ylabel("Number of Securities")
plt.xlabel("Year")
Out[17]:
<matplotlib.text.Text at 0x7f1fb2542610>
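
Beyond eyeballing the two plots, one extra numeric check (my own addition) is to divide the two pivot tables to get observations per covered security, by sector and year. If coverage is reasonably uniform across the filtered universe, these ratios should be of similar magnitude across sectors:

# Observations per covered security, by sector (rows) and year (columns).
# Assumes obs_per_sec and securities_per_sec are keyed identically, as built
# in the loop above.
obs_per_covered_security = pd.DataFrame(obs_per_sec) / pd.DataFrame(securities_per_sec)
obs_per_covered_security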

Great, there are no drastic differences between the two. The number of securities matches up fairly well with the number of observations, and I'm happy with my universe and training dataset for now. This means I can start formulating and testing hypotheses about my data, which will be the bulk of part three of this series.

So as a quick summary of what I did:

  • Created a few guidelines to design the starting point of my research:
    • My starting capital base ($1,000,000). This helped me decide the minimum level of liquidity (ADV) required per security: an average daily dollar volume of $10,000,000 over the past 252 trading days.
    • Looked at both the sector (in part 1) and liquidity breakdowns of the PsychSignal dataset to remove the sectors and liquidity baskets with the lowest data coverage, the goal being a universe of stocks with consistent data coverage.
  • Defined my universe of securities through a series of filters (consolidated into a single Pipeline sketch after this summary)
    • Exclude all securities that belong to a liquidity basket greater than 3501.
    • Exclude all securities belonging to the Real Estate, Consumer Defensive, Communication Services, and Utilities Sectors (104, 205, 207, 308)
    • Exclude all securities with an ADV (average daily dollar volume) < $10,000,000
  • Finally, I separated my in/out-of-sample datasets using time as the breakdown:
    • In-sample: 2011 ~ 2014
    • Out-of-sample: 2015 ~ 2016
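
As a closing sketch (my own consolidation, not a cell from the original notebook), the three universe filters could also be expressed as a single Pipeline screen. Note that this uses the morningstar_sector_code pipeline column rather than the get_fundamentals query used earlier, so treat it as an alternative formulation rather than the exact code from this series:

# Hypothetical consolidation of the universe filters into one Pipeline screen.
sector_code = morningstar.asset_classification.morningstar_sector_code.latest
excluded_sectors = (sector_code.eq(104) | sector_code.eq(205) |
                    sector_code.eq(207) | sector_code.eq(308))

liquidity = Liquidity()
universe_screen = ((liquidity.rank(ascending=False) < 3501)  # liquidity basket cutoff
                   & (liquidity > 1e7)                       # ADV > $10,000,000
                   & ~excluded_sectors)                      # drop the four excluded sectors

universe_pipe = Pipeline(columns={'liquidity': liquidity}, screen=universe_screen)
# e.g. run_pipeline(universe_pipe, '2014-12-31', '2014-12-31')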

Thanks for reading, and send any feedback to slee@quantopian.com.