
PsychSignal Series: Research Design

Research design is a fundamental and often overlooked part of the algorithm creation process. It is the point where any grounded quant asks, "What is my universe?" and "What are my training/testing datasets?" These two simple questions lay a solid foundation from which to frame and validate results from factor research and backtesting.

Here's an overview of what you'll learn from this notebook:

  • How to break down a list of securities by liquidity baskets
  • How to set guidelines for your universe of securities based on capital base and data coverage
  • How to validate your universe constraints with your in-sample datasets.

And in terms of pacing, the Research Design section is where you are now:

  • Introduction - Examining the data. My goal here is to simply look at the dataset and understand what it looks like. I’ll be answering simple questions like, “How many stocks are covered?”; “Which sectors have the most coverage?”; and “What’s the distribution of sentiment scores?”. These are very basic but fundamentally important questions that lay the groundwork for all further development.
  • Research Design - Here, I’ll be setting up my environment for hypothesis testing and defining my in and out-of-sample datasets, both cross-sectionally and through liquidity thresholds.
  • Hypothesis Testing - This is where I’ll be setting up a number of different hypotheses for my data and testing them through event studies and cross-sectional studies. The Factor Tearsheet and Event Study notebooks will be used heavily. The goal is to develop an alpha factor to use for strategy creation.
  • Strategy Creation - After I’ve developed a hypothesis and seen that it holds up consistently over different liquidity and sector partitions in my in-sample dataset, I’ll finally begin the process of developing my trading strategy. I’ll be asking questions like “Is my factor strong enough by itself?” and “What is its correlation with other factors?”. Once these questions have been answered, the trading strategy will be constructed and I’ll move on to the next section.
  • Out-Of-Sample Test - Here, my main goal is to verify the work of steps 1~4 with my out-of-sample dataset. It will involve repeating many of the steps in 2~4 as well as the use of the backtester (notice how only step 5 involves the backtester).
In [2]:
# Importing Data
from __future__ import division
from random import randint
import numpy as np
import pandas as pd
import scipy as sp
import pyfolio as pf
import matplotlib.pyplot as plt
import seaborn as sns
from quantopian.pipeline import Pipeline
from quantopian.pipeline import CustomFactor
from quantopian.research import run_pipeline
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import Latest
from itertools import chain
from datetime import date
from blaze import by, merge
from odo import odo

SECTOR_NAMES = {
 101: 'Basic Materials',
 102: 'Consumer Cyclical',
 103: 'Financial Services',
 104: 'Real Estate',
 205: 'Consumer Defensive',
 206: 'Healthcare',
 207: 'Utilities',
 308: 'Communication Services',
 309: 'Energy',
 310: 'Industrials',
 311: 'Technology',
}

# Plotting colors
c = "#38BB86"

# Fundamentals for sector data
fundamentals = init_fundamentals()

# Importing our Data/Sample Version 
# PsychSignal's Twitter & StockTwits with Retweets sample is available from 24 Aug 2009 - 09 Jan 2016
from quantopian.interactive.data.psychsignal import aggregated_twitter_withretweets_stocktwits_free as dataset
# from quantopian.interactive.data.psychsignal import aggregated_twitter_withretweets_stocktwits as dataset

Analyzing Liquidity Baskets

In the previous notebook, I looked at PsychSignal's data through a number of different dimensions. One thing I didn't do was break the universe down by liquidity baskets. This is an important distinction, especially for any sort of social media or news sentiment analysis, because stock popularity and liquidity often go hand-in-hand: AAPL is one of the most liquid securities in the world, but you've probably never heard of EXAR.

So that's what I'm going to do now and use that information, along with everything from part 1, to guide my universe choices.

In [3]:
"""
Breaking down my datasets by year
"""
# Defining my yearly range
years = range(2011, 2016)
def get_dataset_for_year(dataset, year):
    start_date = pd.to_datetime(date(year, 1, 1))
    end_date = pd.to_datetime(date(year, 12, 31))
    dataset = dataset[dataset.asof_date >= start_date]
    dataset = dataset[dataset.asof_date <= end_date]
    return dataset

# Getting my `timed_datasets` which I'll use later on for different plots
timed_datasets = {year: get_dataset_for_year(dataset, year) for year in years}
In [5]:
"""
Using Pipeline to breakdown securities by liquidity baskets
"""

class Liquidity(CustomFactor):   
    """
    Get the average 252 day ADV
    """
    inputs = [USEquityPricing.volume, USEquityPricing.close] 
    window_length = 252

    def compute(self, today, assets, out, volume, close): 
        out[:] = (volume * close).mean(axis=0)

def get_liquidity_for_year(top_breakdown, bot_breakdown,
                           year):
    """
    Returns a DataFrame of stocks for the year belonging to
    a liquidity ranking of [bot_breakdown, top_breakdown)
    """
    liquidity = Liquidity()
    liquidity_rank = liquidity.rank(ascending=False)

    # Using a screen to only get the breakdowns we need
    ok_universe = (top_breakdown > liquidity_rank) & (bot_breakdown <= liquidity_rank)
    pipe = Pipeline()
    pipe.add(liquidity, 'liquidity')
    pipe.set_screen(ok_universe)
    
    # Only use the last day of the year since 252 trading days are already used
    get_date = '%s-12-31' % year
    factor = run_pipeline(pipe, start_date=get_date, end_date=get_date)
    
    # Rename for plotting purposes
    factor = factor.reset_index().drop(['level_0'], axis=1).rename(columns={'level_1': 'symbols'})
    factor['symbols'] = factor['symbols'].apply(lambda x: x.symbol)
    factor['basket'] = top_breakdown
    factor['year'] = year
    return factor

liquidity_baskets = {} # This will become a dict of {year: DataFrame}
liquidity_breakdowns = range(500, 8500, 500)
for year in years:
    # Collect each basket's DataFrame and concatenate per year, so a failed
    # pipeline run can't leave a stale `df` from a previous year behind
    year_baskets = []
    for breakdown in liquidity_breakdowns:
        try:
            basket = get_liquidity_for_year(breakdown, breakdown - 500,
                                            year)
        except Exception:
            continue
        year_baskets.append(basket)
    liquidity_baskets[year] = pd.concat(year_baskets, ignore_index=True)

Now that I have my liquidity baskets per year, let's see what they look like.

In [6]:
liquidity_baskets[2011]
Out[6]:
symbols liquidity basket year
0 AA 3.683450e+08 500 2011
1 AAPL 6.000194e+09 500 2011
2 ABT 3.677848e+08 500 2011
3 ABX 4.052596e+08 500 2011
4 ADSK 1.114378e+08 500 2011
5 ACI 1.358914e+08 500 2011
6 ADBE 1.711786e+08 500 2011
7 ADI 1.221025e+08 500 2011
8 ADM 1.716069e+08 500 2011
9 AEM 1.337565e+08 500 2011
10 AEP 1.204772e+08 500 2011
11 AET 1.571651e+08 500 2011
12 AFL 1.769785e+08 500 2011
13 AGN 1.409660e+08 500 2011
14 HES 2.502724e+08 500 2011
15 AIG 2.247664e+08 500 2011
16 ALU 1.143015e+08 500 2011
17 ALTR 1.958704e+08 500 2011
18 AMAT 2.129287e+08 500 2011
19 AMD 1.507803e+08 500 2011
20 TWX 2.546795e+08 500 2011
21 AMGN 3.342394e+08 500 2011
22 AON 1.040390e+08 500 2011
23 APA 3.365420e+08 500 2011
24 APC 2.936566e+08 500 2011
25 APD 1.053284e+08 500 2011
26 ATML 1.017627e+08 500 2011
27 ADP 1.315779e+08 500 2011
28 AVP 9.186228e+07 500 2011
29 AXP 3.311160e+08 500 2011
... ... ... ... ...
5423 WRB_PRACL 2.450992e+05 5500 2011
5424 RGSE 6.157818e+04 5500 2011
5425 HPJ 9.043201e+04 5500 2011
5426 PWND 1.126050e+05 5500 2011
5427 NNA 2.045247e+05 5500 2011
5428 CMM 3.809206e+04 5500 2011
5429 GPRC 6.179802e+04 5500 2011
5430 GAT_CL 2.394969e+05 5500 2011
5431 EDT_CL 2.430380e+05 5500 2011
5432 WRD_CL 1.308036e+05 5500 2011
5433 ALN 1.274535e+05 5500 2011
5434 ADUS 1.301136e+05 5500 2011
5435 OIBR_C 1.170264e+05 5500 2011
5436 SCU_CL 1.628167e+05 5500 2011
5437 EQCN_CL 2.143016e+05 5500 2011
5438 ABCD 1.106759e+05 5500 2011
5439 CCM 1.890817e+05 5500 2011
5440 AMCF 9.668279e+04 5500 2011
5441 TNGN 2.337293e+05 5500 2011
5442 PSCI 1.373911e+05 5500 2011
5443 PSCF 2.165876e+05 5500 2011
5444 PSCU 2.118109e+05 5500 2011
5445 EFM 1.514954e+05 5500 2011
5446 MITL 2.064919e+05 5500 2011
5447 LMNR 2.355890e+05 5500 2011
5448 GNMK 1.871564e+05 5500 2011
5449 EROCW 1.241631e+05 5500 2011
5450 ABAC 1.665183e+05 5500 2011
5451 FRF 2.321658e+05 5500 2011
5452 QADA 1.738989e+05 5500 2011

5453 rows × 4 columns

So now that I have code to assign liquidity baskets, I want to look at the number of observations per liquidity basket. This is an important step because, while the data has already been broken down by sector, there are different liquidity baskets within each sector. Looking at both the liquidity and sector breakdowns will help me make clear, well-grounded choices about the universe.

In [7]:
def get_avg_observations_per_security(dataset):
    """
    Count the number of PsychSignal observations (rows) per symbol
    """
    avg_per_security = by(dataset.symbol,
                          observations=dataset.asof_date.count())
    avg_per_security = odo(avg_per_security, pd.DataFrame)
    return avg_per_security

# Get number of observations per liquidity basket
obs_per_sec = {}
for year, data in timed_datasets.iteritems():
    df = get_avg_observations_per_security(data)
    df = pd.merge(df, liquidity_baskets[year], left_on=['symbol'], right_on=['symbols'])
    df = df.reset_index()
    sec_df = pd.pivot_table(df,
                       index='basket',
                       values='observations',
                       aggfunc=np.sum)
    obs_per_sec[year] = sec_df
In [9]:
pd.DataFrame(obs_per_sec).T.plot(kind='bar', figsize=(16, 12))
plt.title("Number of Observations Covered Per Basket")
plt.legend(loc='upper left')
plt.ylabel("Number of Observations")
plt.xlabel("Year")
Out[9]:
<matplotlib.text.Text at 0x7f1fb2361910>

The graph shows that the highest liquidity baskets get the most coverage from PsychSignal. This is not surprising. The relationship between popularity and stock liquidity makes intuitive sense. Popular stocks get covered more often in news/social media, people end up trading the stocks they know about, and the cycle continues.

What's important to take away from this is that the lower liquidity baskets end up getting less coverage from PsychSignal. Given that my end goal is a viable, practical trading strategy whose orders are filled consistently, I want stable PsychSignal coverage for the securities that I include in my universe. What good is a stock that has PsychSignal data for one day of the year and nothing for the rest? Having a universe of securities that are well covered by the signal I'm trying to evaluate (PsychSignal's data) is important in designing my algorithm.
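
To make "stable coverage" a little more concrete, here is a rough sketch (my own addition, reusing the helpers defined above) of how coverage could be quantified: for each symbol in 2014, count how many days of PsychSignal data it has relative to the ~252 trading days in a year. The choice of 2014 and the 50% cutoff are purely illustrative and aren't thresholds used later in this notebook.

# Rough coverage check (illustrative; 2014 and the 50% cutoff are arbitrary).
# Note: PsychSignal rows can fall on non-trading days, so 252 is only a proxy
# for how many days a symbol could have been covered.
obs_2014 = get_avg_observations_per_security(timed_datasets[2014])
obs_2014['coverage_ratio'] = obs_2014['observations'] / 252.0
well_covered = obs_2014[obs_2014['coverage_ratio'] >= 0.5]
print("%d of %d symbols have PsychSignal data on at least half of the year" % (
    len(well_covered), len(obs_2014)))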

With that in mind, it's time to define my universe.

Defining The Universe

The first step is to know my capital base. As a good starting point, I'm going to use $1,000,000. Defining a capital base early lets me set a minimum liquidity threshold. In trading terms, one million dollars isn't a lot of money, but sending an order for $500,000 of an illiquid security is only going to end in disaster. So to prevent order fulfillment issues, I'm going to set a minimum ADV (average daily dollar volume) for my universe.
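
As a back-of-the-envelope illustration of where a minimum ADV floor can come from (the 20-position book and the 0.5% participation cap below are my own assumptions, not constraints used elsewhere in this series; they happen to land exactly at the $10,000,000 threshold chosen below):

# Hypothetical sizing arithmetic behind a minimum ADV threshold.
capital_base = 1e6          # starting capital base ($1,000,000)
n_positions = 20            # assumed number of concurrent positions
participation_cap = 0.005   # assumed max order size as a fraction of ADV
min_adv = (capital_base / n_positions) / participation_cap
print(min_adv)              # 10000000.0 -> each $50,000 order stays at 0.5% of ADV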

Combining that liquidity requirement with the objective of clear and consistent PsychSignal coverage across securities, I'm going to create a set of filters for my universe that achieve these two goals:

  • Exclude all securities that belong to a liquidity basket greater than 3501.
  • Exclude all securities belonging to the Real Estate, Consumer Defensive, Communication Services, and Utilities Sectors (104, 205, 207, 308)
  • Exclude all securities with an ADV (average daily dollar volume) < $10,000,000

Just so I know that these filters won't narrow my universe too much, let's look at the number of securities that I'll be working with year-over-year.

In [10]:
def filter_liquidity(year):
    """
    ADV > 10,000,000
    Liquidity basket < 3501
    """
    liquidity = Liquidity()
    liquidity_rank = liquidity.rank(ascending=False)
    ok_universe = (liquidity_rank < 3501)

    pipe = Pipeline()
    pipe.add(liquidity, 'liquidity')
    pipe.add(liquidity_rank, 'liquidity_rank')
    pipe.set_screen(ok_universe)
    get_date = '%s-12-31' % year
    factor = run_pipeline(pipe, start_date=get_date, end_date=get_date)
    factor = factor.reset_index().drop(['level_0'], axis=1).rename(columns={'level_1': 'symbols'})
    factor['symbols'] = factor['symbols'].apply(lambda x: x.symbol)
    factor['year'] = year
    factor = factor[factor['liquidity'] > 1e7]  # ADV > $10,000,000
    return factor

def filter_sector(year):
    """
    Remove Real Estate, Consumer Defensive, Communication Services,
    and Utilities Sectors (104, 205, 207, 308)
    """
    sectors = get_fundamentals(query(fundamentals.asset_classification.morningstar_sector_code),
                               pd.to_datetime("%s-12-31" % year))
    sectors = sectors.T
    sectors.index = [s.symbol for s in sectors.index]
    sectors = sectors.reset_index().rename(columns={"index": "symbols"})
    sectors['year'] = year
    sectors = sectors[~sectors['morningstar_sector_code'].isin([104, 205, 207, 308])]
    return sectors


def get_security_count_after_filter(dataset, year):
    liquidity_universe = filter_liquidity(year)
    sector_universe = filter_sector(year)
    liq_and_sec = pd.merge(liquidity_universe,
                           sector_universe,
                           left_on=['symbols'],
                           right_on=['symbols'])
    dataset = dataset[dataset.symbol.isin(liq_and_sec['symbols'].unique().tolist())]
    dataset = dataset[dataset.asof_date <= pd.to_datetime("%s-12-31" % year)]
    dataset = dataset[dataset.asof_date >= pd.to_datetime("%s-01-01" % year)]
    return len(dataset.symbol.distinct())

In [12]:
yearly_securities = pd.Series(
    {year: get_security_count_after_filter(dataset, year) for year in years})
yearly_securities.name = "Securities Covered"

yearly_securities.plot(kind='bar',
                       color=c, legend=True, grid=False)
plt.title("Number of Securities Covered Per Year In Universe")
plt.xlabel("Year")
plt.ylabel("Number of Securities")
plt.legend()
Out[12]:
<matplotlib.legend.Legend at 0x7f1fb1f0ad50>

The number of securities is actually pretty constant throughout 2011 ~ 2015. In many ways, this is a good sign: if the coverage had changed dramatically from one year to the next, it would've been a lot harder to define my in and out-of-sample datasets - which I'll do now:

Defining In/Out-Of-Sample Datasets

Training and test datasets can be split in a number of different ways. For the sake of simplicity, I'm going to use time as my main criterion.

  • in-sample time periods: 2011 ~ 2014.
  • out-of-sample time periods: 2015 ~ 2016.

From now until the last part of this series, 2015 will be used only for out-of-sample backtesting.
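
For bookkeeping, that split can be expressed directly against the timed_datasets dictionary built earlier. This is just a naming convenience (the in_sample / out_of_sample names below are my own):

# Time-based split of the yearly PsychSignal slices.
in_sample_years = range(2011, 2015)   # 2011 ~ 2014
out_of_sample_years = [2015]          # held out until the final backtest

in_sample = {year: timed_datasets[year] for year in in_sample_years}
out_of_sample = {year: timed_datasets[year] for year in out_of_sample_years}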

At this point, I want to make sure that there's nothing drastically wrong with my universe constraints so I'm going to plot the number of observations per sector and the number of securities per sector as a simple check.

Remember that my universe is now filtered down by liquidity and sectors for the purpose of having good, consistent coverage over all securities. This means that my two plots should look similar to each other. If they don't, then I will need to decide if my universe constraints require additional tweaking.

In [15]:
obs_per_sec = {}
securities_per_sec = {}

# From [2011, 2015)
for year in range(2011, 2015):
    data = timed_datasets[year]
    
    # Get our filters
    liquidity_universe = filter_liquidity(year)
    sector_universe = filter_sector(year)
    liq_and_sec = pd.merge(liquidity_universe,
                           sector_universe,
                           left_on=['symbols'],
                           right_on=['symbols'])

    # Get the total number of observations
    df = get_avg_observations_per_security(data)
    df = pd.merge(df, liq_and_sec, left_on=['symbol'], right_on=['symbols'])
    df = df.dropna(subset=['morningstar_sector_code'])
    df['morningstar_sector_code'] = df['morningstar_sector_code'].apply(lambda x: SECTOR_NAMES[x])
    df = df.reset_index()

    # Create one pivot table summing up all observations
    sec_df = pd.pivot_table(df,
                       index='morningstar_sector_code',
                       values='observations',
                       aggfunc=np.sum)
    obs_per_sec[year] = sec_df
    
    # Also get number of securities
    sec_df = pd.pivot_table(df,
                       index='morningstar_sector_code',
                       values='index',
                       aggfunc=len)  # count rows (count_nonzero would skip a row whose index is 0)
    securities_per_sec[year] = sec_df
In [16]:
pd.DataFrame(obs_per_sec).T.plot(kind='bar', figsize=(16, 12))
plt.title("Number of Observations Post-Filter for In-Sample")
plt.legend(loc='best')
plt.ylabel("Number of Observations")
plt.xlabel("Year")
Out[16]:
<matplotlib.text.Text at 0x7f1fb1aba490>
In [17]:
pd.DataFrame(securities_per_sec).T.plot(kind='bar', figsize=(16, 12))
plt.title("Number of Securities Covered Post-Filter for In-Sample")
plt.legend(loc='best')
plt.ylabel("Number of Securities")
plt.xlabel("Year")
Out[17]:
<matplotlib.text.Text at 0x7f1fb2542610>
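
Beyond eyeballing the two plots, one extra numeric check (my own addition) is to divide the two pivot tables to get observations per covered security, by sector and year. If coverage is reasonably uniform across the filtered universe, these ratios should be of similar magnitude across sectors:

# Observations per covered security, by sector (rows) and year (columns).
# Assumes obs_per_sec and securities_per_sec are keyed identically, as built
# in the loop above.
obs_per_covered_security = pd.DataFrame(obs_per_sec) / pd.DataFrame(securities_per_sec)
obs_per_covered_security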

Great, there are no drastic differences between the two. The number of securities matches up fairly well with the number of observations, and I'm happy with my universe and training dataset for now. This means I can start formulating and testing hypotheses about my data, which will be the bulk of part three of this series.

So as a quick summary of what I did:

  • Created a few guidelines to design the starting point of my research:
    • My starting capital base ($1,000,000). This helped me decide the minimum level of liquidity (ADV) required per security: an average daily dollar volume of $10,000,000 over the past 252 trading days.
    • Looked at both the sector (in part 1) and liquidity breakdowns of the PsychSignal dataset to remove the sectors and liquidity baskets with the lowest data coverage, the goal being a universe of stocks with consistent data coverage.
  • Defined my universe of securities through a series of filters (consolidated into a single Pipeline sketch after this summary)
    • Exclude all securities that belong to a liquidity basket greater than 3501.
    • Exclude all securities belonging to the Real Estate, Consumer Defensive, Communication Services, and Utilities Sectors (104, 205, 207, 308)
    • Exclude all securities with an ADV (average daily dollar volume) < $10,000,000
  • Finally, I separated my in/out-of-sample datasets using time as the breakdown:
    • In-sample: 2011 ~ 2014
    • Out-of-sample: 2015 ~ 2016
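
As a closing sketch (my own consolidation, not a cell from the original notebook), the three universe filters could also be expressed as a single Pipeline screen. Note that this uses the morningstar_sector_code pipeline column rather than the get_fundamentals query used earlier, so treat it as an alternative formulation rather than the exact code from this series:

# Hypothetical consolidation of the universe filters into one Pipeline screen.
sector_code = morningstar.asset_classification.morningstar_sector_code.latest
excluded_sectors = (sector_code.eq(104) | sector_code.eq(205) |
                    sector_code.eq(207) | sector_code.eq(308))

liquidity = Liquidity()
universe_screen = ((liquidity.rank(ascending=False) < 3501)  # liquidity basket cutoff
                   & (liquidity > 1e7)                       # ADV > $10,000,000
                   & ~excluded_sectors)                      # drop the four excluded sectors

universe_pipe = Pipeline(columns={'liquidity': liquidity}, screen=universe_screen)
# e.g. run_pipeline(universe_pipe, '2014-12-31', '2014-12-31')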

Thanks for reading, and send any feedback to slee@quantopian.com.