Research design is a fundamental and often overlooked part of the algorithm creation process. It is the point at which any grounded quant asks, "What is my universe?" and "What is my training/testing dataset?" These two simple questions lay a solid framework for framing and validating the results of factor research and backtesting.
Here's an overview of what you'll learn from this notebook:
And in terms of pacing, the bolded section is where you are now:
# Importing Data
from __future__ import division
from random import randint
import numpy as np
import pandas as pd
import scipy as sp
import pyfolio as pf
import matplotlib.pyplot as plt
import seaborn as sns
from quantopian.pipeline import Pipeline
from quantopian.pipeline import CustomFactor
from quantopian.research import run_pipeline
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import Latest
from itertools import chain
from datetime import date
from blaze import by, merge
from odo import odo
SECTOR_NAMES = {
    101: 'Basic Materials',
    102: 'Consumer Cyclical',
    103: 'Financial Services',
    104: 'Real Estate',
    205: 'Consumer Defensive',
    206: 'Healthcare',
    207: 'Utilities',
    308: 'Communication Services',
    309: 'Energy',
    310: 'Industrials',
    311: 'Technology',
}
# Plotting color
c = "#38BB86"
# Fundamentals for sector data
fundamentals = init_fundamentals()
# Importing our Data/Sample Version
# PsychSignal's Twitter & StockTwits with Retweets sample is available from 24 Aug 2009 - 09 Jan 2016
from quantopian.interactive.data.psychsignal import aggregated_twitter_withretweets_stocktwits_free as dataset
# from quantopian.interactive.data.psychsignal import aggregated_twitter_withretweets_stocktwits as dataset
In the previous notebook, I looked at PsychSignal's data through a number of different dimensions. One major thing I didn't do was break down the universe by liquidity baskets. This is an important distinction, especially for any sort of social media or news sentiment analysis, because stock popularity and liquidity often go hand-in-hand: AAPL is one of the most liquid securities in the world, but you've probably never heard of EXAR.
"""
Breaking down my datasets by year
"""
# Defining my yearly range
years = range(2011, 2016)
def get_dataset_for_year(dataset, year):
    start_date = pd.to_datetime(date(year, 1, 1))
    end_date = pd.to_datetime(date(year, 12, 31))
    dataset = dataset[dataset.asof_date >= start_date]
    dataset = dataset[dataset.asof_date <= end_date]
    return dataset
# Getting my `timed_datasets` which I'll use later on for different plots
timed_datasets = {year: get_dataset_for_year(dataset, year) for year in years}
"""
Using Pipeline to breakdown securities by liquidity baskets
"""
class Liquidity(CustomFactor):
    """
    Average daily dollar volume (ADV) over a trailing 252-day window
    """
    inputs = [USEquityPricing.volume, USEquityPricing.close]
    window_length = 252

    def compute(self, today, assets, out, volume, close):
        out[:] = (volume * close).mean(axis=0)
def get_liquidity_for_year(top_breakdown, bot_breakdown, year):
    """
    Returns a DataFrame of stocks for the year belonging to
    a liquidity ranking of [bot_breakdown, top_breakdown)
    """
    liquidity = Liquidity()
    liquidity_rank = liquidity.rank(ascending=False)

    # Using a screen to only get the breakdowns we need
    ok_universe = (top_breakdown > liquidity_rank) & (bot_breakdown <= liquidity_rank)

    pipe = Pipeline()
    pipe.add(liquidity, 'liquidity')
    pipe.set_screen(ok_universe)

    # Only use the last day of the year since 252 trading days are already used
    get_date = '%s-12-31' % year
    factor = run_pipeline(pipe, start_date=get_date, end_date=get_date)

    # Rename for plotting purposes
    factor = factor.reset_index().drop(['level_0'], axis=1).rename(columns={'level_1': 'symbols'})
    factor['symbols'] = factor['symbols'].apply(lambda x: x.symbol)
    factor['basket'] = top_breakdown
    factor['year'] = year
    return factor
liquidity_baskets = {}  # This will become a dict of {year: DataFrame}
liquidity_breakdowns = range(500, 8500, 500)

for year in years:
    for i, breakdown in enumerate(liquidity_breakdowns):
        try:
            basket = get_liquidity_for_year(breakdown, breakdown - 500, year)
        except:
            continue
        if i == 0:
            df = basket
        else:
            df = df.append(basket, ignore_index=True)
    liquidity_baskets[year] = df
Now that I have my liquidity baskets per year, let's see what they look like.
liquidity_baskets[2011]
So now that I have code to assign liquidity baskets, I want to look at the number of observations per liquidity basket. This is an important step: while the data has already been broken down into sectors, each sector contains securities across different liquidity baskets, so looking at both breakdowns will help me make clear, well-grounded choices about the universe.
def get_avg_observations_per_security(dataset):
    # Count the number of PsychSignal observations recorded for each symbol
    avg_per_security = by(dataset.symbol,
                          observations=dataset.asof_date.count())
    avg_per_security = odo(avg_per_security, pd.DataFrame)
    return avg_per_security
# Get number of observations per liquidity basket
obs_per_sec = {}
for year, data in timed_datasets.iteritems():
    df = get_avg_observations_per_security(data)
    df = pd.merge(df, liquidity_baskets[year], left_on=['symbol'], right_on=['symbols'])
    df = df.reset_index()
    # Sum observations within each liquidity basket
    sec_df = pd.pivot_table(df,
                            index='basket',
                            values='observations',
                            aggfunc=np.sum)
    obs_per_sec[year] = sec_df
pd.DataFrame(obs_per_sec).T.plot(kind='bar', figsize=(16, 12))
plt.title("Number of Observations Covered Per Basket")
plt.legend(loc='upper left')
plt.ylabel("Number of Observations")
plt.xlabel("Year")
The graph shows that the highest liquidity baskets get the most coverage from PsychSignal. This is not surprising. The relationship between popularity and stock liquidity makes intuitive sense. Popular stocks get covered more often in news/social media, people end up trading the stocks they know about, and the cycle continues.
What's important to take away from this is that liquidity baskets below a certain threshold get noticeably less coverage from PsychSignal. Given that my end goal is a viable, practical trading strategy where orders are filled consistently, I want stable PsychSignal coverage for the securities I include in my universe. What good is a stock that has PsychSignal data for only one day of the year and nothing the rest of the time? Having a universe of securities that are well covered by the signal I'm trying to evaluate (PsychSignal's data) is an important part of designing my algorithm.
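To make "well covered" concrete, here's a minimal sketch that reuses the blaze `by`/`odo` pattern from above to turn observation counts into a rough per-symbol coverage ratio. The choice of the 2014 slice and the 252-trading-day denominator are illustrative assumptions, not part of the universe definition.
# Rough coverage ratio per symbol: observation days divided by ~252 trading
# days (2014 is just an example year). Symbols well below 1.0 are exactly
# the ones I want my universe filters to exclude.
sample = timed_datasets[2014]
coverage = by(sample.symbol, observations=sample.asof_date.count())
coverage = odo(coverage, pd.DataFrame)
coverage['coverage_ratio'] = coverage['observations'] / 252.0
coverage.sort_values('coverage_ratio').head()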
With that in mind, it's time to define my universe.
The first step is to know my capital base. As a good starting point, I'm going to use $1,000,000. Defining a capital base early lets me set a minimum liquidity threshold. In trading terms, one million dollars isn't a lot of money, but sending an order for $500,000 of an illiquid security is only going to end in disaster. So to prevent order-fulfillment issues, I'm going to set a minimum ADV (average daily dollar volume) for my universe.
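As a quick sanity check on that threshold, here's a back-of-the-envelope sketch. The position count and maximum participation rate are hypothetical assumptions, chosen only to illustrate how a capital base can imply an ADV floor like the $10,000,000 one used below.
# Hypothetical sizing assumptions (not parameters used elsewhere in this notebook)
capital_base = 1e6         # the $1,000,000 starting point from above
num_positions = 100        # assume capital spread across ~100 names
max_participation = 0.001  # assume each order is at most 0.1% of a stock's ADV

order_size = capital_base / num_positions   # ~$10,000 per name
min_adv = order_size / max_participation    # => $10,000,000 minimum ADV
print(min_adv)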
Combining that minimum ADV with the objective of clear, consistent PsychSignal coverage, I'm going to create a set of filters for my universe that achieves these two goals:
Just so I know that these filters won't narrow my universe too much, let's look at the number of securities that I'll be working with year-over-year.
def filter_liquidity(year):
    """
    ADV > 10,000,000
    Liquidity basket < 3501
    """
    liquidity = Liquidity()
    liquidity_rank = liquidity.rank(ascending=False)
    ok_universe = (liquidity_rank < 3501)

    pipe = Pipeline()
    pipe.add(liquidity, 'liquidity')
    pipe.add(liquidity_rank, 'liquidity_rank')
    pipe.set_screen(ok_universe)

    get_date = '%s-12-31' % year
    factor = run_pipeline(pipe, start_date=get_date, end_date=get_date)
    factor = factor.reset_index().drop(['level_0'], axis=1).rename(columns={'level_1': 'symbols'})
    factor['symbols'] = factor['symbols'].apply(lambda x: x.symbol)
    factor['year'] = year
    # Keep only securities whose ADV exceeds $10,000,000
    factor = factor[factor['liquidity'] > 1e7]
    return factor
def filter_sector(year):
    """
    Remove Real Estate, Consumer Defensive, Utilities,
    and Communication Services sectors (104, 205, 207, 308)
    """
    sectors = get_fundamentals(query(fundamentals.asset_classification.morningstar_sector_code),
                               pd.to_datetime("%s-12-31" % year))
    sectors = sectors.T
    sectors.index = [s.symbol for s in sectors.index]
    sectors = sectors.reset_index().rename(columns={"index": "symbols"})
    sectors['year'] = year
    sectors = sectors[~sectors['morningstar_sector_code'].isin([104, 205, 207, 308])]
    return sectors
def get_security_count_after_filter(dataset, year):
    liquidity_universe = filter_liquidity(year)
    sector_universe = filter_sector(year)
    liq_and_sec = pd.merge(liquidity_universe,
                           sector_universe,
                           left_on=['symbols'],
                           right_on=['symbols'])
    dataset = dataset[dataset.symbol.isin(liq_and_sec['symbols'].unique().tolist())]
    dataset = dataset[dataset.asof_date <= pd.to_datetime("%s-12-31" % year)]
    dataset = dataset[dataset.asof_date >= pd.to_datetime("%s-01-01" % year)]
    return len(dataset.symbol.distinct())
yearly_securities = pd.Series(
    {year: get_security_count_after_filter(dataset, year) for year in years})
# Name the series before plotting so the legend picks up the label
yearly_securities.name = "Securities Covered"
yearly_securities.plot(kind='bar',
                       color=c, legend=True, grid=False)
plt.title("Number of Securities Covered Per Year In Universe")
plt.xlabel("Year")
plt.ylabel("Number of Securities")
plt.legend()
The number of securities is actually pretty constant from 2011 through 2015. In many ways this is a good sign: if the coverage had changed dramatically from one year to the next, it would have been much harder to define our in-sample and out-of-sample datasets, which I'll do now.
Training and test datasets can be broken down in a number of different ways. For the sake of simplicity, I'm going to use time as my main criterion.
From now until the last part of this series, 2015 will be used only for out-of-sample backtesting.
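Concretely, using the yearly slices built earlier, the split looks something like this (the `in_sample`/`out_of_sample` names are just my own shorthand):
# Time-based split: 2011-2014 for in-sample research,
# 2015 held out strictly for out-of-sample backtesting
in_sample = {year: timed_datasets[year] for year in range(2011, 2015)}
out_of_sample = {2015: timed_datasets[2015]}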
At this point, I want to make sure that there's nothing drastically wrong with my universe constraints, so I'm going to plot the number of observations per sector and the number of securities per sector as a simple check.
Remember that my universe is now filtered down by liquidity and sectors for the purpose of having good, consistent coverage over all securities. This means that my two plots should look similar to each other. If they don't, then I will need to decide if my universe constraints require additional tweaking.
obs_per_sec = {}
securities_per_sec = {}

# From [2011, 2015)
for year in range(2011, 2015):
    data = timed_datasets[year]

    # Get our filters
    liquidity_universe = filter_liquidity(year)
    sector_universe = filter_sector(year)
    liq_and_sec = pd.merge(liquidity_universe,
                           sector_universe,
                           left_on=['symbols'],
                           right_on=['symbols'])

    # Get the total number of observations
    df = get_avg_observations_per_security(data)
    df = pd.merge(df, liq_and_sec, left_on=['symbol'], right_on=['symbols'])
    df = df.dropna(subset=['morningstar_sector_code'])
    df['morningstar_sector_code'] = df['morningstar_sector_code'].apply(lambda x: SECTOR_NAMES[x])
    df = df.reset_index()

    # Create one pivot table summing up all observations
    sec_df = pd.pivot_table(df,
                            index='morningstar_sector_code',
                            values='observations',
                            aggfunc=np.sum)
    obs_per_sec[year] = sec_df

    # Also get number of securities
    sec_df = pd.pivot_table(df,
                            index='morningstar_sector_code',
                            values='index',
                            aggfunc=np.count_nonzero)
    securities_per_sec[year] = sec_df
pd.DataFrame(obs_per_sec).T.plot(kind='bar', figsize=(16, 12))
plt.title("Number of Observations Post-Filter for In-Sample")
plt.legend(loc='best')
plt.ylabel("Number of Observations")
plt.xlabel("Year")
pd.DataFrame(securities_per_sec).T.plot(kind='bar', figsize=(16, 12))
plt.title("Number of Securities Covered Post-Filter for In-Sample")
plt.legend(loc='best')
plt.ylabel("Number of Securities")
plt.xlabel("Year")
Great: there are no drastic differences between the two. The number of securities matches up fairly well with the number of observations, and I'm happy with my universe and training dataset for now. This means I can start formulating and testing hypotheses about my data, which will be the bulk of part three of this series.
So as a quick summary of what I did:
Thanks for reading, and send any feedback to slee@quantopian.com.