PsychSignal is a data analytics firm that provides real-time Trader Mood metrics for US equities. PsychSignal uses its natural language processing (NLP) engine to analyze millions of social media messages and provide two quantified sentiment scores for each security.
Two social media sources are used: Twitter and StockTwits. This PsychSignal series will use the Twitter & StockTwits with Retweets version; however, you can find additional versions through Quantopian Data.
This introduction will lay the groundwork for the rest of this series:
Introduction - Examining the data. My goal here is to simply look at the dataset and understand what it looks like. I’ll be answering simple questions like, “How many stocks are covered?”; “Which sectors have the most coverage?”; and “What’s the distribution of sentiment scores?”. These are very basic but fundamentally important questions that lay the groundwork for all further development.
Research Design - Here, I'll be setting up my environment for hypothesis testing and defining my in-sample and out-of-sample datasets, both cross-sectionally and through liquidity thresholds. With this, I'll be looking closely at things like the information coefficient and the autocorrelation of securities within my testing (in-sample) datasets. The Factor Tearsheet will be used very heavily for this.
Hypothesis Testing - This is where I’ll be setting up a number of different hypotheses for my data and testing them through event studies and cross-sectional studies. The Factor Tearsheet and Event Study notebooks will be used heavily. The goal is to develop an alpha factor to use for strategy creation.
Strategy Creation - After I've developed a hypothesis and seen that it holds up consistently over different liquidity and sector partitions in my in-sample dataset, I'll finally begin the process of developing my trading strategy. I'll be asking questions like "Is my factor strong enough by itself?" and "What is its correlation with other factors?". Once these questions have been answered, the trading strategy will be constructed and I'll move on to the next section.
Out-Of-Sample Test - Here, my main goal is to verify the work of steps 1~4 with my out-of-sample dataset. It will involve repeating many of the steps in 2~4 as well as the use of the backtester (notice how only step 5 involves the backtester).
I'm going to spend some time visualizing the data by looking at what columns there are and how the first few rows of the data are formatted.
# Importing Data
from __future__ import division
import numpy as np
import pandas as pd
import scipy as sp
import pyfolio as pf
import matplotlib.pyplot as plt
import seaborn as sns
from quantopian.pipeline import Pipeline
from quantopian.pipeline import CustomFactor
from quantopian.research import run_pipeline
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import Latest
from itertools import chain
from datetime import date
from blaze import by, merge
from odo import odo
SECTOR_NAMES = {
101: 'Basic Materials',
102: 'Consumer Cyclical',
103: 'Financial Services',
104: 'Real Estate',
205: 'Consumer Defensive',
206: 'Healthcare',
207: 'Utilities',
308: 'Communication Services',
309: 'Energy',
310: 'Industrials',
311: 'Technology',
}
# Plotting colors
c = "#38BB86"
# Fundamentals for sector data
fundamentals = init_fundamentals()
# Importing our Data/Sample Version
# PsychSignal's Twitter & StockTwits with Retweets sample is available from 24 Aug 2009 - 09 Jan 2016
from quantopian.interactive.data.psychsignal import aggregated_twitter_withretweets_stocktwits_free as dataset
# from quantopian.interactive.data.psychsignal import aggregated_twitter_withretweets_stocktwits as dataset
dataset.columns
There are a number of different datapoints available. I'm going to define them individually so I can have a better understanding of them:
So there are a few key metrics that stick out to me: the number of bullish/bearish messages, the intensity scores, and the total scanned messages seem important for later on. Before that, I'm going to see how the data is formatted.
dataset[:10]
Not surprising. The intensity scores are floats rather than integers from 0 ~ 4, and it seems that some columns are just reductions of others (bull_minus_bear, bull_bear_msg_ratio).
What sticks out to me is that each bullish and bearish intensity score can be based on a variable number of messages. Although it's still way too early to come to a conclusion, I'll keep in mind that each bullish and bearish intensity score has a sample size attached to it.
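A quick way to sanity-check the "reductions" observation is to pull a small sample into pandas and compare the derived columns against the raw ones. This is only a sketch, and it assumes bull_minus_bear is bullish minus bearish intensity and bull_bear_msg_ratio is the bull-to-bear message count ratio:
# Sanity check of the assumed relationships behind the derived columns
# (assumptions for illustration, not taken from the dataset documentation):
#   bull_minus_bear     ~ bullish_intensity - bearish_intensity
#   bull_bear_msg_ratio ~ bull_scored_messages / bear_scored_messages
sample = odo(dataset[:1000], pd.DataFrame).dropna()
print np.allclose(sample.bull_minus_bear,
                  sample.bullish_intensity - sample.bearish_intensity)
nonzero = sample[sample.bear_scored_messages > 0]
print np.allclose(nonzero.bull_bear_msg_ratio,
                  nonzero.bull_scored_messages / nonzero.bear_scored_messages)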
I think I have a sense of what the data looks like so let's look at the overall coverage.
num_securities = len(dataset.symbol.distinct())
print "The number of securities covered is %s" % num_securities
Okay, so that's the overall number of securities covered over the whole sample period; now I want to see if that number has changed over time.
"""
Here I'm going to separate the data into yearly chunks,
as well as get the stock sectors per year so we can match them up later.
"""
# Defining my yearly range
years = range(2011, 2016)
def get_dataset_for_year(dataset, year):
start_date = pd.to_datetime(date(year, 1, 1))
end_date = pd.to_datetime(date(year, 12, 31))
dataset = dataset[dataset.asof_date >= start_date]
dataset = dataset[dataset.asof_date <= end_date]
return dataset
# Getting my `timed_datasets` which I'll use later on for different plots
timed_datasets = {year: get_dataset_for_year(dataset, year) for year in years}
# Number of securities per year
yearly_securities = pd.Series(
{year: len(d.symbol.distinct()) for year, d in timed_datasets.iteritems()})
def _get_sectors(year):
sectors = get_fundamentals(query(fundamentals.asset_classification.morningstar_sector_code),
pd.to_datetime("%s-01-01" % year))
sectors = sectors.T
sectors.index = [s.symbol for s in sectors.index]
return sectors
# Get the sectors per year
sectors = {year: _get_sectors(year) for year in years}
print (yearly_securities.iloc[-1] - yearly_securities.iloc[0])/(yearly_securities.iloc[0])
yearly_securities
# Name the series before plotting so the legend picks up the label
yearly_securities.name = "Securities Covered"
yearly_securities.plot(kind='bar',
                       color=c, legend=True, grid=False)
plt.title("Number of Securities Covered Per Year")
plt.xlabel("Year")
plt.ylabel("Number of Securities")
plt.legend()
Looks like the number of securities covered has grown quite a bit over time. I want to add another layer of depth and look at this by sector.
"""
Getting securities per sector
"""
# Number of observations (rows) per security
def get_observations_per_security(dataset):
avg_per_security = by(dataset.symbol,
observations=dataset.asof_date.count())
avg_per_security = odo(avg_per_security, pd.DataFrame)
return avg_per_security
securities_per_sec = {}
for year, data in timed_datasets.iteritems():
df = get_observations_per_security(data)
df = pd.merge(df, sectors[year], left_on=['symbol'], right_index=True).dropna(subset=['morningstar_sector_code'])
df['morningstar_sector_code'] = df['morningstar_sector_code'].apply(lambda x: SECTOR_NAMES[x])
df = df.reset_index()
# Count securities per sector (aggfunc=len so the row with index 0 is counted too)
sec_df = pd.pivot_table(df,
                        index='morningstar_sector_code',
                        values='index',
                        aggfunc=len)
securities_per_sec[year] = sec_df
pd.DataFrame(securities_per_sec).T.plot(kind='bar', figsize=(16, 12))
plt.title("Number of Securities Covered Per Sector")
plt.legend(loc='best')
plt.ylabel("Number of Securities")
plt.xlabel("Year")
print "Sectors ranked by number of securities"
pd.DataFrame(securities_per_sec).rank().mean(axis=1).order(ascending=False)
So it looks like certain sectors, like Financial Services and Technology, have a wider breadth of stock coverage. Before moving on to see if the same applies to the number of observations (the number of rows in the dataset), let's do a quick summary:
If a security had 330 observations in a year, that security had 330 days' worth of coverage.
This is an interesting metric to look at because while the number of securities provides the breadth of coverage, the number of observations shows the depth of coverage.
# Total number of observations
num_values = int(dataset.asof_date.count())
avg_per_security = get_observations_per_security(dataset)
print "The number of observations is %s" % num_values
print "The average number of observations per security is %s" % avg_per_security.observations.mean()
yearly_observations = pd.Series(
{year: int(d.asof_date.count()) for year, d in timed_datasets.iteritems()})
yearly_obs_per_sec = pd.Series({year: get_observations_per_security(d).observations.mean()
for year, d in timed_datasets.iteritems()})
obs_per_sec = {}
for year, data in timed_datasets.iteritems():
df = get_observations_per_security(data)
df = pd.merge(df, sectors[year], left_on=['symbol'], right_index=True).dropna(subset=['morningstar_sector_code'])
df['morningstar_sector_code'] = df['morningstar_sector_code'].apply(lambda x: SECTOR_NAMES[x])
df = df.reset_index()
obs_df = pd.pivot_table(df,
index='morningstar_sector_code',
values='observations',
aggfunc=np.sum)
obs_per_sec[year] = obs_df
So overall, there's a pretty large number of observations, but how has that number changed over time?
print (yearly_observations.iloc[-1] - yearly_observations.iloc[0])/(yearly_observations.iloc[0])
yearly_observations
What the above shows is that the number of observations has grown 136% over four years. So later on, while I'm designing my factor, I can keep in mind that the further back I go, the less depth of coverage I'll have.
# Name the series before plotting so the legend picks up the label
yearly_observations.name = "Observations"
yearly_observations.plot(kind='line',
                         color=c, legend=True, grid=False)
plt.title("Number of Observations Per Year")
plt.xlabel("Year")
plt.ylabel("Number of Observations")
plt.legend()
pd.DataFrame(obs_per_sec).T.plot(kind='bar', figsize=(16, 12))
plt.title("Number of Observations Covered Per Sector")
plt.legend(loc='best')
plt.ylabel("Number of Observations")
plt.xlabel("Year")
This is really interesting. I can see that a few sectors make up the bulk of the observations. In 2014 and 2015, stocks in Technology and Healthcare had the most coverage versus Communication Services or Consumer Defensive.
Let's look at this metric on a per-stock basis.
print "Ordered sector observations"
pd.DataFrame(obs_per_sec).rank().mean(axis=1).order(ascending=False)
# Name the series before plotting so the legend picks up the label
yearly_obs_per_sec.name = "Observations Per Security"
yearly_obs_per_sec.plot(kind='line',
                        color=c, legend=True, grid=False)
plt.title("Average Number of Observations Per Security Per Year")
plt.xlabel("Year")
plt.ylabel("Number of Observations")
plt.legend()
The graph above falls in line with what I saw before: both the overall number of observations and the observations per security have increased year over year.
Now, let's take a look at the actual bullish/bearish intensity scores.
First, I want to plot the distribution of scores. The way I'm going to do that is to choose two time slices in 2015 (the full year, and then just the second half) and plot their distributions.
td = timed_datasets[2015]
td = td[td['total_scanned_messages'] > 0]
td = td[td['bull_scored_messages'] > 0]
td = td[td['bear_scored_messages'] > 0]
random_2015 = odo(td[['bullish_intensity', 'bearish_intensity', 'asof_date']], pd.DataFrame)
random_2015.bearish_intensity.plot(kind='hist', label="Bearish Intensity")
random_2015.bullish_intensity.plot(kind='hist', alpha=.5, label="Bullish Intensity")
plt.legend()
minimum_date = random_2015.asof_date.min()
maximum_date = random_2015.asof_date.max()
plt.title("Bullish/Bearish Intensity Distribution from %s to %s" % (minimum_date, maximum_date))
plt.xlabel("Intensity Score")
plt.ylabel("Frequency")
# Looking at the same plot as above but in a different time
td = timed_datasets[2015]
td = td[td['total_scanned_messages'] > 0]
td = td[td['bull_scored_messages'] > 0]
td = td[td['bear_scored_messages'] > 0]
td = td[td['asof_date'] > pd.to_datetime("7/01/2015")]
random_2015 = odo(td[['bullish_intensity', 'bearish_intensity', 'asof_date']], pd.DataFrame)
random_2015.bearish_intensity.plot(kind='hist', label="Bearish Intensity")
random_2015.bullish_intensity.plot(kind='hist', alpha=.5, label="Bullish Intensity")
plt.legend()
minimum_date = random_2015.asof_date.min()
maximum_date = random_2015.asof_date.max()
plt.title("Bullish/Bearish Intensity Distribution from %s to %s" % (minimum_date, maximum_date))
plt.xlabel("Intensity Score")
plt.ylabel("Frequency")
These are interesting distributions. You can see that across the two time slices the shape stayed relatively the same, with bullish intensity scores densely packed around 2.0 and bearish intensity scores more widely spread but centered around the same point.
Now, I want to just take a general look at the trends of these scores over the years.
# Average bullish/bearish intensity score per security
def get_scores_per_security(dataset):
avg = by(dataset.symbol,
bearish_scores=dataset.bearish_intensity.mean(),
bullish_scores=dataset.bullish_intensity.mean())
avg = odo(avg, pd.DataFrame)
return avg
yearly_bullish_bearish_scores = pd.DataFrame({year: get_scores_per_security(d).mean()
for year, d in timed_datasets.iteritems()})
yearly_bullish_bearish_scores.T.plot(kind='line', grid=False)
plt.xticks(years, ['2011', '2012', '2013', '2014', '2015'])
plt.title("Average Sentiment Per Security")
plt.xlabel("Year")
plt.ylabel("Sentiment")
So both the bearish and bullish intensity scores have increased over the years, and, on average, the bullish intensity score is higher than the bearish intensity score. I also remember from the data formatting section that each score has a sample size of bull/bear scored messages attached to it. So let's look at that number now.
# Average number of bullish/bearish/total scanned messages per security
def get_msgs_per_security(dataset):
avg_msgs = by(dataset.symbol,
bullish_messages=dataset.bull_scored_messages.mean(),
bearish_messages=dataset.bear_scored_messages.mean(),
total_scanned=dataset.total_scanned_messages.mean())
avg_msgs = odo(avg_msgs, pd.DataFrame)
return avg_msgs
avg_msgs = get_msgs_per_security(dataset)
print "The average number of bullish messages were %s" % avg_msgs.bullish_messages.mean()
print "The average number of bearish messages were %s" % avg_msgs.bearish_messages.mean()
print "The average number of scanned messages were %s" % avg_msgs.total_scanned.mean()
yearly_bullish_bearish_scanned = pd.DataFrame({year: get_msgs_per_security(d).mean()
for year, d in timed_datasets.iteritems()})
yearly_bullish_bearish_scanned.T.plot(kind='bar', grid=False)
plt.title("Average Number of Messages Per Security Per Observation")
plt.xlabel("Year")
plt.ylabel("Number of Messages Per Security")
# Same plot as above but presented as a median
yearly_bullish_bearish_scanned = pd.DataFrame({year: get_msgs_per_security(d).median()
for year, d in timed_datasets.iteritems()})
yearly_bullish_bearish_scanned.T.plot(kind='bar', grid=False)
plt.title("Median Number of Messages Per Security Per Observation")
plt.xlabel("Year")
plt.ylabel("Number of Messages Per Security")
What I did above was first look at the average number of messages per security, but I remembered that certain sectors have far greater coverage than others, which makes me suspect a fat tail in the distribution of messages per security. The median number of messages per security does support that hypothesis: popular securities get the most coverage.
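As a rough numeric check of that fat-tail suspicion (just a sketch on top of the avg_msgs frame computed earlier), one can compare the mean to the median and look at the skew of total scanned messages per security; a mean well above the median and a large positive skew would both point to a handful of heavily covered names.
# Rough fat-tail check (exploratory sketch, not part of the original analysis):
# mean >> median and strongly positive skew suggest a small number of
# securities accounts for most of the message volume.
from scipy import stats
scanned = avg_msgs.total_scanned.dropna()
print "Mean / median ratio: %.2f" % (scanned.mean() / scanned.median())
print "Skewness: %.2f" % stats.skew(scanned)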
I want to narrow that down and see what the most popular securities are and how much of the pie that they're getting.
# Get the number of messages per security per year
msgs_per_security = {year: get_msgs_per_security(d)
for year, d in timed_datasets.iteritems()}
# Compute each security's share of total scanned messages per year
def _get_top_ten(d, year):
d['percent'] = d['total_scanned']/d['total_scanned'].sum()
d['year'] = year
return d
ordered_msgs = {year: _get_top_ten(d, year)
for year, d in msgs_per_security.iteritems()}
for i, (y, d) in enumerate(ordered_msgs.iteritems()):
if i == 0:
data = d
else:
data = data.append(d)
data = data.set_index(['year', 'symbol'])
data['percent'].unstack().T.mean(axis=1).order(ascending=False).head(5)
4% of 1.5 million+ observations is quite a lot, and I'll have to keep this fat tail of securities in mind while determining my universe in the next part of this series.
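To put that share into absolute terms, here's a quick back-of-envelope calculation that takes the quoted 4% at face value and uses the total observation count computed earlier (num_values); it's only a rough translation of a share into row counts.
# Back-of-envelope: what a ~4% share of the full dataset means in raw rows
# (the 4% figure is the combined share quoted above)
top_share = 0.04
print "A %.0f%% share of %d observations is roughly %d rows" % (
    top_share * 100, num_values, int(top_share * num_values))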
Finally, I want to look at the consistency of these scores. Now, this part is a bit tricky because I'll already be defining a liquidity universe and such, but I'll only be looking at it from a bird's eye view and won't be reading too much into it, yet.
Autocorrelation is a good measure for determining the turnover of a factor. A factor with low autocorrelation will consistently produce high portfolio turnover and vice versa for a factor with high autocorrelation.
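To build intuition for what that means before running it on real data, here's a tiny toy sketch (synthetic data, unrelated to PsychSignal): a factor whose values are redrawn at random each period has near-zero period-to-period rank autocorrelation (implying high turnover), while one whose values barely move has rank autocorrelation near one (implying low turnover).
# Toy illustration of rank autocorrelation on hypothetical data (not PsychSignal)
np.random.seed(42)
random_factor = pd.DataFrame(np.random.randn(10, 50))  # values redrawn every period
sticky_factor = pd.DataFrame(np.tile(np.random.randn(50), (10, 1))
                             + 0.01 * np.random.randn(10, 50))  # values barely change
for name, f in [("Random", random_factor), ("Sticky", sticky_factor)]:
    ranks = f.rank(axis=1)  # rank the 50 hypothetical assets within each period
    autocorr = ranks.corrwith(ranks.shift(1), axis=1).mean()
    print "%s factor mean rank autocorrelation: %.2f" % (name, autocorr)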
"""
Now we look at autocorrelation.
Taken from the Factor TearSheet:
- https://www.quantopian.com/posts/factor-tear-sheet
"""
from quantopian.pipeline.data.psychsignal import aggregated_twitter_withretweets_stocktwits_free as dataset
class Liquidity(CustomFactor):
inputs = [USEquityPricing.volume, USEquityPricing.close]
window_length = 5
def compute(self, today, assets, out, volume, close):
out[:] = (volume * close).mean(axis=0)
class Sector(CustomFactor):
inputs = [morningstar.asset_classification.morningstar_sector_code]
window_length = 1
def compute(self, today, assets, out, msc):
out[:] = msc[-1]
def factor_rank_autocorrelation(daily_factor, time_rule='W', factor_name='factor'):
"""
Computes autocorrelation of mean factor ranks in specified timespans.
We must compare week to week factor ranks rather than factor values to account for
systematic shifts in the factor values of all names or names within a sector.
This metric is useful for measuring the turnover of a factor. If the value of a factor
for each name changes randomly from week to week, we'd expect a weekly autocorrelation of 0.
Parameters
----------
daily_factor : pd.DataFrame
DataFrame with integer index and date, equity, factor, and sector
code columns.
time_rule : string, optional
Time span to use in factor grouping mean reduction.
See http://pandas.pydata.org/pandas-docs/stable/timeseries.html for available options.
factor_name : string
Name of factor column on which to compute IC.
Returns
-------
autocorr : pd.Series
Rolling 1 period (defined by time_rule) autocorrelation of factor values.
"""
daily_ranks = daily_factor.copy()
daily_ranks[factor_name] = daily_factor.groupby(['date', 'sector_code'])[factor_name].apply(
lambda x: x.rank(ascending=True))
equity_factor = daily_ranks.pivot(index='date', columns='equity', values=factor_name)
if time_rule is not None:
equity_factor = equity_factor.resample(time_rule, how='mean')
autocorr = equity_factor.corrwith(equity_factor.shift(1), axis=1)
return autocorr
def plot_factor_rank_auto_correlation(daily_factor, time_rule='W', factor_name='factor'):
"""
Plots factor rank autocorrelation over time. See factor_rank_autocorrelation for more details.
Parameters
----------
daily_factor : pd.DataFrame
DataFrame with date, equity, and factor value columns.
time_rule : string, optional
Time span to use in time grouping reduction prior to autocorrelation calculation.
See http://pandas.pydata.org/pandas-docs/stable/timeseries.html for available options.
factor_name : string
Name of factor column on which to compute IC.
"""
fa = factor_rank_autocorrelation(daily_factor, time_rule=time_rule, factor_name=factor_name)
print "Mean rank autocorrelation: " + str(fa.mean())
fa.plot(title='Week-to-Week Factor Rank Autocorrelation')
plt.ylabel('autocorrelation coefficient')
plt.show()
def construct_factor_history(factor_cls, start_date='2015-10-1', end_date='2016-2-1',
factor_name='factor',
top_liquid=1000, universe_constraints=None, sector_names=None):
"""
Creates a DataFrame containing daily factor values and sector codes for a liquidity
constrained universe. The returned DataFrame can be used in the factor tear sheet.
Parameters
----------
factor_cls : quantopian.pipeline.CustomFactor
Factor class to be computed.
start_date : string or pd.datetime
Starting date for factor computation.
end_date : string or pd.datetime
End date for factor computation.
factor_name : string, optional
Column name for factor column in returned DataFrame.
top_liquid : int, optional
Limit universe to the top N most liquid names each trading day.
Based on trailing 5 days traded dollar volume.
universe_constraints : num_expr, optional
Pipeline universe constraint.
Returns
-------
daily_factor : pd.DataFrame
DataFrame with integer index and date, equity, factor, and sector
code columns.
"""
factor = factor_cls()
sector = Sector()
liquidity = Liquidity()
liquidity_rank = liquidity.rank(ascending=False)
ok_universe = (top_liquid > liquidity_rank) & factor.eq(factor) & sector.eq(sector)
if universe_constraints is not None:
ok_universe = ok_universe & universe_constraints
pipe = Pipeline()
pipe.add(factor, factor_name)
pipe.add(sector, 'sector_code')
pipe.set_screen(ok_universe)
daily_factor = run_pipeline(pipe, start_date=start_date, end_date=end_date)
daily_factor = daily_factor.reset_index().rename(
columns={'level_0': 'date', 'level_1':'equity'})
daily_factor = daily_factor[daily_factor.sector_code != -1]
if sector_names is not None:
daily_factor.sector_code = daily_factor.sector_code.apply(
lambda x: sector_names[x])
return daily_factor
class PsychSignalBull(CustomFactor):
inputs = [dataset.bullish_intensity]
window_length = 1
def compute(self, today, assets, out, bui):
out[:] = bui[-1]
class PsychSignalBear(CustomFactor):
inputs = [dataset.bearish_intensity]
window_length = 1
def compute(self, today, assets, out, bi):
out[:] = bi[-1]
class PsychSignalBullMessages(CustomFactor):
inputs = [dataset.bull_scored_messages]
window_length = 1
def compute(self, today, assets, out, bum):
out[:] = bum[-1]
class PsychSignalBearMessages(CustomFactor):
inputs = [dataset.bear_scored_messages]
window_length = 1
def compute(self, today, assets, out, bm):
out[:] = bm[-1]
def construct_factor(custom_factor, fn):
factor = construct_factor_history(custom_factor, factor_name=fn,
start_date='2014-1-1', end_date='2015-1-1',
top_liquid=500, sector_names=SECTOR_NAMES)
return factor
print "Bullish Intensity Scores"
factor = construct_factor(PsychSignalBull, 'bullish_intensity')
plot_factor_rank_auto_correlation(factor, factor_name='bullish_intensity')
print ""
print "Bearish Intensity Scores"
factor = construct_factor(PsychSignalBear, 'bearish_intensity')
plot_factor_rank_auto_correlation(factor, factor_name='bearish_intensity')
print ""
print "Bullish Scored Messages"
factor = construct_factor(PsychSignalBullMessages, 'bull_messages')
plot_factor_rank_auto_correlation(factor, factor_name='bull_messages')
print ""
print "Bearish Scored Messages"
factor = construct_factor(PsychSignalBearMessages, 'bear_messages')
plot_factor_rank_auto_correlation(factor, factor_name='bear_messages')
Like I said before, I don't want to read too much into these scores because they're already constrained to a liquidity universe, but at first glance the autocorrelation scores for bullish and bearish scored messages seem to support the idea that popular securities have the most coverage.
Finally, I'm going to take a security (say the XLK Technology Sector ETF) and plot its bullish and bearish intensity side by side with the security's price.
# Only plotting for 2013
random_sec = 'XLK'
start = pd.to_datetime("2013-01-01")
end = pd.to_datetime("2013-12-31")
sec_data = dataset[dataset.symbol == random_sec]
sec_data = sec_data[sec_data['asof_date'] >= start]
sec_data = sec_data[sec_data['asof_date'] <= end]
sec_data = odo(sec_data, pd.DataFrame)
sec_data = sec_data.set_index(['asof_date'])
sec_data = sec_data.fillna(0)
pricing = get_pricing(symbols(random_sec),
fields='close_price',
start_date=start,
end_date=end)
sec_data[['bullish_intensity', 'bearish_intensity']].plot()
(pricing - pricing.iloc[0]).plot()
plt.legend()
That's a little noisy; let's look at the rolling mean.
pd.rolling_mean(sec_data[['bullish_intensity', 'bearish_intensity']], window=10).plot()
(pricing - pricing.iloc[0]).plot()
plt.legend()
Just from plotting this one graph, I can start to notice a few patterns. For example, during the peaks after a drawdown, both bullish and bearish intensities are quite high (e.g., between April and May). This fits my anecdotal experience that traders are both at their most excited and most fearful on a sharp high. Something to keep in mind and explore during my hypothesis testing phase.
So there are a number of different things that I've observed from just looking at the data. This will be helpful in determining my in-sample and out-of-sample datasets as well as the securities I want to use in my universe. That will be done in part two of this series - Research Design - so stay tuned for that.
Here's a final summary of what I did in this notebook: I looked at the breadth of coverage (the number of securities) and how it has grown year over year, the depth of coverage (the number of observations) overall and by sector, the fat tail of heavily covered securities, the distributions and yearly trends of the bullish/bearish intensity scores and message counts, and a first bird's-eye glance at factor rank autocorrelation and at one security's scores plotted against its price.