PsychSignal is a data analytics firm that provides real-time Trader Mood metrics for US equities. PsychSignal uses its natural language processing (NLP) engine to analyze millions of social media messages and provide two quantified sentiment scores for each security.
Two social media sources are used: Twitter and StockTwits. This PsychSignal series will use the Twitter & StockTwits with Retweets version; however, you can find additional versions through Quantopian Data.
This introduction will lay the groundwork for the rest of this series:
Introduction - Examining the data. My goal here is to simply look at the dataset and understand what it looks like. I’ll be answering simple questions like, “How many stocks are covered?”; “Which sectors have the most coverage?”; and “What’s the distribution of sentiment scores?”. These are very basic but fundamentally important questions that lay the groundwork for all further development.
Research Design - Here, I'll be setting up my environment for hypothesis testing and defining my in-sample and out-of-sample datasets, both cross-sectionally and through liquidity thresholds. With this, I'll be looking closely at things like the information coefficient and the autocorrelation of securities within my testing (in-sample) datasets. The Factor Tearsheet will be used very heavily for this.
Hypothesis Testing - This is where I’ll be setting up a number of different hypotheses for my data and testing them through event studies and cross-sectional studies. The Factor Tearsheet and Event Study notebooks will be used heavily. The goal is to develop an alpha factor to use for strategy creation.
Strategy Creation - After I've developed a hypothesis and seen that it holds up consistently over different liquidity and sector partitions in my in-sample dataset, I'll finally begin the process of developing my trading strategy. I'll be asking questions like "Is my factor strong enough by itself?" and "What is its correlation with other factors?". Once these questions have been answered, the trading strategy will be constructed and I'll move on to the next section.
Out-Of-Sample Test - Here, my main goal is to verify the work of steps 1~4 with my out-of-sample dataset. It will involve repeating many of the steps in 2~4 as well as the use of the backtester (notice how only step 5 involves the backtester).
I'm going to spend some time visualizing the data by looking at what columns there are and how the first few rows of the data are formatted.
# Importing Data
from __future__ import division
import numpy as np
import pandas as pd
import scipy as sp
import pyfolio as pf
import matplotlib.pyplot as plt
import seaborn as sns
from quantopian.pipeline import Pipeline
from quantopian.pipeline import CustomFactor
from quantopian.research import run_pipeline
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import Latest
from itertools import chain
from datetime import date
from blaze import by, merge
from odo import odo
SECTOR_NAMES = {
101: 'Basic Materials',
102: 'Consumer Cyclical',
103: 'Financial Services',
104: 'Real Estate',
205: 'Consumer Defensive',
206: 'Healthcare',
207: 'Utilities',
308: 'Communication Services',
309: 'Energy',
310: 'Industrials',
311: 'Technology',
}
# Plotting colors
c = "#38BB86"
# Fundamentals for sector data
fundamentals = init_fundamentals()
# Importing our Data/Sample Version
# PsychSignal's Twitter & StockTwits with Retweets sample is available from 24 Aug 2009 - 09 Jan 2016
from quantopian.interactive.data.psychsignal import aggregated_twitter_withretweets_stocktwits_free as dataset
# from quantopian.interactive.data.psychsignal import aggregated_twitter_withretweets_stocktwits as dataset
dataset.columns
There are a number of different datapoints available. I'm going to define them individually so I can have a better understanding of them:
So there are a few key metrics that stick out to me: the number of bullish/bearish messages, the intensity scores, and the total scanned messages seem important for later on. Before that, I'm going to see how the data is formatted.
dataset[:10]
Not surprising. The intensity scores are floats rather than integers from 0 ~ 4, and it seems that some columns are just reductions of others (bull_minus_bear, bull_bear_msg_ratio).
What sticks out to me is that each bullish and bearish intensity score can be based on a variable number of messages. Although it's still way too early to come to a conclusion, I'll keep in mind that each bullish and bearish intensity score has a sample size attached to it.
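A quick way to sanity-check the "reductions" observation is to pull a small sample into pandas and compare the derived columns against the raw ones. This is only a sketch, and it assumes bull_minus_bear is bullish minus bearish intensity and bull_bear_msg_ratio is the bull-to-bear message count ratio:
# Sanity check of the assumed relationships behind the derived columns
# (assumptions for illustration, not taken from the dataset documentation):
#   bull_minus_bear     ~ bullish_intensity - bearish_intensity
#   bull_bear_msg_ratio ~ bull_scored_messages / bear_scored_messages
sample = odo(dataset[:1000], pd.DataFrame).dropna()
print np.allclose(sample.bull_minus_bear,
                  sample.bullish_intensity - sample.bearish_intensity)
nonzero = sample[sample.bear_scored_messages > 0]
print np.allclose(nonzero.bull_bear_msg_ratio,
                  nonzero.bull_scored_messages / nonzero.bear_scored_messages)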
I think I have a sense of what the data looks like so let's look at the overall coverage.
num_securities = len(dataset.symbol.distinct())
print "The number of securities covered is %s" % num_securities
Okay, so that's the overall number of securities covered over the whole sample period; now I want to see if that number has changed over time.
"""
Here I'm going to separate the data into yearly chunks,
as well as get the stock sectors per year so we can match them up later.
"""
# Defining my yearly range
years = range(2011, 2016)
def get_dataset_for_year(dataset, year):
start_date = pd.to_datetime(date(year, 1, 1))
end_date = pd.to_datetime(date(year, 12, 31))
dataset = dataset[dataset.asof_date >= start_date]
dataset = dataset[dataset.asof_date <= end_date]
return dataset
# Getting my `timed_datasets` which I'll use later on for different plots
timed_datasets = {year: get_dataset_for_year(dataset, year) for year in years}
# Number of securities per year
yearly_securities = pd.Series(
{year: len(d.symbol.distinct()) for year, d in timed_datasets.iteritems()})
def _get_sectors(year):
sectors = get_fundamentals(query(fundamentals.asset_classification.morningstar_sector_code),
pd.to_datetime("%s-01-01" % year))
sectors = sectors.T
sectors.index = [s.symbol for s in sectors.index]
return sectors
# Get the sectors per year
sectors = {year: _get_sectors(year) for year in years}
print (yearly_securities.iloc[-1] - yearly_securities.iloc[0])/(yearly_securities.iloc[0])
yearly_securities
# Name the series before plotting so the legend picks up the label
yearly_securities.name = "Securities Covered"
yearly_securities.plot(kind='bar',
                       color=c, legend=True, grid=False)
plt.title("Number of Securities Covered Per Year")
plt.xlabel("Year")
plt.ylabel("Number of Securities")
plt.legend()
Looks like the number of securities covered has grown quite a bit over time. I want to add another layer of depth and look at this by sector.
"""
Getting securities per sector
"""
# Number of observations (rows) per security
def get_observations_per_security(dataset):
avg_per_security = by(dataset.symbol,
observations=dataset.asof_date.count())
avg_per_security = odo(avg_per_security, pd.DataFrame)
return avg_per_security
securities_per_sec = {}
for year, data in timed_datasets.iteritems():
df = get_observations_per_security(data)
df = pd.merge(df, sectors[year], left_on=['symbol'], right_index=True).dropna(subset=['morningstar_sector_code'])
df['morningstar_sector_code'] = df['morningstar_sector_code'].apply(lambda x: SECTOR_NAMES[x])
df = df.reset_index()
# Count securities per sector (aggfunc=len so the row with index 0 is counted too)
sec_df = pd.pivot_table(df,
                        index='morningstar_sector_code',
                        values='index',
                        aggfunc=len)
securities_per_sec[year] = sec_df
pd.DataFrame(securities_per_sec).T.plot(kind='bar', figsize=(16, 12))
plt.title("Number of Securities Covered Per Sector")
plt.legend(loc='best')
plt.ylabel("Number of Securities")
plt.xlabel("Year")
print "Sectors ranked by number of securities"
pd.DataFrame(securities_per_sec).rank().mean(axis=1).order(ascending=False)
So it looks like certain sectors, like Financial Services and Technology, have a wider breadth of stock coverage. Before moving on to see if the same applies to the number of observations (the number of rows in the dataset), let's do a quick summary:
If a security had 330 observations in a year, that security had 330 days' worth of coverage.
This is an interesting metric to look at because while the number of securities provides the breadth of coverage, the number of observations shows the depth of coverage.
# Total number of observations
num_values = int(dataset.asof_date.count())
avg_per_security = get_observations_per_security(dataset)
print "The number of observations is %s" % num_values
print "The average number of observations per security is %s" % avg_per_security.observations.mean()
yearly_observations = pd.Series(
{year: int(d.asof_date.count()) for year, d in timed_datasets.iteritems()})
yearly_obs_per_sec = pd.Series({year: get_observations_per_security(d).observations.mean()
for year, d in timed_datasets.iteritems()})
obs_per_sec = {}
for year, data in timed_datasets.iteritems():
df = get_observations_per_security(data)
df = pd.merge(df, sectors[year], left_on=['symbol'], right_index=True).dropna(subset=['morningstar_sector_code'])
df['morningstar_sector_code'] = df['morningstar_sector_code'].apply(lambda x: SECTOR_NAMES[x])
df = df.reset_index()
obs_df = pd.pivot_table(df,
index='morningstar_sector_code',
values='observations',
aggfunc=np.sum)
obs_per_sec[year] = obs_df
So overall, there's a pretty large number of observations, but how has that number changed over time?
print (yearly_observations.iloc[-1] - yearly_observations.iloc[0])/(yearly_observations.iloc[0])
yearly_observations
What the above shows is that the number of observations has grown 136% over four years. So later on, while I'm designing my factor, I can keep in mind that the further back I go, the less depth of coverage I'll have.
# Name the series before plotting so the legend picks up the label
yearly_observations.name = "Observations"
yearly_observations.plot(kind='line',
                         color=c, legend=True, grid=False)
plt.title("Number of Observations Per Year")
plt.xlabel("Year")
plt.ylabel("Number of Observations")
plt.legend()
pd.DataFrame(obs_per_sec).T.plot(kind='bar', figsize=(16, 12))
plt.title("Number of Observations Covered Per Sector")
plt.legend(loc='best')
plt.ylabel("Number of Observations")
plt.xlabel("Year")
This is really interesting. I can see that a few sectors make up the bulk of the observations. In 2014 and 2015, stocks in Technology and Healthcare had the most coverage versus Communication Services or Consumer Defensive.
Let's look at this metric on a per-stock basis.
print "Ordered sector observations"
pd.DataFrame(obs_per_sec).rank().mean(axis=1).order(ascending=False)
# Name the series before plotting so the legend picks up the label
yearly_obs_per_sec.name = "Observations Per Security"
yearly_obs_per_sec.plot(kind='line',
                        color=c, legend=True, grid=False)
plt.title("Average Number of Observations Per Security Per Year")
plt.xlabel("Year")
plt.ylabel("Number of Observations")
plt.legend()
The graph above falls in line with what I saw before: both the overall number of observations and the observations per security have increased year over year.
Now, let's take a look at the actual bullish/bearish intensity scores.
First, I want to plot the distribution of scores. The way I'm going to do that is to choose two time slices in 2015 (the full year, and then just the second half) and plot their distributions.
td = timed_datasets[2015]
td = td[td['total_scanned_messages'] > 0]
td = td[td['bull_scored_messages'] > 0]
td = td[td['bear_scored_messages'] > 0]
random_2015 = odo(td[['bullish_intensity', 'bearish_intensity', 'asof_date']], pd.DataFrame)
random_2015.bearish_intensity.plot(kind='hist', label="Bearish Intensity")
random_2015.bullish_intensity.plot(kind='hist', alpha=.5, label="Bullish Intensity")
plt.legend()
minimum_date = random_2015.asof_date.min()
maximum_date = random_2015.asof_date.max()
plt.title("Bullish/Bearish Intensity Distribution from %s to %s" % (minimum_date, maximum_date))
plt.xlabel("Intensity Score")
plt.ylabel("Frequency")
# Looking at the same plot as above but in a different time
td = timed_datasets[2015]
td = td[td['total_scanned_messages'] > 0]
td = td[td['bull_scored_messages'] > 0]
td = td[td['bear_scored_messages'] > 0]
td = td[td['asof_date'] > pd.to_datetime("7/01/2015")]
random_2015 = odo(td[['bullish_intensity', 'bearish_intensity', 'asof_date']], pd.DataFrame)
random_2015.bearish_intensity.plot(kind='hist', label="Bearish Intensity")
random_2015.bullish_intensity.plot(kind='hist', alpha=.5, label="Bullish Intensity")
plt.legend()
minimum_date = random_2015.asof_date.min()
maximum_date = random_2015.asof_date.max()
plt.title("Bullish/Bearish Intensity Distribution from %s to %s" % (minimum_date, maximum_date))
plt.xlabel("Intensity Score")
plt.ylabel("Frequency")
These are interesting distributions. You can see that across the two time slices the shape stayed relatively the same, with bullish intensity scores densely packed around 2.0 and bearish intensity scores more widely spread but centered around the same point.
Now, I want to just take a general look at the trends of these scores over the years.
# Average bullish/bearish intensity score per security
def get_scores_per_security(dataset):
avg = by(dataset.symbol,
bearish_scores=dataset.bearish_intensity.mean(),
bullish_scores=dataset.bullish_intensity.mean())
avg = odo(avg, pd.DataFrame)
return avg
yearly_bullish_bearish_scores = pd.DataFrame({year: get_scores_per_security(d).mean()
for year, d in timed_datasets.iteritems()})
yearly_bullish_bearish_scores.T.plot(kind='line', grid=False)
plt.xticks(years, ['2011', '2012', '2013', '2014', '2015'])
plt.title("Average Sentiment Per Security")
plt.xlabel("Year")
plt.ylabel("Sentiment")
So both the bearish and bullish intensity scores have increased over the years, and, on average, the bullish intensity score is higher than the bearish intensity score. I also remember from the data formatting section that each score has a sample size of bull/bear scored messages attached to it. So let's look at that number now.
# Average number of bullish/bearish/total scanned messages per security
def get_msgs_per_security(dataset):
avg_msgs = by(dataset.symbol,
bullish_messages=dataset.bull_scored_messages.mean(),
bearish_messages=dataset.bear_scored_messages.mean(),
total_scanned=dataset.total_scanned_messages.mean())
avg_msgs = odo(avg_msgs, pd.DataFrame)
return avg_msgs
avg_msgs = get_msgs_per_security(dataset)
print "The average number of bullish messages were %s" % avg_msgs.bullish_messages.mean()
print "The average number of bearish messages were %s" % avg_msgs.bearish_messages.mean()
print "The average number of scanned messages were %s" % avg_msgs.total_scanned.mean()
yearly_bullish_bearish_scanned = pd.DataFrame({year: get_msgs_per_security(d).mean()
for year, d in timed_datasets.iteritems()})
yearly_bullish_bearish_scanned.T.plot(kind='bar', grid=False)
plt.title("Average Number of Messages Per Security Per Observation")
plt.xlabel("Year")
plt.ylabel("Number of Messages Per Security")
# Same plot as above but presented as a median
yearly_bullish_bearish_scanned = pd.DataFrame({year: get_msgs_per_security(d).median()
for year, d in timed_datasets.iteritems()})
yearly_bullish_bearish_scanned.T.plot(kind='bar', grid=False)
plt.title("Median Number of Messages Per Security Per Observation")
plt.xlabel("Year")
plt.ylabel("Number of Messages Per Security")
What I did above was first look at the average number of messages per security, but I remembered that certain sectors have far greater coverage than others, which makes me suspect a fat tail in the distribution of messages per security. The median number of messages per security does support that hypothesis: popular securities get the most coverage.
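As a rough numeric check of that fat-tail suspicion (just a sketch on top of the avg_msgs frame computed earlier), one can compare the mean to the median and look at the skew of total scanned messages per security; a mean well above the median and a large positive skew would both point to a handful of heavily covered names.
# Rough fat-tail check (exploratory sketch, not part of the original analysis):
# mean >> median and strongly positive skew suggest a small number of
# securities accounts for most of the message volume.
from scipy import stats
scanned = avg_msgs.total_scanned.dropna()
print "Mean / median ratio: %.2f" % (scanned.mean() / scanned.median())
print "Skewness: %.2f" % stats.skew(scanned)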
I want to narrow that down and see what the most popular securities are and how much of the pie that they're getting.
# Get the number of messages per security per year
msgs_per_security = {year: get_msgs_per_security(d)
for year, d in timed_datasets.iteritems()}
# Compute each security's share of total scanned messages per year
def _get_top_ten(d, year):
d['percent'] = d['total_scanned']/d['total_scanned'].sum()
d['year'] = year
return d
ordered_msgs = {year: _get_top_ten(d, year)
for year, d in msgs_per_security.iteritems()}
for i, (y, d) in enumerate(ordered_msgs.iteritems()):
if i == 0:
data = d
else:
data = data.append(d)
data = data.set_index(['year', 'symbol'])
data['percent'].unstack().T.mean(axis=1).order(ascending=False).head(5)
4% of 1.5 million+ observations is quite a lot, and I'll have to keep this fat tail of securities in mind while determining my universe in the next part of this series.
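To put that share into absolute terms, here's a quick back-of-envelope calculation that takes the quoted 4% at face value and uses the total observation count computed earlier (num_values); it's only a rough translation of a share into row counts.
# Back-of-envelope: what a ~4% share of the full dataset means in raw rows
# (the 4% figure is the combined share quoted above)
top_share = 0.04
print "A %.0f%% share of %d observations is roughly %d rows" % (
    top_share * 100, num_values, int(top_share * num_values))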
Finally, I want to look at the consistency of these scores. Now, this part is a bit tricky because I'll already be defining a liquidity universe and such, but I'll only be looking at it from a bird's eye view and won't be reading too much into it, yet.
Autocorrelation is a good measure for determining the turnover of a factor. A factor with low autocorrelation will consistently produce high portfolio turnover and vice versa for a factor with high autocorrelation.
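To build intuition for what that means before running it on real data, here's a tiny toy sketch (synthetic data, unrelated to PsychSignal): a factor whose values are redrawn at random each period has near-zero period-to-period rank autocorrelation (implying high turnover), while one whose values barely move has rank autocorrelation near one (implying low turnover).
# Toy illustration of rank autocorrelation on hypothetical data (not PsychSignal)
np.random.seed(42)
random_factor = pd.DataFrame(np.random.randn(10, 50))  # values redrawn every period
sticky_factor = pd.DataFrame(np.tile(np.random.randn(50), (10, 1))
                             + 0.01 * np.random.randn(10, 50))  # values barely change
for name, f in [("Random", random_factor), ("Sticky", sticky_factor)]:
    ranks = f.rank(axis=1)  # rank the 50 hypothetical assets within each period
    autocorr = ranks.corrwith(ranks.shift(1), axis=1).mean()
    print "%s factor mean rank autocorrelation: %.2f" % (name, autocorr)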
"""
Now we look at autocorrelation.
Taken from the Factor TearSheet:
- https://www.quantopian.com/posts/factor-tear-sheet
"""
from quantopian.pipeline.data.psychsignal import aggregated_twitter_withretweets_stocktwits_free as dataset
class Liquidity(CustomFactor):
inputs = [USEquityPricing.volume, USEquityPricing.close]
window_length = 5
def compute(self, today, assets, out, volume, close):
out[:] = (volume * close).mean(axis=0)
class Sector(CustomFactor):
inputs = [morningstar.asset_classification.morningstar_sector_code]
window_length = 1
def compute(self, today, assets, out, msc):
out[:] = msc[-1]
def factor_rank_autocorrelation(daily_factor, time_rule='W', factor_name='factor'):
"""
Computes autocorrelation of mean factor ranks in specified timespans.
We must compare week to week factor ranks rather than factor values to account for
systematic shifts in the factor values of all names or names within a sector.
This metric is useful for measuring the turnover of a factor. If the value of a factor
for each name changes randomly from week to week, we'd expect a weekly autocorrelation of 0.
Parameters
----------
daily_factor : pd.DataFrame
DataFrame with integer index and date, equity, factor, and sector
code columns.
time_rule : string, optional
Time span to use in factor grouping mean reduction.
See http://pandas.pydata.org/pandas-docs/stable/timeseries.html for available options.
factor_name : string
Name of factor column on which to compute IC.
Returns
-------
autocorr : pd.Series
Rolling 1 period (defined by time_rule) autocorrelation of factor values.
"""
daily_ranks = daily_factor.copy()
daily_ranks[factor_name] = daily_factor.groupby(['date', 'sector_code'])[factor_name].apply(
lambda x: x.rank(ascending=True))
equity_factor = daily_ranks.pivot(index='date', columns='equity', values=factor_name)
if time_rule is not None:
equity_factor = equity_factor.resample(time_rule, how='mean')
autocorr = equity_factor.corrwith(equity_factor.shift(1), axis=1)
return autocorr
def plot_factor_rank_auto_correlation(daily_factor, time_rule='W', factor_name='factor'):
"""
Plots factor rank autocorrelation over time. See factor_rank_autocorrelation for more details.
Parameters
----------
daily_factor : pd.DataFrame
DataFrame with date, equity, and factor value columns.
time_rule : string, optional
Time span to use in time grouping reduction prior to autocorrelation calculation.
See http://pandas.pydata.org/pandas-docs/stable/timeseries.html for available options.
factor_name : string
Name of factor column on which to compute IC.
"""
fa = factor_rank_autocorrelation(daily_factor, time_rule=time_rule, factor_name=factor_name)
print "Mean rank autocorrelation: " + str(fa.mean())
fa.plot(title='Week-to-Week Factor Rank Autocorrelation')
plt.ylabel('autocorrelation coefficient')
plt.show()
def construct_factor_history(factor_cls, start_date='2015-10-1', end_date='2016-2-1',
factor_name='factor',
top_liquid=1000, universe_constraints=None, sector_names=None):
"""
Creates a DataFrame containing daily factor values and sector codes for a liquidity
constrained universe. The returned DataFrame can be used in the factor tear sheet.
Parameters
----------
factor_cls : quantopian.pipeline.CustomFactor
Factor class to be computed.
start_date : string or pd.datetime
Starting date for factor computation.
end_date : string or pd.datetime
End date for factor computation.
factor_name : string, optional
Column name for factor column in returned DataFrame.
top_liquid : int, optional
Limit universe to the top N most liquid names each trading day.
Based on trailing 5 days traded dollar volume.
universe_constraints : num_expr, optional
Pipeline universe constraint.
Returns
-------
daily_factor : pd.DataFrame
DataFrame with integer index and date, equity, factor, and sector
code columns.
"""
factor = factor_cls()
sector = Sector()
liquidity = Liquidity()
liquidity_rank = liquidity.rank(ascending=False)
ok_universe = (top_liquid > liquidity_rank) & factor.eq(factor) & sector.eq(sector)
if universe_constraints is not None:
ok_universe = ok_universe & universe_constraints
pipe = Pipeline()
pipe.add(factor, factor_name)
pipe.add(sector, 'sector_code')
pipe.set_screen(ok_universe)
daily_factor = run_pipeline(pipe, start_date=start_date, end_date=end_date)
daily_factor = daily_factor.reset_index().rename(
columns={'level_0': 'date', 'level_1':'equity'})
daily_factor = daily_factor[daily_factor.sector_code != -1]
if sector_names is not None:
daily_factor.sector_code = daily_factor.sector_code.apply(
lambda x: sector_names[x])
return daily_factor
class PsychSignalBull(CustomFactor):
inputs = [dataset.bullish_intensity]
window_length = 1
def compute(self, today, assets, out, bui):
out[:] = bui[-1]
class PsychSignalBear(CustomFactor):
inputs = [dataset.bearish_intensity]
window_length = 1
def compute(self, today, assets, out, bi):
out[:] = bi[-1]
class PsychSignalBullMessages(CustomFactor):
inputs = [dataset.bull_scored_messages]
window_length = 1
def compute(self, today, assets, out, bum):
out[:] = bum[-1]
class PsychSignalBearMessages(CustomFactor):
inputs = [dataset.bear_scored_messages]
window_length = 1
def compute(self, today, assets, out, bm):
out[:] = bm[-1]
def construct_factor(custom_factor, fn):
factor = construct_factor_history(custom_factor, factor_name=fn,
start_date='2014-1-1', end_date='2015-1-1',
top_liquid=500, sector_names=SECTOR_NAMES)
return factor
print "Bullish Intensity Scores"
factor = construct_factor(PsychSignalBull, 'bullish_intensity')
plot_factor_rank_auto_correlation(factor, factor_name='bullish_intensity')
print ""
print "Bearish Intensity Scores"
factor = construct_factor(PsychSignalBear, 'bearish_intensity')
plot_factor_rank_auto_correlation(factor, factor_name='bearish_intensity')
print ""
print "Bullish Scored Messages"
factor = construct_factor(PsychSignalBullMessages, 'bull_messages')
plot_factor_rank_auto_correlation(factor, factor_name='bull_messages')
print ""
print "Bearish Scored Messages"
factor = construct_factor(PsychSignalBearMessages, 'bear_messages')
plot_factor_rank_auto_correlation(factor, factor_name='bear_messages')
Like I said before, I don't want to read too much into these scores because they're already constrained to a liquidity universe, but at first glance the autocorrelation scores for bullish and bearish scored messages seem to support the idea that popular securities have the most coverage.
Finally, I'm going to take a security (say the XLK Technology Sector ETF) and plot its bullish and bearish intensity side by side with the security's price.
# Only plotting for 2013
random_sec = 'XLK'
start = pd.to_datetime("2013-01-01")
end = pd.to_datetime("2013-12-31")
sec_data = dataset[dataset.symbol == random_sec]
sec_data = sec_data[sec_data['asof_date'] >= start]
sec_data = sec_data[sec_data['asof_date'] <= end]
sec_data = odo(sec_data, pd.DataFrame)
sec_data = sec_data.set_index(['asof_date'])
sec_data = sec_data.fillna(0)
pricing = get_pricing(symbols(random_sec),
fields='close_price',
start_date=start,
end_date=end)
sec_data[['bullish_intensity', 'bearish_intensity']].plot()
(pricing - pricing.iloc[0]).plot()
plt.legend()
That's a little noisy; let's look at the rolling mean.
pd.rolling_mean(sec_data[['bullish_intensity', 'bearish_intensity']], window=10).plot()
(pricing - pricing.iloc[0]).plot()
plt.legend()
Just from plotting this one graph, I can start to notice a few patterns. For example, during the peaks after a drawdown, both bullish and bearish intensities are quite high (e.g., between April and May). This fits my anecdotal experience that traders are both at their most excited and most fearful on a sharp high. Something to keep in mind and explore during my hypothesis testing phase.
So there are a number of different things that I've observed from just looking at the data. This will be helpful in determining my in-sample and out-of-sample datasets as well as the securities I want to use in my universe. That will be done in part two of this series - Research Design - so stay tuned for that.
Here's a final summary of what I did in this notebook: I looked at the breadth of coverage (the number of securities) and how it has grown year over year, the depth of coverage (the number of observations) overall and by sector, the fat tail of heavily covered securities, the distributions and yearly trends of the bullish/bearish intensity scores and message counts, and a first bird's-eye glance at factor rank autocorrelation and at one security's scores plotted against its price.