The Social Media Trader Mood Series Pt. 2: Research Design

Research design is a fundamental and often overlooked part of the algorithm creation process. It is the point where the grounded quant asks, "What is my universe?" and "What is my training/testing dataset?" These two simple questions provide a solid framework for interpreting and validating results from factor research and backtesting.

Here's an overview of what you'll learn from this notebook:

How to break down a list of securities by liquidity baskets
How to set guidelines for your universe of securities based on capital base and data coverage
How to validate your universe constraints with your in-sample datasets
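
To make the first item concrete, here is a minimal sketch of one way to sort securities into liquidity baskets. It is not the notebook's own code: ranking by average daily dollar volume, the function name liquidity_baskets, the tickers, and the numbers are all assumptions made up for illustration.

```python
from statistics import mean


def liquidity_baskets(dollar_volume, n_baskets=5):
    """Assign each ticker to a basket by average daily dollar volume.

    Basket 0 holds the most liquid names, basket n_baskets - 1 the least.
    """
    # Average each ticker's daily dollar volume over the lookback window.
    adv = {ticker: mean(values) for ticker, values in dollar_volume.items()}
    # Rank tickers from most to least liquid.
    ranked = sorted(adv, key=adv.get, reverse=True)
    baskets = {}
    for rank, ticker in enumerate(ranked):
        # Integer arithmetic spreads the ranks as evenly as possible.
        baskets[ticker] = rank * n_baskets // len(ranked)
    return baskets


# Hypothetical tickers and made-up daily dollar volumes, for illustration only.
sample = {
    "AAA": [9.0e8, 1.1e9, 1.0e9],
    "BBB": [4.0e7, 6.0e7, 5.0e7],
    "CCC": [2.0e6, 3.0e6, 2.5e6],
    "DDD": [7.0e5, 9.0e5, 8.0e5],
}
baskets = liquidity_baskets(sample, n_baskets=2)
```

With baskets in hand, a universe guideline can be as simple as keeping only the baskets your capital base can realistically trade.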

I highly recommend following this series in chronological order. The first part of this series covers the first and most important step of strategy creation: data examination. You can find links to each section below as they become available.

And in terms of pacing, step 2 (Research Design) is where you are now:

1. Introduction - Examining the data. My goal here is to simply look at the dataset and understand what it looks like. I’ll be answering simple questions like “How many stocks are covered?”, “Which sectors have the most coverage?”, and “What’s the distribution of sentiment scores?” These are very basic but fundamentally important questions that lay the groundwork for all further development.
2. Research Design - Here, I’ll be setting up my environment for hypothesis testing and defining my in-sample and out-of-sample datasets, both cross-sectionally and through liquidity thresholds.
3. Hypothesis Testing - This is where I’ll be setting up a number of different hypotheses for my data and testing them through event studies and cross-sectional studies. The Factor Tearsheet and Event Study notebooks will be used heavily. The goal is to develop an alpha factor to use for strategy creation.
4. Strategy Creation - After I’ve developed a hypothesis and seen that it holds up consistently over different liquidity and sector partitions in my in-sample dataset, I’ll finally begin the process of developing my trading strategy. I’ll be asking questions like “Is my factor strong enough by itself?” and “What is its correlation with other factors?” Once these questions have been answered, the trading strategy will be constructed and I’ll move on to the next section.
5. Out-Of-Sample Test - Here, my main goal is to verify the work of steps 1 through 4 with my out-of-sample dataset. It will involve repeating many of the steps in 2 through 4 as well as the use of the backtester (notice how only step 5 involves the backtester).

As this is my first time working through this flow, the steps above are subject to change as I learn and iterate through my mistakes. Feel free to post feedback and questions.
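
As a rough illustration of the in-sample/out-of-sample idea in the Research Design step, the sketch below splits observations both by date and by security. It reflects my reading of "cross-sectionally" as holding out a fixed subset of tickers, which is an assumption; the cutoff date, the holdout fraction, and the function names are hypothetical and not taken from the notebook.

```python
from datetime import date
import hashlib

# Hypothetical cutoff: everything before it is in-sample by date.
CUTOFF = date(2015, 1, 1)


def is_in_sample_date(day, cutoff=CUTOFF):
    """True when the observation date falls before the cutoff."""
    return day < cutoff


def is_in_sample_ticker(ticker, holdout_fraction=0.3):
    """Deterministically reserve a fraction of tickers for out-of-sample use."""
    # Hash the ticker to a stable number in [0, 1) so the split never changes
    # between runs and does not depend on the order of the universe.
    digest = hashlib.md5(ticker.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket >= holdout_fraction


def is_in_sample(ticker, day):
    """An observation is in-sample only if both the date and ticker qualify."""
    return is_in_sample_date(day) and is_in_sample_ticker(ticker)
```

Hashing the ticker rather than sampling at random keeps the held-out set fixed, so later research steps cannot accidentally peek at out-of-sample names.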

Quick Notes:

I’ll be using the Twitter & StockTwits dataset throughout this series. You can import it through:

Pipeline: from quantopian.pipeline.data.psychsignal import aggregated_twitter_withretweets_stocktwits_free
Research: from quantopian.interactive.data.psychsignal import aggregated_twitter_withretweets_stocktwits_free

The sample version is shown in the attached notebook and is available for both backtesting and research through January 7th, 2016.
The full version of this dataset is updated daily and includes availability for backtesting, research, and live trading.