Trading with Sentiment Machine Learning

This post introduces an algorithm that combines sentiment analysis and machine learning. The algorithm trades 5 companies (Apple, Boeing, Intel, Merck and Google). The sentiment data comes from the FinSentS Portal provided by InfoTrie Financial Solutions (http://www.finsents.com). I downloaded the data from Quandl (https://www.quandl.com/data/NS1-FinSentS-Web-News-Sentiment).

Theory Base
The post is based primarily on the assumption that sentiment data can be used to predict stock prices. Support for this can be found in the paper: Can Online Emotions Predict the Stock Market in China? (Zhou Z, Zhao J and Xu K, 2016).

Here is the link: https://link-springer-com.ezlibproxy1.ntu.edu.sg/chapter/10.1007/978-3-319-48740-3_24

Paper Abstract:
In this paper, the authors explore the relationship between sentiment on Weibo (the Chinese Twitter) and stock price fluctuations. First, the authors removed the data from non-trading days, because sentiment volume drops significantly on those days. Second, 5 sentiment labels were employed in the analysis: anger, sadness, joy, disgust and fear.

The mechanism behind these five sentiment labels is described in the paper: Moodlens: An Emoticon-based Sentiment Analysis System for Chinese Tweets (Zhao J et al., 2012).

The correlation analysis showed that anger has no predictive ability for stock price; neither does volume; fear distinctly affects the open price; sadness and joy affect the highest and lowest prices respectively; disgust mainly affects the close price. Finally, the authors tested the prediction performance of 3 classifiers using sentiment data. The results showed that SVM performed better than the linear classifier.

FinSentS Sentiment Data Introduction

The data has 5 attribute values for each stock, each day:

  • Sentiment Score: a numeric measure of the bullishness / bearishness of news coverage of the stock.
  • Sentiment High / Low: highest and lowest intra-day sentiment scores.
  • News Volume: the absolute number of news articles covering the stock.
  • News Buzz: a numeric measure of the change in coverage volume for the stock.

Strategy Explanation

  • Data preparation:
    I used four types of sentiment data in this analysis: sentiment score, sentiment range, news volume and news buzz. I combined sentiment high and sentiment low into one attribute, sentiment range. This reduces noise, since the sentiment score is already included; a single range value can replace the high and low values and should still carry the necessary information.

    Quantopian did not support fetching the history of user-uploaded data, so I used a custom method to access past data. This method was mainly inspired by this post: https://www.quantopian.com/posts/method-to-get-historic-values-from-fetcher-data

    In short, I passed a post_func to the fetch_csv function. When the csv file is read, the past data for each row is stored in a new column of that row. When I need the past data, I can simply read that column.

  • Cash allocation:
    For each day's trading, 5% of cash is held in reserve; this figure comes from empirical experience. In addition, each stock is allocated at most one fifth of the remaining portfolio cash. This is to avoid high leverage.

  • Training data set and testing data set:
    The training data for a specific day is the sentiment data from the 3 days before it, and the testing data is the label of whether the stock rises or falls on that day. For each stock on each trading day, 96 past pairs of training and testing data were prepared to train the models.

  • Model fitting and prediction:
    In this algorithm, 4 common classifiers were employed: NuSVM, LinearSVM, Random Forest and Logistic Regression. To further eliminate noise and unnecessary trades, I introduced a voting system. A trade is only placed when at least 3 of the 4 classifiers agree to long or short a stock.

  • Adding moving average:
    To also take advantage of technical analysis, a simple moving average strategy was added. The basis of this strategy is that when the short-term moving average is above the long-term moving average, the asset price is on an upward path, so we want to follow the trend and long the asset. Conversely, we want to short the asset when the short-term average is below the long-term average.
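
The steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the code used in the backtest: the function names, the definition of sentiment range as high minus low, the rise/fall label based on the previous close, and the 5-day and 20-day moving average windows are all my own placeholders. The four classifiers are not shown; their predictions are passed in as +1 (rise) or -1 (fall). The post does not say how the vote and the moving average signal are combined, so they are kept as separate functions.

```python
from collections import Counter


def sentiment_range(high, low):
    """Collapse intra-day sentiment high/low into one feature (assumed: high - low)."""
    return high - low


def make_windows(features, closes, lookback=3):
    """Pair each day's label with the features of the `lookback` preceding days.

    features: list of per-day feature lists; closes: list of closing prices.
    The label is 1 if the close rose versus the previous day, else 0
    (an assumed definition of "rise or fall").
    """
    X, y = [], []
    for t in range(lookback, len(features)):
        # Flatten the previous `lookback` days into one feature row.
        window = [v for day in features[t - lookback:t] for v in day]
        X.append(window)
        y.append(1 if closes[t] > closes[t - 1] else 0)
    return X, y


def vote(predictions, min_agree=3):
    """Return +1 (long), -1 (short) or 0 (no trade) from classifier votes of +1/-1."""
    counts = Counter(predictions)
    if counts[1] >= min_agree:
        return 1
    if counts[-1] >= min_agree:
        return -1
    return 0


def sma_signal(prices, short=5, long=20):
    """+1 if the short SMA is above the long SMA, -1 if below, 0 otherwise."""
    if len(prices) < long:
        return 0  # not enough history to compute the long average
    short_ma = sum(prices[-short:]) / short
    long_ma = sum(prices[-long:]) / long
    return (short_ma > long_ma) - (short_ma < long_ma)


def max_cash_per_stock(cash, n_stocks=5, reserve=0.05):
    """Per-stock cap: an equal share of the cash left after the reserve."""
    return cash * (1 - reserve) / n_stocks


# Example: three classifiers say "rise", one says "fall" -> go long.
signal = vote([1, 1, 1, -1])
# Example: a 2-2 split does not reach the 3-vote minimum -> no trade.
no_trade = vote([1, 1, -1, -1])
```

With 5 stocks and a 5% reserve, max_cash_per_stock caps each position at 19% of the available cash.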

Results Evaluation
Overall, the algorithm's performance is quite good. Its beta is not very high and it did not show high volatility. One drawback of the backtest is that the whole period was an upward period for the S&P 500, so it cannot show much about how the algorithm reacts to bad markets.

Future Improvement

  • Larger universe and lower position concentration.
    Due to Quantopian's limit on the size of csv files that can be fetched, I could only upload sentiment data for 5 stocks. If there is a way to connect to the FinSentS API, or another way to feed the sentiment data into the algorithm, better and more stable performance can be expected from a larger universe.

  • Equal long/short exposure.
    It is possible to go further and calculate the probability of a rise or fall for each stock. After ranking, I can either trade only the top few stocks or set a benchmark (threshold) for trading. This can be used to balance the long and short exposure. In this algorithm, because I only have 5 stocks, I did not implement this strategy, but it is worth trying for a larger universe.

  • Using non-trading day sentiment data.
    In this algorithm, non-trading day sentiment data was not used due to its low volume. Logically, however, it is worth trying to use it.

  • Deep learning for time series data classification.
    To simplify the model, I used the past 3 days' sentiment attributes for prediction and fit the model on them as a flat set of attributes. However, a more advanced algorithm is encouraged, since the attribute data is also a time series. Possibly, deep learning with a latency factor can be employed. I will study this topic further.
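
The ranking idea in the equal long/short exposure item could look roughly like the sketch below. It is only an assumption about how such a step might work: the function name rank_long_short, the parameter k and the probabilities in the example are invented for illustration and do not come from the backtest.

```python
def rank_long_short(prob_up, k=2):
    """Pick the k most likely risers to long and the k least likely to short.

    prob_up: dict mapping a symbol to its estimated probability of rising.
    """
    ranked = sorted(prob_up, key=prob_up.get, reverse=True)
    # Never let the two legs overlap, and keep them the same size.
    k = min(k, len(ranked) // 2)
    return ranked[:k], ranked[len(ranked) - k:]


# Made-up probabilities for the 5 stocks, for illustration only.
probs = {"AAPL": 0.71, "BA": 0.42, "INTC": 0.55, "MRK": 0.33, "GOOG": 0.64}
longs, shorts = rank_long_short(probs, k=2)
```

Because both legs hold the same number of names, equal position sizes would give roughly balanced long and short exposure.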

Reference
Zhao J, Dong L, Wu J, Xu K. Moodlens: An Emoticon-based Sentiment Analysis System for Chinese Tweets. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012. pp. 1528-1531.

Zhou Z, Zhao J, Xu K. Can Online Emotions Predict the Stock Market in China? International Conference on Web Information Systems Engineering. Springer International Publishing, 2016. pp. 328-342.