Seeking Feedback on Trading based on Sentiment Analysis + Intro

This will be a bit of an intro + seeking some feedback. My name is Harrison! Big fan of Python, been programming with it for quite some time, run my own businesses with it for a living, do automated bitcoin trading with it, and looking to possibly do some automated trading with stocks and sentiment, using the sentiment data from one of my companies, http://sentdex.com. I teach Python programming on http://pythonprogramming.net as well as http://youtube.com/sentdex.

Seems like a nice community here, and the service is very impressive. The speed with which we're able to execute free back-tests against minute data over the course of 13 years is just amazing...and then getting all these other metrics that are tracked on top of that. Way too cool. Then we add the fundamentals...then the fetcher... whoooo, what a nice place to tinker for a data analyst

I am still learning the ropes here, but I think this strategy is simple enough to not have too much wrong. Please feel free to point out any mistakes, though. I'd like to put some money down, but I'm paper trading first. I had a really hard time initially not taking on accidental leverage, but after some tweaking that's all fixed in the basic version. I'm not really sure yet how I am going to control the money in the second version, the one with shorting.

I have two major versions at the moment, both trading on sentiment signals from http://sentdex.com, where I made a page that the fetcher here could access and easily read.

I made a sample signals file under: http://sentdex.com/api/finance/sentiment-signals/sample/

These signals are also based on an algorithm, which is applied to the raw sentiment database, which also has a free download for anyone curious enough: http://sentdex.com/static/downloads/stocks_sentdex.zip. Both of these span from the inception of Sentdex with all of the data for the main finance section (starting late Oct 2012 to June 15th). The sentiment readings... you guessed it, are based on a natural language processing algorithm! Lots of layers of complexity, but the strategy is simple enough.

The daily signals are generated at 13:00 GMT, or 30 minutes prior to market open (my servers are on GMT time). From reading the docs, I see the CSV to be fetched needs to be ready by midnight EST, so that's kind of a bummer, but it doesn't really affect the back-tests at least.

One back-test version only buys/sells based on signals (buys at a signal of 6 and sells at a signal of 0 or below), with no shorting; the other version also does some shorting. The shorting version does almost twice as well, but winds up taking ~5x leverage (compared to no leverage on the basic strategy). I'm not sure if that's purely because the act of shorting is a sort of leverage, or if I am actually truly 5x leveraged, but I would like to not have leverage.
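The long-only rule described above (buy at a signal of 6, sell at 0 or below, otherwise leave the position alone) can be sketched as a small pure function. This is only an illustration, not the actual back-test code: the name `next_position` and its parameters are hypothetical, and treating any signal at or above 6 as a buy is an assumption.

```python
def next_position(signal, currently_long, buy_at=6, sell_at=0):
    """Decide whether to be long after seeing today's signal.

    Hypothetical helper: the thresholds come from the post, but the
    function name and the ">= buy_at" reading are assumptions.
    """
    if signal >= buy_at:
        return True            # strong positive signal: enter or keep the long
    if signal <= sell_at:
        return False           # zero or negative signal: exit, never short
    return currently_long      # in-between signal: leave the position as it is


# Walk one stock through a short, made-up series of daily signals.
held = False
history = []
for s in [3, 6, 4, 1, 0, 5]:
    held = next_position(s, held)
    history.append(held)
# history == [False, True, True, True, False, False]
```

Keeping the decision separate from the ordering code makes it easy to check the rule by hand before wiring it into a back-test.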

Unfortunately, I do not know how to post more than one back test at a time, so I will post here and then reply with the next. I assume a double post is better than a double thread.

Also, the context.stocks are chosen by selecting from the top 200 highest sentiment-volumed stocks. I'd really rather not have it this way, but I quickly found the limit was 255. The only major one that is missing is GOOG / GOOGL, but some newer ones weren't found, since I had to pick one major starting point. In the live version, I wouldn't have this problem.
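Picking the most-covered names can be sketched with a simple count of sentiment readings per ticker. This is a minimal illustration: `top_by_volume` is a hypothetical helper, and it assumes the raw data can be reduced to one ticker symbol per reading.

```python
from collections import Counter


def top_by_volume(tickers_seen, n=200):
    """Return the n tickers with the most sentiment readings.

    tickers_seen: an iterable with one ticker symbol per reading
    (an assumed shape for the raw data, for illustration only).
    """
    counts = Counter(tickers_seen)
    return [ticker for ticker, _ in counts.most_common(n)]


# Tiny made-up sample: AAPL has 3 readings, FB 2, C 1.
sample = ["AAPL", "FB", "AAPL", "C", "FB", "AAPL"]
universe = top_by_volume(sample, n=2)
# universe == ["AAPL", "FB"]
```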

Any help or suggestions are appreciated, especially in the shorting version, but feel free to play with the data. The sample data is free to goof around with.

13 responses

Okay, so here's the shorting version. Mostly curious what I can do to make sure I am not taking on any leverage, besides the actual short, but I would still like to do something like:

Say I have 1 million USD.

I can have
100k invested in AAPL
100k invested in GRPN
100k invested in WMT
100k invested in BAC
100k invested in C
100k invested in JPM
100k invested in TSLA

so 700k total long, then

300k total shorted like:

100k short MCD
100k short FB
100k short SBUX

... so, even though a short is really a sell and a promise to buy back later, I'd rather track it as if it were a position, and it seems like Quantopian counts shorts as "positive" positions from what I've seen looking at the positions via context.portfolio.positions.
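One way to check the 700k long / 300k short example is to compute gross and net exposure from dollar values per position. This is a sketch under assumptions: `exposure` is a hypothetical helper, and it takes signed dollar values (negative for shorts), which is a convention chosen for this sketch rather than a statement about how the platform reports positions.

```python
def exposure(position_values, capital):
    """Compute gross and net leverage from signed dollar position values.

    position_values: dict of ticker -> dollars, negative for shorts
    (an assumed convention for this sketch).
    """
    long_value = sum(v for v in position_values.values() if v > 0)
    short_value = -sum(v for v in position_values.values() if v < 0)
    return {
        "long": long_value,
        "short": short_value,
        # Gross counts shorts as positions, which is the tracking asked for above.
        "gross_leverage": (long_value + short_value) / capital,
        "net_leverage": (long_value - short_value) / capital,
    }


# The book from the post: seven 100k longs and three 100k shorts on 1M USD.
book = {t: 100_000 for t in ["AAPL", "GRPN", "WMT", "BAC", "C", "JPM", "TSLA"]}
book.update({t: -100_000 for t in ["MCD", "FB", "SBUX"]})
result = exposure(book, capital=1_000_000)
# result["gross_leverage"] == 1.0 and result["net_leverage"] == 0.4
```

With this accounting the example book is fully invested but not leveraged: gross exposure equals capital. A gross figure near 5 would mean long plus short dollars add up to about five times the account value.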

Good to see you on here, Harrison. I've depended pretty heavily on your tutorials.

Not sure if you have seen the below post. They are doing something similar.

https://www.quantopian.com/posts/accern-event-driven-earnings-focused-news-and-blog-backtest-results-pdf-report-attached

@ Jamie Lunn

Nope, I had not seen that. Very cool! Thanks for sharing this. I think event and sentiment/opinion mining investing/trading is the future, if not the current.

I haven't been here very long since finally signing up. I had browsed Quantopian quite a few times in the past, but it never stuck with me and my interests. With the fetcher and these Morningstar fundamentals (fingers crossed for history() to one day apply to those..and fingers crossed about a fetcher for live data that runs a bit closer to market open), it seems like it's shaping up to be a great place for me to mess around on, maybe enter the competition at some point.

Welcome Harrison !!
good to see you on here ...
Big fan of your tutorials :)!!!

@ Andrew Czeizler

Thanks for the welcome! Glad to hear you've enjoyed the tutorials! Should have some Quantopian ones soon.

Wicked!!!
looking forward to it!!!!

super excited!!!!

Do you have any sample ASX data?

I have looked,
but most I have found is very expensive...
maybe somebody on the forum can help...

@Jeremy Steele,

Not situated for Quantopian/trading signals, no, but I do have raw sentiment data. Trading signals could be done with present data, some of which you can visually see here:
http://sentdex.com/financial-analysis/asx/

...but I am not really all that happy with my Aussie sources currently plugged in, as the data that is coming in is far too infrequent. Someone did pass me some sources a while ago to use, but I have just been too packed with other things to set it up.

I have some graphs and existing data for the ASX, but you can see they are fairly sparse. I can pass you the raw sentiment database if you like in the form of a CSV.

This would be the raw sentiment readings, which you may or may not be comfortable dealing with. I make moving averages out of them. Best results are usually dynamic moving averages based on volume, but usually I divide by numbers far too large for my ASX data, so I don't really have any serious suggestions for good numbers to use. I mostly just track ASX "just in case" I want to build a serious platform on top of it too eventually. I am kind of a data hoarder. It's about 1.5M lines, with data going back to late July 2013, but it should also be noted this was tracking the ASX300 AS OF July 2013. I haven't updated any of the companies there that I can recall.
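For anyone working from the raw readings, a moving average whose window scales with the number of readings could look roughly like this. It is only a sketch: `moving_average` and `volume_window` are hypothetical helpers, and the divisor is a free parameter, not a number taken from Sentdex.

```python
def moving_average(values, window):
    """Trailing simple moving average; early points average what is available."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]   # drop the reading that left the window
        averages.append(running / min(i + 1, window))
    return averages


def volume_window(n_readings, divisor, minimum=1):
    """Pick a window length proportional to reading volume (assumed scheme)."""
    return max(minimum, n_readings // divisor)


readings = [1, 2, 3, 4]
smoothed = moving_average(readings, window=2)
# smoothed == [1.0, 1.5, 2.5, 3.5]
```

For sparse data such as the ASX set, a divisor tuned for heavily covered US names would collapse the window to the minimum, which matches the caveat above about not having good numbers to suggest.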

@Harrison Kinsley

This is data I would really be interested in seeing/building with as an Aussie local, but at $300/month I wouldn't be a paying client so don't go out of your way for me.

Definitely a project I would be excited to work on with/for you.

What modules are you using for the sentiment analysis? LIWC? Weka? Or one of the scikit Python libraries?

@ Jeremy
There are plenty of free versions of the data, especially for manual traders, though, at some point there must be a charge. I may target types of people with varying API types down the line, but it's not at the top of the to-do list. I am mostly looking for institutional-level people. There used to not be an API, but businesses have been asking for them in various forms, so I am just supplying the demand I get. I already tried a cheaper solution to satisfy most individuals, but wasn't happy with the amount of load required to satisfy, both in processing and in customer service / questions / whatever else. I think some people underestimate just how much goes into making something like this.

The classifier was originally trained with some help from scikit-learn, otherwise it is all in-house, including the NLP, spidering, etc. The classifier itself now is no longer an sklearn classifier either. Some of the Geo stuff I do uses NLTK for named entity recognition and for noun chunking, since they have some very nice tokenizers.

Hey Harrison,

Thanks for the post, blog & video, the rest of your Python videos look super interesting!

The Sentdex API looks pretty cool. Do you cite the full list of "over 20 sources" anywhere, or is there a way to filter down by quality? I also saw that you have the data available more frequently than daily. What's the variance like between publish time and API time, and how have the sources/quality/availability been changing since 10/2012?

Poking around Sentdex gave me the idea of following moving averages of economic sentiment by geographic region. The Political & Geographic pages on Sentdex look pretty US-focused and don't expose a data sample, but have you tried anything interesting with sentiment of non-ticker-specific data, or could you make any data samples available?

@ Adam Blackwell

I don't have a full list, because it changes, but they are coming mostly from Yahoo Finance's news feeds. The main constituents are listed here: http://sentdex.com/how-sentdex-works/ and are Reuters, Bloomberg, WSJ, LA Times, CNBC, Forbes, Business Insider, and Yahoo Finance. If you go to finance.yahoo.com and look through the sources there, that's where almost everything is coming from.

The closest source to "blog" or social media that Sentdex touches is Seeking Alpha for finance. It's my opinion that social media is rather useless for finance, and my findings from researching it said the same. There was some promise in what I found with something like StockTwits, but nothing that compares to straight finance journalism.

At the moment, almost nothing from the original source is stored, aside from some noun phrases and related words. That is a relatively new development, in an attempt to create a spidering algorithm like what Google does with links, only doing it with concepts instead of links. In this recent development I am also storing the source link, more for the consumer to follow up on, but I suppose it could be used to filter out sources as well. This isn't tied into the main sentiment analysis database, though; it's a separate project at the moment. URLs take up a lot of space in a database over time, which is why I have refrained from using them. I used to have them, but had no purpose for them, so it wound up being a cost decision to remove them.

The crawlers are running constantly. I have multiple servers, and the main crawler is threaded into about 30 crawlers, which break down to track various numbers of companies by their update volume. For example, one crawler tracks about 30 of the lesser-reported companies at once, while another single crawler is dedicated solely to tracking AAPL. Usually, updates come in within 60 seconds of being posted, but it can take slightly longer for the companies that are tracked by crawlers that track multiple, slower companies. If, say, 10 were updated at once, it may take more than 60 seconds, but it's pretty constant. Some sources might have changed, especially some of the smaller ones, but the main source has been the Yahoo Finance feed. (That is not to be read as "Yahoo is the main source": Yahoo syndicates financial news reports, so you can go here: http://finance.yahoo.com/q?s=aapl and see "headlines" from various sources for AAPL.) The Sentdex algorithm has remained unchanged since inception.
The API is a direct connection to the database, so the API should not have any sort of delay to it.
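Splitting companies across crawlers by update volume, as described above, can be illustrated with a greedy "heaviest first, least-loaded crawler next" assignment. This is a generic load-balancing sketch and not the actual Sentdex crawler code: `assign_to_crawlers` and the volume numbers are made up for illustration.

```python
import heapq


def assign_to_crawlers(update_volume, n_crawlers):
    """Greedily balance tickers across crawlers by expected update volume.

    update_volume: dict of ticker -> updates per day (illustrative numbers).
    Returns a list with one ticker list per crawler.
    """
    heap = [(0, i) for i in range(n_crawlers)]   # (current load, crawler index)
    heapq.heapify(heap)
    groups = [[] for _ in range(n_crawlers)]
    # Place the busiest tickers first so they end up on their own crawlers.
    ranked = sorted(update_volume.items(), key=lambda kv: kv[1], reverse=True)
    for ticker, volume in ranked:
        load, i = heapq.heappop(heap)            # least-loaded crawler so far
        groups[i].append(ticker)
        heapq.heappush(heap, (load + volume, i))
    return groups


volumes = {"AAPL": 100, "A": 30, "B": 30, "C": 30}
groups = assign_to_crawlers(volumes, n_crawlers=2)
# groups == [["AAPL"], ["A", "B", "C"]]
```

In the toy example the heavily reported ticker gets a crawler to itself while the quieter ones share one, which mirrors the AAPL versus lesser-reported split mentioned above.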

Geographic sentiment is solely generated with the Twitter API, so this is social media. You could model it if you wanted. You can also search specific keywords on the globe via this link: http://sentdex.com/geographic-sentiment-search/

It will take some time to load that link, which is why it's not really linked to on my site. It's just something I've been developing, but you could use it to search for various trends. For example, weather can affect various options markets, so you could search for people's sentiment for "weather," like this http://sentdex.com/geographic-sentiment-search/?q=weather, or something like that. That's a 1% firehose, however. To get really epic, I'd need something better than that.

For articles, I do not think geolocation of the author is too relevant. For some topics, it can be, but I don't think it carries enough weight in finance, so really the geolocation matters most when using social media data.

You can also search for geo data on a specific company, like Google or something: http://sentdex.com/geographic-sentiment-search/?q=google ... but again, I am not sure how valuable this one would be. So far, the best thing I have thought of is related topics like weather sentiment, where negative would be severe heat, droughts, too much rain... whatever... and how it would affect crops in those areas, and then subsequently affect futures / options prices.

You could also track "economy" or something: http://sentdex.com/geographic-sentiment-search/?q=economy

If you track a specific term as well, you will get up to 10% of the firehose. There are a lot of singular finance-based keywords one could track. Unfortunately, Sentdex is a 1-man operation, and the budget is not much. Running Twitter data is actually pretty expensive. Their API sucks up a ton of RAM. Via my home PC, I have no problem running tons of data, but via a VPS, I am always maxing out RAM before hitting the max from the free API.

On to politics data, I have exposed all sources there, actually. The source list for politics is fixed, as the amount of politics data gets absurd fast! Again, they are listed here: http://sentdex.com/how-sentdex-works/, and that list is: CNN, Fox, USA Today, ABC, MSNBC, CNBC, CBS, Huffington Post, Yahoo Politics, Washington Post, and Reuters.

Also the political sentiment is probably the most source-transparent of everything so far. The aforementioned concept crawler was first born here, and some of the fruits are public there. This same update is actually "live" for stocks, forex, and commodities, though the UI/UX isn't, same for the existence of FX and commodities!

So if we go here: http://sentdex.com/political-analysis/ and choose maybe "war", we wind up here: http://sentdex.com/political-analysis/?i=war&tf=30d, which gives you the 30 days of sentiment for "war."

Then, above the graph, you see those words, which are colored and sized. Color is the sentiment, and the size is volume. You can then click on "Charleston" as an example and actually get sources that contributed to the political topic of "war" around the sub topic of "Charleston." These are the direct source links that you can click and visit.

I have kept the "political" sentiment pretty much separate from the finance stuff, and you are also absolutely right in your assumption that it is US-based. Everything can be expanded, but it's all a question of funding. I am super excited about the geographic sentiment, but running it already accounts for about 95% of the costs to run all of Sentdex. In the grand scheme of things, Sentdex is still very cheap for me to run. At the moment, I am trying to get some APIs and data selling out there, to fund some more expansions and ideas of mine. The sky is the limit.

As for political data samples, there's one major database, I usually just give that out when people want to play around. There's a second growing database for the concepts work, but that area is still in its infancy.

As I was telling Josh and Seong earlier today, the sentiment signals are very new, as new as my account here. Sentdex has historically just been a pet-project of mine, mostly to suit my own specific curiosity, without really focusing on what others might actually want. The sentdex database of raw sentiment has been available for a while, but almost no one could figure out what to do with the raw sentiment numbers, and almost everyone with a connection was a doctoral student doing a dissertation... or at least someone willing to claim as much so I would go easy on my pricing. People kept asking me how to read sentiment, and I've always had my own rules, but converting them to a super simple strategy took some work and understanding about the system.

I had back tested some strategies with the Sentdex algo, and it did well, but my back testing code was never anything to write home about. Then I decided to try out Quantopian, read about the fetcher, and wrote that signals API real quick so I could work with the fetcher. I realized that the sentiment signals API is probably exactly what people have been hoping for all along. Something nice and simple that can be plugged in without needing to really come to grips with the Sentdex system at all.

Same thing with the sentiment graphs on site. That used to be all of what sentdex was, just the historical graphs of the MAs applied to raw sentiment... but people don't seem to like them or want that. I am a data fiend at heart, so I like it... but people just want the signals. Some people want historical stuff to test against, but they want nice clean data, never the raw sentiment, except for a few people. Been trying to satisfy that a bit more lately, mostly so I can satisfy my own interests some more with cool new projects with the data.

Thanks for your interest!