Hi Luc,
Many thanks. I will contact you directly over the next few days.
One of the things I found with ML in general (of whatever type) is that usually the quality and success of the output is very strongly dependent on exactly what you do as pre-processing before the inputs actually go into the ML / AI. The more you can "help" the ML to get started in the right direction, the better, so that it can focus its efforts on the "important stuff" and doesn't have to waste its time trying to figure out (perhaps unsuccessfully) how to do something that we could have just told it beforehand. Specifically, in this case, you have 2 "raw" inputs, namely the number of bullish tweets and the number of bearish tweets. Although at first glance these might seem like logical choices for input to ML, actually the problem with using these 2 items as they are, is that both of them contains a mixture of 2 different types of info, namely 1) Bearishness vs Bullishness and 2) Changing levels of Enthusiasm for tweeting. My suggestion is to do a little bit of pre-processing to separate these two different aspects BEFORE inputting the data to the ML, as follows:
a) Count Total tweets = Bullish tweets + Bearish tweets.
b) Count NET Bullish tweets = Bullish tweets - Bearish tweets (will be +ve if predominantly Bullish, -ve if predominantly Bearish)
c) Proportion NET Bullish Tweets = a) / b). This is now normalized relative to the total number of tweets and becomes more purely a measure of + or - sentiment itself.
d) Take the Average number of NET Bullish tweets, and then from this take the ratio of Current NET Bullish Tweets to Average NET Bullish tweets. This gives a Short-term measure of day-to-day variability in bullishness or bearishness.
e) Slope = trend of NET bullish tweets, gives a Longer-term measure of changes in bullishness or bearishness.
Items c) d) and e) are now all normalized with respect to the total number of tweets and would potentially be useful as inputs to ML. The confusing factor of how many people are currently tweeting (either way) has been removed. We now have 3 inputs rather than the original 2, and they now contain info in a slightly different form that should be easier for any ML to work with. Please could you try this and see if it helps?
The other thought I have is that irrespective of whether you are using Naive Bayes or SVM, and Gaussian or Linear or RBF models, these are all Classifiers which, at least as I understand it, are designed to give a binary 1/0 output. Now is this really what you want? If you only want to decide Long or Short, then OK, but in fact we can probably do much better than that. In the context of an Equity Long-Short strategy with a large universe of possible equities to choose from, what we would actually like is a Ranking of all the equities on a continuous (rather than a binary) scale. The we go Long on the N= however many we want top-ranking (most bullish) equities, and go Short on the N bottom ranking (most bearish) equities. So, to do that, what we would need is something that gives a continuous-valued output rather than a 1/0 output.
Cheers, best wishes, Tony
(also on Linked-in, see Tony Morland)