Algo with Support Vector Machine in Pipeline

posted Sep 29, 2017

I wrote a base algo that incorporates machine learning in the pipeline, i.e. as the pipeline runs, it trains an ML model PER STOCK and comes up with a prediction of the stock's movement. The algo can then use the output of the pipeline to go long the stocks predicted to rise and short the stocks predicted to fall.

As it stands, this algo does not perform well, but it can serve as a basis for someone else.

/Luc Prieur

29 responses

Thanks!

Thank you for this

Hi Luc,
As I am still basically "non-pythonic" (especially with pandas) and need to get there first before I can do much, I had not planned on diving into either ML or alternative data from social media for quite a long while yet, but this thread and your algo look very interesting.

Please excuse me if some of my questions seem naive or simply wrong, due to my limited knowledge of Python.

From what I can infer from the little that I understand of the code in your algo, I think that you are taking 2 different sets of inputs from social media (aggregated_twitter_withretweets_stocktwits), one being bullish tweets and the other bearish tweets, then taking the average number of each type over a rolling window and calculating the slope of those two averages, and then using those 2 items as inputs to your SVM. Am I more or less correct so far?

It also looks like you are taking SMA5 & SMA10 of price, but I can't quite figure out if these are also inputs to the SVM or simply used in the filter. [I'm sorry if this sounds stupid on my part but, as I said, I'm new to Python.] Anyway, at least as I understand it, you are using the SLOPES of the average numbers of bullish & bearish tweets, and possibly also the VALUES of the two SMAs, as ML inputs. Please set me straight if I have this wrong so far.

Do you have any descriptive (i.e. other than python code) documentation of what you are doing that you could share? Perhaps then we could talk some more.

Despite my weakness in python, i have been trading for more than 30 years, i have at least some experience with SVM, and quite a lot of experience in the problems of using ML in trading systems, at least in the context of old-fashioned Neural Networks. Although there are obvious differences, there are also some similarities in the practical problems that one has with regard to good choices of indicators to use as input for any type of ML, and also with the issue of pre-processing.

I look forward to understanding more about what you are doing and sharing ideas.
Best regards, Tony

luc prieur

Nov 1, 2017

Tony,

The SMA5 and SMA10 are useless as well as filtering the stocks on SMA5 being larger than SMA10. I should have removed that bit of code. It is not within the scope of using ML in the pipeline. I had picked up that small bit of code from another algo posted in the communities.

As far as your understanding of the ML bit, you understand it correctly. I am using the slope of tweets (bearish and bullish) as input to the ML and the target of the ML is price movement of the following day for said equity.

I copied some of the ML code below.

# Here I select data for the two features:
features = ['bull_rs', 'bear_rs']

# This is the last row of feature data. It must be used to predict tomorrow's movement. Hence I call it live.
X_live = df[features][-1:]

# This is the remainder of the data, which I shift one day forward so that the previous day's data is aligned with the
# current day's price movement, i.e. "result". result is +1 if there was a gain, -1 for a loss.
df[features] = df[features].shift(1)

# I remove the NaN row (should be only the first row of data).
df.dropna(inplace=True)

# Scale data.
X_train = scaler.fit_transform(df[features])
# Set target.
y_train = df['result']

# Specify model. Funny, I mentioned SVM but use Gaussian Naive Bayes here. Changing the model is just a drop-in replacement.
model = GaussianNB()

# Here I train the model and run the prediction for tomorrow in one line.
prediction = model.fit(X_train, y_train).predict(scaler.transform(X_live))[0]
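
To make the alignment step concrete, here is a minimal sketch using only NumPy and made-up numbers (the arrays are illustrative, not data from the algo). It pairs each day's features with the next day's result and holds back the most recent feature row for the live prediction, which should have the same effect as the shift(1) and dropna() steps in the excerpt.

```python
import numpy as np

# Hypothetical daily data for one stock, oldest row first.
features = np.array([
    [0.10, -0.20],   # day 0: bull_rs, bear_rs
    [0.30,  0.10],   # day 1
    [-0.40, 0.50],   # day 2
    [0.20, -0.10],   # day 3 (today)
])
result = np.array([1, -1, -1, 1])  # +1 gain / -1 loss observed on each day

# Today's features have no known outcome yet, so they form the "live" row.
X_live = features[-1:]

# Pair the features of day t-1 with the result of day t.
X_train = features[:-1]
y_train = result[1:]
```

The first result has no earlier feature row to pair with, which is why one row drops out of the training set.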

The whole algo is just presented as a template for others to build on. I did not try to make it perform or anything.

BR/

Luc

luc prieur

Nov 1, 2017

You can contact me directly on linkedin if you wish.

https://www.linkedin.com/in/lucprieur/

Tony Morland

Nov 4, 2017

Hi Luc,

Many thanks. I will contact you directly over the next few days.

One of the things I found with ML in general (of whatever type) is that the quality and success of the output usually depend very strongly on exactly what you do as pre-processing before the inputs actually go into the ML / AI. The more you can "help" the ML to get started in the right direction, the better, so that it can focus its efforts on the "important stuff" and doesn't have to waste its time trying to figure out (perhaps unsuccessfully) how to do something that we could have just told it beforehand. Specifically, in this case, you have 2 "raw" inputs, namely the number of bullish tweets and the number of bearish tweets. Although at first glance these might seem like logical choices for input to ML, the problem with using these 2 items as they are is that both of them contain a mixture of 2 different types of info, namely 1) Bearishness vs Bullishness and 2) Changing levels of Enthusiasm for tweeting. My suggestion is to do a little bit of pre-processing to separate these two different aspects BEFORE inputting the data to the ML, as follows:

a) Count Total tweets = Bullish tweets + Bearish tweets.
b) Count NET Bullish tweets = Bullish tweets - Bearish tweets (will be +ve if predominantly Bullish, -ve if predominantly Bearish)
c) Proportion NET Bullish Tweets = b) / a). This is now normalized relative to the total number of tweets and becomes more purely a measure of + or - sentiment itself.
d) Take the Average number of NET Bullish tweets, and then from this take the ratio of Current NET Bullish Tweets to Average NET Bullish tweets. This gives a Short-term measure of day-to-day variability in bullishness or bearishness.
e) Slope = trend of NET bullish tweets, gives a Longer-term measure of changes in bullishness or bearishness.

Items c) d) and e) are now all normalized with respect to the total number of tweets and would potentially be useful as inputs to ML. The confusing factor of how many people are currently tweeting (either way) has been removed. We now have 3 inputs rather than the original 2, and they now contain info in a slightly different form that should be easier for any ML to work with. Please could you try this and see if it helps?
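
The pre-processing steps above can be sketched as follows. This is only an illustration with invented tweet counts; the function name, the 5-day window, and the choice to compute d) and e) on the proportion from c) (so that all three outputs stay independent of tweet volume) are assumptions of this sketch, not something specified in the thread.

```python
import numpy as np

def sentiment_features(bull, bear, window=5):
    """Return (c, d, e) for the latest day from daily bullish/bearish counts."""
    bull = np.asarray(bull, dtype=float)
    bear = np.asarray(bear, dtype=float)
    total = bull + bear                      # a) total tweets
    net = bull - bear                        # b) net bullish tweets
    # c) net bullish proportion in [-1, 1]; set to 0 on days with no tweets
    prop = np.divide(net, total, out=np.zeros_like(net), where=total > 0)
    recent = prop[-window:]
    avg = recent.mean()
    # d) latest value relative to its recent average (undefined if average is 0)
    ratio = prop[-1] / avg if avg != 0 else np.nan
    # e) least-squares slope of the recent values (longer-term trend)
    slope = np.polyfit(np.arange(len(recent)), recent, 1)[0]
    return prop[-1], ratio, slope

bull = [10, 20, 30, 40, 50]   # hypothetical bullish counts, oldest first
bear = [10, 10, 10, 10, 10]   # hypothetical bearish counts
c, d, e = sentiment_features(bull, bear)

# Doubling every count changes tweet volume but not the sentiment mix.
c2, d2, e2 = sentiment_features([2 * b for b in bull], [2 * b for b in bear])
```

Because every feature is built from the proportion, the doubled series yields the same three values as the original one. The ratio in d) becomes unstable when the average is close to zero, so it may need clipping before being fed to a model.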

The other thought I have is that irrespective of whether you are using Naive Bayes or SVM, and Gaussian or Linear or RBF models, these are all Classifiers which, at least as I understand it, are designed to give a binary 1/0 output. Now is this really what you want? If you only want to decide Long or Short, then OK, but in fact we can probably do much better than that. In the context of an Equity Long-Short strategy with a large universe of possible equities to choose from, what we would actually like is a Ranking of all the equities on a continuous (rather than a binary) scale. Then we go Long on the N (however many we want) top-ranking (most bullish) equities, and go Short on the N bottom-ranking (most bearish) equities. So, to do that, what we would need is something that gives a continuous-valued output rather than a 1/0 output.
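
As a rough sketch of the ranking idea, assuming some model has already produced one continuous score per equity (the tickers, scores, and function name below are invented for illustration):

```python
def long_short_weights(scores, n):
    """Equal-weight the n highest scores long and the n lowest scores short."""
    if 2 * n > len(scores):
        raise ValueError("need at least 2 * n scored equities")
    ranked = sorted(scores, key=scores.get)   # ascending by score
    shorts, longs = ranked[:n], ranked[-n:]
    weights = {name: 0.0 for name in scores}
    for name in longs:
        weights[name] = 0.5 / n               # half the book long
    for name in shorts:
        weights[name] = -0.5 / n              # half the book short
    return weights

# Hypothetical continuous model outputs, higher meaning more bullish.
scores = {"AAA": 0.8, "BBB": -0.3, "CCC": 0.1, "DDD": -0.9, "EEE": 0.4}
weights = long_short_weights(scores, n=2)
```

With these numbers AAA and EEE end up long, DDD and BBB short, and CCC is left out; the weights sum to zero and their absolute values sum to one.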

Cheers, best wishes, Tony
(also on Linked-in, see Tony Morland)

luc prieur

Nov 17, 2017

Tony,

As far as feature engineering, yes, your proposal sounds good. My post was not to propose the best features to use, but rather a template algo for others to build on.

As for using classifiers instead of regressors, of course one could use a regressor and use the output amplitude to run a long-short strategy. In my experience, however, ML has fewer problems trying to guess a direction than an amplitude (i.e. tomorrow's stock price). One could use the classifier's confidence level in a long-short strategy.
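
One way the confidence idea could look in practice is sketched below. It assumes the classifier can report a probability of an up move for each stock (scikit-learn classifiers such as GaussianNB expose predict_proba, and SVC does so when constructed with probability=True); the probabilities and the function name here are made up.

```python
import numpy as np

def confidence_weights(p_up):
    """Turn per-stock probabilities of an up move into long-short weights."""
    p_up = np.asarray(p_up, dtype=float)
    signal = 2.0 * p_up - 1.0            # map [0, 1] onto [-1, 1]
    signal = signal - signal.mean()      # center so longs and shorts offset
    gross = np.abs(signal).sum()
    if gross == 0:
        return np.zeros_like(signal)     # no view on any stock
    return signal / gross                # scale to gross exposure of 1

p_up = [0.9, 0.6, 0.5, 0.2]              # hypothetical classifier outputs
w = confidence_weights(p_up)
```

The more confident the classifier is about a stock, the larger the absolute weight, which gives a continuous ranking without switching to a regressor.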

I encourage anyone with an improved algo based on mine to publish it in this thread.

/Luc

luc prieur

Jun 6, 2018

I have played a little bit more with this base algo and used PsychSignal Twitter sentiment as a feature for the SVM classifier. I am also using past returns as features.

So, as it stands now, on each day, for each stock in the pipeline, the algo trains an SVM classifier on past performance and spits out a prediction. That prediction is used as weight for the optimizer. The whole thing works fine unless you turn on slippage and commission, then not so good. That is normal as it trades every day.

If anyone has any idea as to how to improve this, please post your modified algo.

Thanks.

Ivory Ant

Jun 6, 2018

Tried to see if this is viable for a long/short algo with 250 positions on each side, so I applied your 'scanned twitter messages' filter to the QTradableStocksUS universe, limited to 500 stocks total:

    my_screen = QTradableStocksUS() & SimpleMovingAverage(inputs=[st.total_scanned_messages], window_length=60).top(500)

This seems to yield about 300 stocks per day (probably not more in the twitter data feed?), but it times out so this is currently a dead end for real long/short algos.

Daniel Cascio (Blue Robin)

Feb 27, 2019

I've been toying around with your code and noticed that it held up well until approx Oct-31-2018 & Nov-01-2018 when something funky went on. Maybe another set of eyes can help me navigate that downturn without introducing too much over-fitting.

luc prieur

Feb 27, 2019

Daniel,

I am surprised to see such big jump in the P&L curve. Something weird is happening. I'll check. I like the changes you have made to the original code.

/Luc

James Villa

Feb 28, 2019

Hi Luc, Daniel,

I'm glad to see this thread bumped because it reminded me to revisit a variant of Luc's base algo I did back in June 2018 that passed all contest constraints. Made some changes to inputs, training window, constraints but kept standard SVC routine and limited it to trade approximately 95-100 stocks to avoid timeout issues. The live portion of the tearsheet below is just the updated period from which I last ran the backtest, kinda OOS. What do you guys think, overfitted?

luc prieur

Feb 28, 2019

James,

It is difficult to say whether it is overfitting. There are not many tuning parameters aside from the training window length, the SVC parameter C (if used; in my original code it was left at the default, i.e. 1.0) and the choice of features. The universe selection is just there to lower the computation time. It is possible that the current market regime is still being learned by the SVC. As such, I don't see how it could be "manually tuned" to overfit.

Make sure that the features you use (or engineer) are stationary in time. Non-stationary features are not handled well in classical ML (e.g. if you were to use stock prices instead of returns). Also make sure that the training window is long enough that the SVC trains on a sufficient amount of data.
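
A small illustration of the prices-versus-returns point, using an invented price series that grows 2% per period (the helper name is hypothetical):

```python
import numpy as np

def to_returns(prices):
    """Convert a price series into simple one-period returns."""
    prices = np.asarray(prices, dtype=float)
    return prices[1:] / prices[:-1] - 1.0

prices = [100.0, 102.0, 104.04, 106.1208]   # level keeps drifting upward
returns = to_returns(prices)                # stays at roughly 0.02 throughout
```

The price level seen during training never recurs later in the series, while the return stays in the same range, which is what makes it usable as a feature across the whole window. This is only an intuition aid, not a formal stationarity test.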

Thanks for posting your live results. If you could post an older version of your code, that would be nice as well.

James Villa

Feb 28, 2019

Hey Luc,

Thanks for your feedback. I pretty much agree with you that it might be too early to judge overfitting based on the OOS period which was a very tough period including the major correction of Dec. 24 and subsequent slow recovery. If you look at the tail end of OOS performance it looks like it's bouncing back!

My major problem here is limited computational resources, which result in timeout issues. The training window is 252 days and I limit the number of positions to 90-100 so that it does not time out. Given this setup, I get portfolio volatility of ~10%, something I know I could lower if, say, I could extend the training window to 756 days and trade 500 positions without timeout issues.

The key here is the inputs, the secret sauce. As to code revision, it's pretty much the same as what Daniel did with TargetWeights and its added constraints, that's it!

Daniel Cascio (Blue Robin)

Apr 15, 2019

@James, I found myself circling back to this thread and wondering if you have any guidance you can share related to the below.
For example, say I have the code below with two factors: 1) Predictions (SVC) and 2) Alpha (regression). Can I, and how would I, use the output from the Alpha custom factor as an input to the Predictions custom factor, in addition to the total scanned data already being used?

def compute_slope(a):  
    x = np.arange(0, len(a))  
    y = np.array(a)  
    A = np.vstack([x, np.ones(len(x))]).T  
    m, c = np.linalg.lstsq(A, y)[0]  
    return m

class Prediction(CustomFactor):  
    def compute(self, today, asset_ids, out, total_msgs, alpha_f, returns):  
        predictions = []  
        for i in range(returns.shape[1]):  
            try:  
                result = (returns[:, i] > 0) * 2 - 1  
                df = pd.DataFrame(data={'total': total_msgs[:, i].flatten(), \  
                                        'alphas_FF': alpha_f[:, i].flatten(), \  
                                        'result': result.flatten(), \  
                                        'returns': returns[:, i].flatten()})  
                df.fillna(0, inplace=True)

                # before shifting, we must record the last values as they will be used  
                # to run the model on for the prediction.

                df['total_'] = df['total'].rolling(window=5).apply(compute_slope)

                df['total_feature'] = df['total_'].pct_change().shift(1)  
                df['alpha_features'] = df['alphas_FF'].shift(1)  

                df['returns-1'] = df['returns'].shift(1)  
                df['returns-2'] = df['returns'].shift(2)  
                df['returns-3'] = df['returns'].shift(3)

                scaler = MinMaxScaler()  
                features = ['total_feature', \  
                            'returns-1', 'returns-2', 'returns-3']  
                X_live = df[features][-1:]  
                df[features] = df[features].shift(1)  
                df.dropna(inplace=True)  
                X_train = scaler.fit_transform(df[features])  
                y_train = df['result']

                prediction = SVC(). \  
                fit(X_train, y_train). \  
                predict(scaler.transform(X_live))[0]

                predictions.append(prediction)  
            except ValueError:  
                predictions.append(0)  
        out[:] = predictions

class Alpha(CustomFactor):  
    inputs = [USEquityPricing.close]  
    window_safe = True  
    def compute(self, today, assets, out, close):  
        returns = pd.DataFrame(close, columns=assets).pct_change()[1:]  
        spy_returns = returns[symbol('SPY')]  
        # get beta and alpha by running linear regression  
        A = np.vstack([spy_returns, np.ones(len(spy_returns))]).T  
        m, p = np.linalg.lstsq(A, returns)[0]  
        out[:] = p        

def custom_pipeline(context):  
    m = QTradableStocksUS()

    alphas_f = Alpha(window_length = 60)  
    m &= SimpleMovingAverage(inputs=[st.total_scanned_messages], window_length=21).top(100, mask=m)

    # Filter for stocks that are not within 2 days of an earnings announcement.
    # m &= ~((BusinessDaysUntilNextEarnings() <= 2) | (BusinessDaysSincePreviousEarnings() <= 2))

    # Filter for stocks that are announced acquisition target.  
    #m &= ~IsAnnouncedAcqTarget()            

    prediction = Prediction(inputs=[st.total_scanned_messages, \  
                                    alphas_f, \  
                                    DailyReturns()],\  
                                    window_length=100, mask=m)

Daniel Cascio (Blue Robin)

Apr 15, 2019

I think I answered my own question with the attached revised code snippet. I just hope it does not time out while running.

James Villa

Apr 15, 2019

Hi Daniel,

Haha, yes, you just answered your own question! Given that you mask to trade 100 stocks with 4-5 features and a training period of 100 days, you shouldn't time out. My max without a timeout error is 4-5 features, 252 days of training, and 250 stocks traded. Good luck! Kindly post any interesting results.

luc prieur

Apr 15, 2019

I second that as well. That would work. Thanks for posting.

Daniel Cascio (Blue Robin)

Apr 16, 2019

Here's a first cut at what I've come up with so far. I need to take a closer look at the leverage dropping below the minimum limits quite often. However, the only constraints added were sector exposure of +/-10% and a limit on position concentration. As next steps, I may look at: 1) adding one more feature, 2) increasing the training period, 3) increasing the number of positions. I've held out testing past 1/31/2017 as I would like to use the rest for OOS testing.

Note that the results in the attached notebook are not derived from the exact code snippet I posted above, but are similar in that they revolve around sentiment.

If anything sticks out to anyone I encourage the feedback (good or bad) :)

James Villa

Apr 16, 2019

Hi Daniel,

Nice first cut! Regarding leverage dropping below the minimum limits, this is a sign that your factor-based stock selection is biased to one side. You can try adding a maximum leverage of 1 and/or a dollar_neutral constraint and see if that fixes the problem. I would first try increasing your training period and number of positions before adding more features. See if that lowers volatility and increases Sharpe. Did you check whether the results on the holdout data (after 1/31/2017) are consistent with the training results?

Daniel Cascio (Blue Robin)

Apr 17, 2019

Thanks James. I had the max lev set to 1.0 but left dollar_neutral unconstrained. Using the below code brought my leverage within the contest criteria. I agree adding features would be last on the list mentioned. I haven't tested after 1/31/2017 but I will run a tearsheet now to see.


    constraints.append(opt.MaxGrossExposure(1.0))  
 
    constraints.append(opt.DollarNeutral(tolerance = 0.005))
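
For intuition only, the arithmetic behind these two constraints can be sketched on a plain weight vector. This is not the optimizer's implementation, and how exactly the tolerance is measured inside the Optimize API is not stated in this thread; the weights and function name are invented.

```python
import numpy as np

def exposure_report(weights):
    """Return (gross, net) exposure for a vector of portfolio weights."""
    w = np.asarray(weights, dtype=float)
    gross = np.abs(w).sum()   # what a gross-exposure cap of 1.0 limits
    net = w.sum()             # what a dollar-neutral constraint pushes toward 0
    return gross, net

# Hypothetical portfolio tilted to the long side.
gross, net = exposure_report([0.30, 0.20, -0.25, -0.15])
within_gross_cap = gross <= 1.0
roughly_dollar_neutral = abs(net) <= 0.005
```

Here the gross exposure of 0.9 respects the cap, but the net exposure of 0.10 shows the long bias that a dollar-neutral constraint is meant to remove.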

Daniel Cascio (Blue Robin)

Apr 17, 2019

Attached is a tear sheet from OOS data between 1/1/2017 - 4/15/2019. The results appear to hold up reasonably well between the two periods. Note the starting capital was set @ $10mm for each. Some takeaways for me are to smooth out the Annual Volatility and increase number of positions.

James Villa

Apr 17, 2019

Great, Daniel, the OOS holdout results are pretty consistent with the training results. In my experience, once you've established that the model has some predictive power, it can be calibrated toward its optimal performance within the limits of its constraints. As you scale up the number of positions, volatility should be tempered, and this is key because you are targeting risk control. The objective should be higher risk-adjusted returns.

luc prieur

Apr 18, 2019

Hi All,

I have been revisiting this algo. Keep in mind that it was meant to be an algo for someone else to use as a code base. In no way was it meant to trade. That said, one of the big problems now is that the universe was selected pretty much "randomly". I just used pretty much any filter code available to reduce the universe to 50-60 stocks so that the algo would not time out. So using the number of tweets as a filter has no underlying logic to it. It's random and it happens to work OK. I have tried changing the universe to something more basic, like "top 50 market cap", and the whole thing falls apart.

My guess is that someone would need to find a universe that makes sense for the SVM pipeline and that works. Otherwise, this is just overfitting or luck. Or maybe at least find some plausible reason why using message numbers works.

Daniel Cascio (Blue Robin)

Apr 18, 2019

James, do you have any tricks or tips I could leverage from that you used to scale the training period to 252 days and 250 positions? So far I've only been able to train on up to 126 days and couldn't increase # of positions without the timeout.

James Villa

Apr 18, 2019

Hey Daniel,

To work around the timeout issues, first I limit the stock selection to 250 based on some generic factor (e.g. most liquid, top market cap, etc.), and second I compute the input factors within the pipeline, masked by that specific stock selection. It is more about code and resource efficiency. If you're a good Python coder, you can try to move the SVM routine inside BTS (before trading starts), which has a 5-minute limit. I tried but failed.

As Luc just said above, "...someone would need to find a universe that makes sense for the SVM pipeline that works". Hope this helps.

Daniel Cascio (Blue Robin)

Apr 18, 2019

Cool, thanks for the info James and Luc. I'll go back to the drawing board to see what I can work up. I agree in that it comes down to more about code/structure and efficiency.

Daniel Cascio (Blue Robin)

Apr 25, 2019

Wondering if either of you have experienced the same error message I've been receiving when I use functions like the below.

"KeyError: 8554 There was a runtime error"

Specifically, when using --> spy_returns = returns[symbol('SPY')] or spy_returns = returns[sid(8554)] within a custom factor

class Alpha(CustomFactor):  
    inputs = [USEquityPricing.close]  
    window_safe = True  
    def compute(self, today, assets, out, close):  
        returns = pd.DataFrame(close, columns=assets).pct_change()[1:]  
        spy_returns = returns[symbol('SPY')]  
        # get beta and alpha by running linear regression  
        A = np.vstack([spy_returns, np.ones(len(spy_returns))]).T  
        m, p = np.linalg.lstsq(A, returns)[0]  
        out[:] = p

Leo M

Apr 27, 2019

Daniel, SPY is not in your pipeline. You will probably need to add it explicitly using StaticAssets. Something like this (StaticAssets expects an iterable of assets, hence the list):

from quantopian.pipeline.filters import StaticAssets
spy_universe = StaticAssets([symbol('SPY')])
universe = universe | spy_universe
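
To illustrate why the lookup fails, here is a standalone NumPy sketch of the same regression with an explicit check that the benchmark is among the columns. It runs outside Quantopian on synthetic returns; the function name and the non-SPY asset ids are hypothetical, and 8554 is simply the id from the error message above.

```python
import numpy as np

def alpha_beta(returns, asset_ids, benchmark_id):
    """Regress every column of returns on the benchmark column."""
    asset_ids = list(asset_ids)
    if benchmark_id not in asset_ids:
        # Same situation as the KeyError: the benchmark was filtered out.
        raise KeyError(benchmark_id)
    bench = returns[:, asset_ids.index(benchmark_id)]
    A = np.vstack([bench, np.ones(len(bench))]).T
    # Row 0 of the solution holds the slopes (beta), row 1 the intercepts (alpha).
    beta, alpha = np.linalg.lstsq(A, returns, rcond=None)[0]
    return alpha, beta

rng = np.random.default_rng(0)
bench = rng.normal(0.0, 0.01, size=60)   # synthetic benchmark returns
other = 0.001 + 1.5 * bench              # asset with alpha 0.001 and beta 1.5
returns = np.column_stack([bench, other])
alpha, beta = alpha_beta(returns, asset_ids=[8554, 24], benchmark_id=8554)
```

If the benchmark id is missing from asset_ids, the function raises KeyError up front, which mirrors what happens inside the custom factor when SPY is masked out of the pipeline universe.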
