Segaran's Non-Negative Matrix Factorization Implementation

Hi all,

Thought I'd share an implementation of non-negative matrix factorization (NNMF)... that's a mouthful, so perhaps an explanation is due.

First and foremost, credit for the original algorithm design in the source code goes to Toby Segaran, author of Collective Intelligence, an excellent book on machine learning. My primary value-add is really just the implementation on this platform (which, by the way, makes it much easier to test and experiment with Segaran's example).

Motivation

We are looking to identify dates on which some event appears to have driven up trading volume across a group of stocks. It is important to differentiate this from correlation. Whereas computing the correlation between two time series of volume would "average out" periods of high co-movement and periods of low co-movement, this algorithm searches specifically for the periods where the trading volumes moved together. While it is not a trading strategy in its own right, it would be a good base for researching one. For instance, suppose you found a series of dates on which Apple's stock experienced high-volume trading and, on those same dates, so did Microsoft's. Yet Microsoft's and Apple's volumes don't always move together: often Apple will have a spike in volume that Microsoft doesn't experience, and vice versa. So what's happening on the days where they do move together? Perhaps Apple's trading volume spikes on days its board makes announcements, but only certain kinds of announcements affect Microsoft's trading. With some research (i.e. finding what announcements were made on the dates the algorithm spits out), this algorithm is one way to identify which sorts of announcements affect both stocks and which don't.

I think you are best served experimenting with this algorithm after reading Chapter 10 of Segaran's Collective Intelligence; I believe the book is now available for free as a PDF, and it is an excellent primer on machine learning.
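For anyone who doesn't have the attached algorithm handy, here is a minimal sketch of the core factorization step. It assumes a NumPy matrix of daily volumes with dates as rows and stocks as columns, and it uses the standard multiplicative update rules that Segaran's example is built on; the names here (factorize, volume_matrix, num_features) are mine, not taken from the attached code.

```python
import numpy as np

def factorize(v, num_features=8, iterations=50, eps=1e-9):
    """Approximate v (dates x stocks) as weights @ features.

    weights[d, f]  - how strongly latent "event" f is present on date d
    features[f, s] - how strongly that event shows up in stock s's volume
    """
    num_dates, num_stocks = v.shape
    weights = np.random.rand(num_dates, num_features)
    features = np.random.rand(num_features, num_stocks)

    for _ in range(iterations):
        # Multiplicative updates (Lee & Seung) keep both matrices
        # non-negative while reducing the reconstruction error ||v - WH||.
        features *= (weights.T @ v) / (weights.T @ weights @ features + eps)
        weights *= (v @ features.T) / (weights @ features @ features.T + eps)

    return weights, features

# Hypothetical usage: volume_matrix would be built from daily volumes for a
# basket of tickers, e.g. a pandas DataFrame's .values array.
# weights, features = factorize(volume_matrix, num_features=8)
```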

3 responses

That sounds fascinating. I definitely need to do the reading you suggest. It looks like the O'Reilly version is here, and the free download is here.

Very interesting. I took the liberty of pythonizing the code a little to improve readability.

I think matrix factorization methods have a lot of potential in algorithmic trading. One interesting blog post I found uses Robust PCA to do automatic anomaly detection.

Regarding your algorithm, can you provide some intuition for what the factor loadings mean? Also, do you have more ideas on how to find the events that trigger co-movement in the volume?

Interesting article and thanks for cleaning up the code! Much more readable.

To answer your questions... I think your first question is quite difficult. Because I'm choosing an arbitrary number of factors, and because some of the features appear to affect only one stock, it's not entirely clear what a "heavily weighted feature" actually means. I would compare the "features" to the output of a similar unsupervised method like k-means clustering.

I would venture to say that features that have a heavy weight on only one stock are essentially meaningless. In a sense, these factors are like "residuals": any sudden anomaly in one stock's trading volume that has no counterpart in any other stock can be conveniently grouped into such a 'feature'. The problem is that there's no reason to believe these anomalies share the same underlying cause. Far more interesting are the features that appear to affect multiple stocks at once. It's particularly interesting when a feature marks high trading volume across some (but not all) stocks in the same industry.
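As a rough illustration of that filtering step, the sketch below keeps only the features whose loadings are spread across at least two stocks; the helper name, the per-feature normalization, and the 0.5 cutoff are my own arbitrary choices, not part of the original algorithm.

```python
import numpy as np

def multi_stock_features(features, tickers, threshold=0.5):
    """Return (feature index, tickers) pairs for features that load heavily
    on two or more stocks, dropping the single-stock 'residual' features
    described above. `features` is the (num_features x num_stocks) matrix
    from the factorization sketch; `threshold` is an arbitrary cutoff."""
    interesting = []
    for f_idx, row in enumerate(features):
        # Normalize loadings within the feature so the cutoff is comparable
        loadings = row / (row.max() + 1e-9)
        heavy = [tickers[s] for s, w in enumerate(loadings) if w >= threshold]
        if len(heavy) >= 2:
            interesting.append((f_idx, heavy))
    return interesting
```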

In terms of finding events that trigger co-movement in volume, I think there are many ways to approach this. Personally, I've done some work with data scraping, news and social media data in particular; I find that a good starting point because of its simplicity, and because it makes intuitive sense. This algorithm spits out the dates on which particular features are strongest and the stocks to which a particular feature applies heavily. Assuming we choose a feature that applies heavily to at least two stocks, I would suggest that the best way to determine what (if anything) that feature "means" (and hence how to find it in the future) is to scrape news releases from, say, the day before, the day of, and the day after. Next, we look for similarities between those news releases. We also look for differences between them and news releases from other dates that had high trading volume in one of the stocks this feature loads heavily on, but which the algorithm presumably did not identify.
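To make that workflow concrete, here is a small, hypothetical sketch of the first step: pulling the dates on which a chosen feature was strongest, so news in a one-day window around each date can be collected and compared. The function name and parameters are illustrative only.

```python
import numpy as np

def top_dates_for_feature(weights, dates, feature_idx, n=5):
    """`weights` is the (dates x features) matrix from the factorization;
    `dates` is the list of trading days in the same row order."""
    order = np.argsort(weights[:, feature_idx])[::-1][:n]
    return [dates[i] for i in order]
```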

Of course, it's very likely we won't be able to distinguish anything just by manually looking (or, even more likely, there will be far more information than can be manually processed); this is where I've used natural language processing in the past.