Batch-transform version of scikit-learn example ("finding co-fluctuating stocks")

This replicates James Jack's algorithm, which uses scikit-learn to estimate clusters of covarying stocks.

This rewrite simplifies the code by using the new batch_transform. In addition, the clusters are continually re-estimated on the newest data.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

23 responses

James Jack

Hi Thomas,

My old code had a bug in it. I think lines 48-57 need to be changed to:

    # filter the sids into groups, in order they appear in c.sids  
    for i, grp_idx in enumerate(labels):  
        groups[grp_idx].append( int(c.sids[i]) )

and pass in the context "c" to the batch transform. Otherwise the order you iterate through SIDs in data.variation is different to the order they appear in c.sids. I think...
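
The ordering point can be illustrated with a minimal sketch. It assumes the labels come back in the same order as the columns that were fed to the model; `group_by_label` and its arguments are hypothetical names, not part of the posted algorithm.

```python
from collections import defaultdict


def group_by_label(labels, names):
    """Group names by cluster label, pairing them strictly by position.

    Assumption: labels[i] belongs to names[i], i.e. both follow the
    column order of the matrix that was clustered.
    """
    if len(labels) != len(names):
        raise ValueError("labels and names must have the same length")
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[label].append(name)
    return dict(groups)
```

Taking `names` from the same object that supplied the data columns, rather than from a separately iterated container, is one way to keep the two orderings from drifting apart.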

Any idea why data.variation is of length window_length+1 ?

Thomas Wiecki

Yeah, there is a bug in the batch_transform that the window size is not always constant. However, this is fixed and will be updated shortly.

Feel free to clone this and fix the bug you found!

Grant Kiehne

Hello Thomas,

This goes back a bit, but could this clustering be applied to Nasdaq 100 list recently posted?

Grant

Thomas Wiecki

Hi Grant,

Most certainly. This is an interesting algorithm and I look forward to hearing how this does on the Nasdaq.

Thomas

Grant Kiehne

Thomas,

Upon cloning and running as-is, I got this error:

67  Error   Nonexistent property: close

Any ideas?

Grant

Thomas Wiecki

Just ran this (cloned from some time ago) and it seems to run just fine. Not sure if there's a difference?

Grant Kiehne

Thanks...I cloned the algo you posted immediately above and it runs. --Grant

Grant Kiehne

Your new code has this:

data[s]['variation'] = (data[s].close_price - data[s].open_price)

The code I cloned originally has:

data[s]['variation'] = (data[s].close - data[s].open)

Grant Kiehne

Here's the algo with the Nasdaq 100 stocks. Unfortunately, the log output gets clipped:

2013-01-29 batch_cluster:60 DEBUG Found 26 groups from 12 complete histories  
2013-01-29 PRINT Cluster 1: 24, 114, 1787, 8677, 23709  
2013-01-29 PRINT Cluster 2: 630, 2696, 3951  
2013-01-29 PRINT Cluster 3: 20680, 8816, 4246, 6683  
2013-01-29 PRINT Cluster 4: 9883, 12652  
2013-01-29 PRINT Cluster 5: 3806, 40207, 22316  
2013-01-29 PRINT Cluster 6: 1419, 15101  
2013-01-29 PRINT Cluster 7: 17632, 9736, 19725  
2013-01-29 PRINT Cluster 8: 122, 38650, 39095, 36930, 2618, 13862  
2013-01-29 PRINT Cluster 9: 14014, 24819, 24482  
2013-01-29 PRINT Cluster 10: 368, 25317, 23906, 25339, 13905, 6413  
2013-01-29 PRINT Cluster 11: 2663, 20208, 7671  
2013-01-29 PRINT Cluster 12: 42950, 4485  
2013-01-29 PRINT Cluster 13: 328, 14328, 2853, 5061, 39840, 14848  
2013-01-29 PRINT Cluster 14: 18870, 3212, 8655, 43919  
2013-01-29 PRINT Cluster 15: 1900, 27543, 43405, 5121, 5767, 8017, 8132  
2013-01-29 PRINT Cluster 16: 739, 27357, 13940  
2013-01-29 PRINT Cluster 17: 4668, 3450, 7061, 8158  
2013-01-29 PRINT Cluster 18: 22802, 8352  
2013-01-29 PRINT Cluster 19: 32301, 5166, 24518  
2013-01-29 PRINT Cluster 20: 27533, 1637, 12213, 5787, 26169  
2013-01-29 PRINT Cluster 21: 8857, 19917, 6295, 11901  
2013-01-29 WARN Logging limit exceeded; some messages discarded

Have to figure out a work-around. Is the limit on number of lines of text? Or total characters? Something else?

Grant

Peter Cawthron

Hello Grant,

Is this the 22-line limit?

Logging is rate-limited (throttled) for performance reasons. The basic limit is two log messages per call of initialize and handle_data. Each backtest has an additional buffer of 20 extra log messages. Once the limit is exceeded, messages are discarded until the buffer has been emptied. A message explaining that some messages were discarded is shown.
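
As a rough illustration of the arithmetic, here is a deliberately simplified model of a single handle_data call. It assumes the per-call allowance is spent first and the per-backtest buffer second, and it does not model how or when the buffer refills. `messages_shown` is a hypothetical helper, not a Quantopian function.

```python
def messages_shown(n_messages, per_call=2, buffer_remaining=20):
    """Return (shown, buffer_left) for one call under the simplified model.

    Assumption: the first `per_call` messages are free and any further
    messages are drawn from the remaining buffer until it is empty.
    """
    shown = min(n_messages, per_call + buffer_remaining)
    used_from_buffer = max(0, shown - per_call)
    return shown, buffer_remaining - used_from_buffer
```

Under this model, one DEBUG line plus 26 cluster lines is 27 messages, of which 22 would be shown. That is consistent with the log above, which stops after the DEBUG line and Cluster 21.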

Grant Kiehne

Thanks Peter,

Yep...figured that out this morning, after looking at the help page. I found a way of printing out all of the results, but the formatting was kinda ugly. I'll give it another go when I get the chance.

Grant

Grant Kiehne

Hello Thomas,

Would it be correct to assume that within a given cluster, there might be a trading pair (like GLD & GDX were)? If so, my thought is to ignore all of the clusters with only one stock, and then look for pairs within the remaining clusters. I came across this:

https://www.leinenbock.com/adf-test-in-python/

What do you think? Could it be applied here?
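
The filtering step described above (skip single-stock clusters, then enumerate pairs inside the rest) could look roughly like this. It is only a sketch: `candidate_pairs` is a hypothetical name, and the ADF or cointegration test that would rank the pairs is not included.

```python
from itertools import combinations


def candidate_pairs(groups):
    """List every within-cluster pair, ignoring clusters with one member.

    `groups` maps a cluster label to a list of symbols or sids.
    """
    pairs = []
    for label in sorted(groups):
        members = groups[label]
        if len(members) < 2:
            continue  # a single stock cannot form a pair
        pairs.extend(combinations(sorted(members), 2))
    return pairs
```

Each returned pair would still need its own stationarity check on the spread before being treated as tradable.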

Grant

Grant Kiehne

I tweaked the output code so that all clusters are displayed in the log:

    result = '------------------\n'  
    if groups is not None:  
        # display stock sids that co-fluctuate:  
        for i, g in groups.iteritems():  
            result = result + 'Cluster %i: %s\n' % ((i + 1), ", ".join([str(s) for s in g]))  
        print result

I also made a change so that security symbols are displayed:

    # filter the sids into groups, in order they appear in c.sids  
    groups = defaultdict(lambda: [])  
    for i, grp_idx in enumerate(labels):  
        # groups[grp_idx].append( int(c.sids[i]) )  
        groups[grp_idx].append( c.sids[i].symbol )

Grant

Grant Kiehne

Here's the output from the attached backtest:

2013-10-10 PRINT------------------
Cluster 1: AAPL, CMCS, VRSK
Cluster 2: ADI, ADSK, AMGN, INTC
Cluster 3: ADP, CTRX, CTXS, MYL, NFLX
Cluster 4: BBBY
Cluster 5: BIDU, BRCM, CHTR, CSCO, DTV, MNST, NUAN
Cluster 6: AMAT, CTSH, FAST
Cluster 7: DELL, DLTR, GOLD, SRCL
Cluster 8: ALXN, ESRX
Cluster 9: EXPD, GRMN, LINT, LMCA, TSLA
Cluster 10: CHKP, FFIV, FISV, GILD
Cluster 11: CERN, CHRW, EQIX, FOXA, MDLZ, PCAR, SNDK
Cluster 12: GOOG
Cluster 13: AVGO, CA, DISC, HSIC, KLAC, MAT, MCHP, NTAP, QCOM, XRAY
Cluster 14: KRFT
Cluster 15: ADBE, ISRG, MSFT, ORLY
Cluster 16: ATVI, LBTY, MU
Cluster 17: AMZN, EBAY, LLTC, MXIM, SYMC
Cluster 18: ALTR, FB, GMCR, NVDA, SHLD
Cluster 19: AKAM, BIIB, PCLN, WFM
Cluster 20: CELG, SBAC, TXN, VOD, WYNN
Cluster 21: SBUX, SIAL
Cluster 22: PAYX, REGN, SIRI, STX, WDC, XLNX
Cluster 23: EXPE, FOSL, ROST, SPLS, VRTX
Cluster 24: COST, INTU, VIAB, YHOO

Any ideas on how to interpret the results? For example, when I dig into Cluster 5, we have:

BIDU (Baidu) - Chinese-language Internet search provider
BRCM (Broadcom Corporation) - global semiconductor solution for wired and wireless communications
CHTR (Charter Communications) - provides cable services in the United States, offering a range of entertainment, information and communications solutions to residential and commercial customers
CSCO (Cisco Systems) - designs, manufactures, and sells Internet protocol (IP)-based networking and other products related to the communications and information technology (IT) industry and provide services associated with these products and their use
DTV (DIRECTV) - provides digital television entertainment in the United States and Latin America
MNST (Monster Beverage Corporation) - develops, markets, sells and distributes alternative beverages
NUAN (Nuance Communications) - a provider of voice and language solutions for businesses and consumers globally

The cluster sorta makes sense, except for MNST, which is a beverage company. Does the algorithm also provide a measure of the strength of the individual cluster members? For example, within Cluster 5, are there securities that are more confidently assigned to the cluster, with other securities having a weaker association?
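
One rough heuristic for membership strength, offered only as a sketch and not something the posted algorithm computes: scikit-learn's `affinity_propagation` returns the exemplar indices as its first value (the one the algorithm discards as `_`), so each security's similarity to the exemplar of its own cluster can be read from the matrix that was clustered. The helper name and the choice of similarity as the strength measure are assumptions.

```python
def membership_strength(similarity, labels, exemplars):
    """Return each item's similarity to the exemplar of its own cluster.

    similarity: square matrix (nested sequences) that was clustered
    labels:     cluster label for each item
    exemplars:  index of the exemplar item for each cluster label
    """
    return [similarity[i][exemplars[label]] for i, label in enumerate(labels)]
```

A member whose value is low relative to the rest of its cluster would be a candidate for a weak association, though whether that explains a case like MNST is untested here.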

Grant

Thomas Wiecki

Hi Grant,

This looks pretty cool! The clusters might be good candidates for pair-trading (e.g. https://www.quantopian.com/posts/fixed-version-of-ernie-chans-gold-vs-gold-miners-stat-arb) so it'd be neat to extend the algo in that way.

Thomas

Dan Dunn

I think you want sid(1406) for CELG, not sid(40207). sid(40207) is actually CELGZ. We have a data problem where they are both reporting as CELG that I haven't yet resolved.

Grant Kiehne

Thanks...Grant

Grant Kiehne

Hello Thomas & all,

I'm considering updating this algo to use the history API, unless someone has already done it. Thoughts?

I figure I'll learn more Python, and get a better feel for this clustering magic.

Grant

Thomas Wiecki

Hi Grant,

I think that's an excellent idea and should be pretty straightforward. In principle you could just remove the batch_transform decorator and manually call the function, passing in the DataFrame returned by history().

Thomas

Grant Kiehne

Thanks Thomas,

That's what I thought. And when (if ever) the history API supports minute-level data, the algo can be run at a higher frequency.

Grant

Grant Kiehne

Hello Thomas,

Here's a rough cut. I get a build error "59 Error Runtime exception: ValueError: array must not contain infs or NaNs" triggered by the line:

edge_model.fit(X)

Any idea what's going on? My understanding is that the code is actually run in some fashion as part of the build, right? However, it's challenging to debug, since the log output is suppressed.

Grant

from sklearn import cluster, covariance  
import numpy as np  
from collections import defaultdict

# based on the example at:  
# http://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html

# use in quick backtester

update_period = 5*390 # update clusters at this period in minutes

def initialize(context):  
    c = context  
    # Nasdaq 100 from https://www.quantopian.com/posts/list-of-nasdaq-100-sids-to-use-in-your-algo  
    # c.sids = [sid(24),    sid(114),   sid(122),   sid(630)  , sid(67),  
              # sid(20680), sid(328),   sid(14328), sid(368),   sid(16841),  
              # sid(9883),  sid(337),   sid(38650), sid(739),   sid(27533),  
              # sid(3806),  sid(18529), sid(1209),  sid(40207), sid(1419),  
              # sid(15101), sid(17632), sid(39095), sid(1637),  sid(1900),  
              # sid(32301), sid(18870), sid(14014), sid(25317), sid(36930),  
              # sid(12652), sid(26111), sid(24819), sid(24482), sid(2618),  
              # sid(2663),  sid(27543), sid(1787) , sid(2696),  sid(42950),  
              # sid(20208), sid(2853),  sid(8816),  sid(12213),  sid(3212),  
              # sid(9736),  sid(23906), sid(26578), sid(22316), sid(13862),  
              # sid(3951),  sid(8655),  sid(25339), sid(4246),  sid(43405),  
              # sid(27357), sid(32046), sid(4485),  sid(43919), sid(4668),  
              # sid(8677),  sid(22802), sid(3450),  sid(5061),  sid(5121),  
              # sid(5149),  sid(5166),  sid(23709), sid(13905), sid(19926),  
              # sid(19725), sid(8857),  sid(5767),  sid(5787),  sid(19917),  
              # sid(6295),  sid(6413),  sid(6546),  sid(20281), sid(6683),  
              # sid(26169), sid(6872),  sid(11901), sid(13940), sid(7061),  
              # sid(15581), sid(24518), sid(7272),  sid(39840), sid(7671),  
              # sid(27872), sid(8017),  sid(38817), sid(8045),  sid(8132),  
              # sid(8158),  sid(24124), sid(8344),  sid(8352),  sid(14848)]  
    c.sids = []  
    # some sids to look at  
    c.sids.append(sid(24))  
    c.sids.append(sid(18522))  
    c.sids.append(sid(5061))  
    c.sids.append(sid(20486))  
    c.sids.append(sid(5885))  
    c.sids.append(sid(4707))  
    c.sids.append(sid(3149))  
    context.elapsed_minutes = 0  

# @batch_transform(refresh_period=5, window_length=12)  
def batch_cluster(attribute,context):  
    c = context  
    # tell it we're looking for a graph structure  
    edge_model = covariance.GraphLassoCV()  
    X = attribute.values.copy()  
    X /= X.std(axis=0)  
    edge_model.fit(X)  
    # now process into clusters based on co-fluctuation  
    _, labels = cluster.affinity_propagation(edge_model.covariance_)  
    log.debug("Found {0} groups from {1} complete histories".format(max(labels)+1,len(attribute)))  
    # filter the sids into groups, in order they appear in c.sids  
    groups = defaultdict(lambda: [])  
    for i, grp_idx in enumerate(labels):  
        # groups[grp_idx].append( int(c.sids[i]) )  
        groups[grp_idx].append( c.sids[i].symbol )  
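    return groups  

def handle_data(context, data):  
    # count every bar; incrementing only after the early return below would  
    # leave the counter stuck at 1, so the clusters would never be re-estimated  
    context.elapsed_minutes += 1  
    if (context.elapsed_minutes - 1) % update_period != 0:  
        return  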
    prices_open = history(13, '1d', 'open_price',ffill=True)[0:-1]  
    prices_close = history(13, '1d', 'close_price',ffill=True)[0:-1]  
    prices_delta = prices_close - prices_open  
    # print prices_delta  
    #  
    # return  
    c = context  
    # for s in c.sids:  
        # if s in data:  
            # # add the day's price range to the list for this sid  
            # data[s]['variation'] = (data[s].close_price - data[s].open_price)  
    # note that the model wont work if there are different number of  
    # entries in data.variation.  
    groups = batch_cluster(prices_delta,c)  
    # if groups is not None:  
        # # display stock sids that co-fluctuate:  
        # for i, g in groups.iteritems():  
            # print 'Cluster %i: %s' % ((i + 1), ", ".join([str(s) for s in g]))  
        # # ...ADD ORDER CODE HERE...  
    result = '------------------\n'  
    if groups is not None:  
        # display stock sids that co-fluctuate:  
        for i, g in groups.iteritems():  
            result = result + 'Cluster %i: %s\n' % ((i + 1), ", ".join([str(s) for s in g]))  
        print result

Thomas Wiecki

Hi Grant,

You might have to call .dropna() on the DataFrame.
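
A minimal sketch of what that cleanup might look like before the fit, assuming the input is a DataFrame of daily price changes with one column per security. `clean_and_standardize` is a hypothetical helper, and dropping whole columns (rather than rows) is an assumption. Note that pandas' `std` uses ddof=1, unlike the NumPy call in the algorithm above.

```python
import numpy as np
import pandas as pd


def clean_and_standardize(deltas):
    """Drop unusable columns, then scale each remaining column to unit std."""
    # treat +/-inf as missing so that dropna() removes them as well
    cleaned = deltas.replace([np.inf, -np.inf], np.nan)
    # assumption: drop every security (column) that has any missing value
    cleaned = cleaned.dropna(axis=1, how='any')
    # a constant column has zero std and would become NaN/inf when divided
    std = cleaned.std(axis=0)
    cleaned = cleaned.loc[:, std > 0]
    return cleaned / cleaned.std(axis=0)
```

If columns are dropped this way, the cluster labels line up with the cleaned frame's columns rather than with c.sids, so the grouping step would need to read names from the cleaned frame.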

Thomas

Grant Kiehne

Got it working. However, I still get the error when I comment out these lines:

    if np.any(pd.isnull(X_zscore)):  
        print 'null found in X_zscore'  
        return None

But I don't see 'null found in X_zscore' in the log output. So, something ain't right with the build process, it seems.

By the way, I added:

import pandas as pd  
pd.set_option('use_inf_as_null',True)

This was based on the guidance on http://pandas.pydata.org/pandas-docs/dev/missing_data.html that .isnull() will ignore inf and -inf otherwise.

Grant

Log output:

2013-09-03 batch_cluster:70 DEBUG Found 2 groups from 12 complete histories
2013-09-03 PRINT ------------------
Cluster 1: ARMH
Cluster 2: AAPL, MSFT, FCS, PEP, MCD, GE
2013-09-10 batch_cluster:70 DEBUG Found 3 groups from 12 complete histories
2013-09-10 PRINT ------------------
Cluster 1: ARMH
Cluster 2: AAPL, MSFT, FCS, PEP, MCD
Cluster 3: GE
2013-09-17 batch_cluster:70 DEBUG Found 3 groups from 12 complete histories
2013-09-17 PRINT ------------------
Cluster 1: ARMH, PEP, MCD, GE
Cluster 2: MSFT
Cluster 3: AAPL, FCS
2013-09-24 batch_cluster:70 DEBUG Found 3 groups from 12 complete histories
2013-09-24 PRINT ------------------
Cluster 1: AAPL, FCS, GE
Cluster 2: PEP
Cluster 3: ARMH, MSFT, MCD
2013-10-01 batch_cluster:70 DEBUG Found 3 groups from 12 complete histories
2013-10-01 PRINT ------------------
Cluster 1: ARMH
Cluster 2: AAPL, MSFT, FCS, MCD, GE
Cluster 3: PEP
2013-10-08 batch_cluster:70 DEBUG Found 2 groups from 12 complete histories
2013-10-08 PRINT ------------------
Cluster 1: AAPL
Cluster 2: ARMH, MSFT, FCS, PEP, MCD, GE
2013-10-15 batch_cluster:70 DEBUG Found 2 groups from 12 complete histories
2013-10-15 PRINT ------------------
Cluster 1: AAPL
Cluster 2: ARMH, MSFT, FCS, PEP, MCD, GE
2013-10-22 batch_cluster:70 DEBUG Found 3 groups from 12 complete histories
2013-10-22 PRINT ------------------
Cluster 1: ARMH, GE
Cluster 2: MSFT
Cluster 3: AAPL, FCS, PEP, MCD
2013-10-29 batch_cluster:70 DEBUG Found 3 groups from 12 complete histories
2013-10-29 PRINT ------------------
Cluster 1: ARMH, PEP, MCD, GE
Cluster 2: MSFT
Cluster 3: AAPL, FCS
End of logs.
