K-means Clustering Help

Hi guys,

I am completely new to Quantopian! Recently, I have been trying to see if I can group comparable companies together using k-means clustering. I decided to test it on the financial industry first. I used variables like enterprise_value, market_cap, sustainable growth rate, ROA, ROE, and ROIC as factors to group different firms together. Then, within each cluster, I would long/short the top/bottom EV/EBITDA firms (i.e., the undervalued/overvalued firms).

The problem is that I had to convert the Pipeline DataFrame to an array in order to run k-means clustering from the sklearn library. Unfortunately, after I assigned each firm a cluster label, I could not get the labels back into the original DataFrame, which had the securities as its index and the column titles.

I hope that you guys can help me figure out the next step.

Thank you very much,
Thanh Duong

7 responses

Hey Thanh,

If I understood your problem correctly, this can be solved very easily with simple column assignment. I assigned your cluster array to a new column titled 'Cluster'. See attached notebook for results.
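For readers without the notebook, here is a minimal sketch of the column-assignment idea. The DataFrame below is a hypothetical stand-in for the Pipeline output (made-up security names and random factor values), and the cluster count of 4 is arbitrary; it is not the code from the attached notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the Pipeline output: one row per security,
# one column per factor. Real Pipeline data would replace this.
rng = np.random.RandomState(0)
result = pd.DataFrame(
    rng.rand(20, 3),
    index=["SEC%02d" % i for i in range(20)],
    columns=["market_cap", "roa", "roe"],
)

# Fit on the raw values; labels_ comes back in the same row order as the input.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(result.values)

# Assigning the label array to a new column matches rows by position,
# so the security index and the original column titles are preserved.
result["Cluster"] = kmeans.labels_

print(result.head())
```

Because the labels are in the same order as the rows passed to fit, nothing has to be converted back: the array is simply attached to the DataFrame it came from.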

I also deleted some duplicate imports (without marking them) and commented out the old code that I replaced, so you can see what I removed from the main body of your notebook.

Hope this helps you.
Cheers.
Nick

Hi Nick, thank you so so much for your help!

Hi Karl, 50 is just a number that I chose arbitrarily. There are ways to find the optimal number of clusters automatically in Python, but I haven't looked into them yet!
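One common heuristic for this (not necessarily the one anyone in this thread had in mind) is to try several values of k and keep the one with the highest silhouette score. A minimal sketch, assuming scikit-learn and synthetic data with three obvious groups standing in for the factor matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups of 30 points each.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.randn(30, 2) for c in centers])

# Score each candidate k; a higher silhouette means tighter, better-separated clusters.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

On real factor data the peak is usually much less clear-cut than on this toy example, so the score is a guide rather than a definitive answer.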

You could try hierarchical clustering, which doesn't require you to specify the number of clusters. For the purpose of your strategy, K-means probably works well though.
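A minimal sketch of what that could look like, assuming SciPy's hierarchy module and synthetic data. Note that while no cluster count is given up front, you still have to choose where to cut the tree; the threshold of 20.0 below is an assumed value picked for this toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data: three well-separated groups of 30 points each.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.randn(30, 2) for c in centers])

# Ward linkage builds the full merge tree; no cluster count is needed here.
Z = linkage(X, method="ward")

# Cut the tree at a distance threshold; the number of clusters falls out of the cut.
labels = fcluster(Z, t=20.0, criterion="distance")
print(len(set(labels)))
```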

I see, Thanh. Yes, that is a good starting point to get it working, and Nick has shown in clear steps how the ['Cluster'] column is added to the result DataFrame.

I was trying to condense the steps into a single-line Python statement:

result['Cluster'] = np.array(KMeans(n_clusters=50).fit(result.values).labels_).reshape((-1, 1))  

The results are quite different from Nick's. I am sure I have missed or misplaced something; see the last cell in the attached notebook.

Karl, thank you very much for your input. The result is different because each run of k-means clustering, even on the same set of data, can give you a different set of clusters. I am not sure how to fix this, but overall I think the sets of clusters are pretty similar.
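One common way to make the runs repeatable is to fix the random seed. A minimal sketch, assuming scikit-learn's KMeans and random stand-in data; the seed value 42 and the cluster count of 5 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for the factor matrix.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)

# With the same random_state, the centroid initialization is the same,
# so two separate fits on the same data give the same labels.
first = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X).labels_
second = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X).labels_

print(np.array_equal(first, second))
```

Fixing the seed makes results reproducible; it does not make them more correct, so comparing a few different seeds is still a reasonable sanity check.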

Luca, yes, I have heard of hierarchical clustering. Hopefully I can use it one day. Thank you very much!

UPDATE: I have attached the completed algorithm. One concern I had is that whenever I use order_target_percent(stock, 1/len(context.groups)), in which context.groups is the list of securities that I want to trade, the algorithm did not buy anything at all! So I had to use order_target_percent(stock, 0.02) instead, an estimate of what each security's weight should be. Nick, Luca, Karl, do you guys know what the problem is?

Thank you all very much!

Some notes on the source code were wrong. I was trading the bottom 25% and 10% EV/EBITDA instead of the top.
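For anyone following along without the attached algorithm, here is a minimal sketch of picking the lowest and highest EV/EBITDA names within each cluster. The frame, the column name ev_to_ebitda, and the 25% cutoffs are assumptions for illustration, not the code from the attachment.

```python
import pandas as pd

# Hypothetical frame: one row per security, with a cluster label and EV/EBITDA.
df = pd.DataFrame(
    {
        "Cluster": [0, 0, 0, 0, 1, 1, 1, 1],
        "ev_to_ebitda": [4.0, 8.0, 12.0, 16.0, 3.0, 6.0, 9.0, 30.0],
    },
    index=["A", "B", "C", "D", "E", "F", "G", "H"],
)

# Percentile rank of EV/EBITDA within each cluster (0 < rank <= 1).
pct = df.groupby("Cluster")["ev_to_ebitda"].rank(pct=True)

# Assumed rule: bottom quarter of each cluster is the cheap side,
# top quarter is the expensive side.
cheapest = df.index[pct <= 0.25]
priciest = df.index[pct > 0.75]

print(list(cheapest), list(priciest))
```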

whenever I use order_target_percent(stock, 1/len(context.groups)), ... the algorithm did not buy anything at all

It's a Python 2 thing: dividing one integer by another returns an integer, so 1/len(context.groups) comes out as 0 whenever there is more than one security. This print will make it clear:

    print 1/len(context.groups)  # Python 2: prints 0 when there is more than one security
    order_target_percent(stock, 1/len(context.groups))  # was 0.02

Use this instead and the orders go through:

1.0/len(context.groups)

Sometimes you'll see that written without the 0, as just 1. with a trailing dot, which likewise makes the result a floating-point number.
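A small sketch of the difference, written so it runs the same on Python 2 and Python 3. The // operator is used to reproduce what Python 2's plain / does with two integers, and the groups list is a hypothetical stand-in for context.groups.

```python
groups = ["A", "B", "C", "D"]  # hypothetical stand-in for context.groups

# Floor division: this is what 1/len(groups) evaluates to in Python 2.
floor_weight = 1 // len(groups)

# Making either operand a float forces true division.
true_weight = 1.0 / len(groups)
dot_weight = 1. / len(groups)  # same thing, written with the bare dot

print(floor_weight)
print(true_weight)
```

With a target weight of 0, order_target_percent has nothing to buy, which matches the behavior described above.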