help w/ universe sort algorithm

All,

I'm working on an algorithm to sort a set of securities (selected using set_universe) by comparing their price (or volume, etc.) histories. The basic outline is:

  1. Get a trailing window of prices for all securities.
  2. Code the prices with z-scores (see Ref. 2 in algorithm).
  3. Convert the coded prices into text strings.
  4. Rank the securities (sids) based on a similarity comparison of the text strings.

One application would be to identify outliers that are not following the overall market trend.
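For concreteness, steps 1-3 might look something like this sketch (the bin count, the ±2-sigma clipping, and the digit coding are my illustrative assumptions, not necessarily the algorithm's actual choices):

```python
import numpy as np

def code_prices(prices, n_bins=10):
    """Z-score each security's trailing price window, then quantize the
    z-scores into single-digit symbols so each column (security) becomes
    a text string suitable for compression-based comparison."""
    # z-score each column over the trailing window
    z = (prices - prices.mean(axis=0)) / prices.std(axis=0)
    # clip to +/-2 sigma and map onto integer bins 0..n_bins-1
    clipped = np.clip(z, -2.0, 2.0)
    bins = ((clipped + 2.0) / 4.0 * (n_bins - 1)).round().astype(int)
    # one string per security, read down the time axis
    return ["".join(str(b) for b in bins[:, j]) for j in range(bins.shape[1])]

# toy 4-day window for 2 securities
window = np.array([[10.0, 20.0],
                   [10.5, 19.5],
                   [11.0, 21.0],
                   [11.5, 20.5]])
strings = code_prices(window)
```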

I'm looking for advice on step 4 above. The ranking will use a compression-based comparison function (NCD, per Ref. 1 in the algorithm). Perhaps a Python expert can tell me the best way to do this. Note that I need the new ordering relative to the original, by sid (i.e., rank the sids based on a similarity criterion).
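The NCD of Ref. 1 is typically computed from compressed lengths; here's a sketch using zlib as a stand-in compressor (the test strings are made up for illustration):

```python
import zlib

def ncd(x, y):
    """Normalized Compression Distance: near 0 for very similar strings,
    approaching 1 for unrelated ones. zlib stands in for the compressor."""
    cx = len(zlib.compress(x.encode()))
    cy = len(zlib.compress(y.encode()))
    cxy = len(zlib.compress((x + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

# similar coded strings compress well when concatenated, so NCD is lower
a = "1234512345" * 5
b = "1234512344" * 5
c = "9876598765" * 5
d_ab = ncd(a, b)  # nearly identical strings
d_ac = ncd(a, c)  # dissimilar strings
```

Ranking the sids then reduces to sorting them by this distance against some chosen reference (e.g. a coded market-average string).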

Generally, I suspect that the algorithm could be sped up with better coding...any ideas?

Also, sometimes I get the error:

There was a runtime error.  
ValueError: cannot convert float NaN to integer  
USER ALGORITHM:47, in handle_data  
X[j] = X[j] + str(int(coded_d[i,j]))  

Any idea if this is due to set_universe changing the list of sids, or to the batch transform's filling function not working (as I understand it, it should clean all of the NaNs due to missing trades)?

Grant

5 responses

Hello Grant,

The NaN issue can be reproduced like this:

bottom = 12.0  
range = 0.1  

and running the backtest from 01/01/12 to 03/31/12. The NaNs originate with the

get_data(data, context.stocks)

command. The data looks like this:

2012-03-06 PRINT context.stocks:  
2012-03-06 PRINT  
2012-03-06 PRINT [23497, 30666, 4218, 35998, 36929]  
2012-03-06 PRINT d:  
2012-03-06 PRINT  
2012-03-06 PRINT [[ 11.15   6.65   8.35    20.13   nan]  
 [ 11.65   6.7    8.5     20.93   nan]  
 [ 12.     6.68   8.7499  21.02   nan] ...

The NaNs then get propagated through z_d and coded_d.
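The propagation is easy to see in isolation; assuming NumPy-style z-scoring, a single NaN poisons the column's mean and standard deviation, so every z-score in that column comes out NaN:

```python
import numpy as np

# one NaN in the trailing window...
col = np.array([11.15, 11.65, np.nan])
# ...makes mean and std NaN, hence every z-score NaN
z = (col - col.mean()) / col.std()
```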

Regards,

Peter

Thanks Peter,

Looks like I need to check for NaNs. Gonna have to poke around a bit at other code using set_universe to see if anybody else has coded in a NaN check...should be straightforward.

Grant

I've made some headway on this algorithm (see attached), but still have problems:

  1. What is the best way to check for NaNs in the matrix returned by the batch transform?
  2. The program runs fine for range = 0.1 (a handful of securities), but when I change to range = 0.2, it seems to hang. Any idea what's going on? Am I running out of memory?
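For question 1, a common check (a sketch, assuming the batch transform returns a NumPy matrix `d` shaped like the one printed above) is a column-wise `np.isnan`:

```python
import numpy as np

# stand-in for the matrix the batch transform returns
d = np.array([[11.15, 6.65, np.nan],
              [11.65, 6.70, np.nan],
              [12.00, 6.68, np.nan]])

has_nan = np.isnan(d).any(axis=0)  # True for each column containing a NaN
clean = d[:, ~has_nan]             # keep only fully-populated columns
```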

Grant

Hello Grant,

I thought this might remove the NaNs:

@batch_transform(refresh_period=R_P, window_length=W_L)  # set globals R_P & W_L above  
def get_data(datapanel, sids):  
    # forward-fill gaps, then back-fill any NaNs left at the start of the window  
    datapanel['price'] = datapanel['price'].fillna(method='ffill').fillna(method='backfill')  
    return datapanel['price'].as_matrix(sids)  

But even so, I still get an error with:

set_universe(universe.DollarVolumeUniverse(90,90.3))  

I've not had the algo hang with 0.2 or 0.3 as the range.

Regards,

Peter

Thanks Peter,

I'll have to continue digging into the NaN problem. When my batch transform returns NaNs, I'm not sure that filling is the right approach, since the NaNs may mean that the security is not actually tradable at the time of the backtest. The best approach may be to exclude any security with NaN data (i.e., ignore columns with NaNs in the matrix returned by the batch transform).
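A minimal sketch of that exclusion, keeping the sid list aligned with the surviving columns (the matrix and sids mirror the printed example earlier in the thread):

```python
import numpy as np

sids = [23497, 30666, 4218, 35998, 36929]
d = np.array([[11.15, 6.65, 8.35, 20.13, np.nan],
              [11.65, 6.70, 8.50, 20.93, np.nan]])

# mask of columns with no NaNs anywhere in the trailing window
keep = ~np.isnan(d).any(axis=0)
d_clean = d[:, keep]                              # drop NaN columns
sids_clean = [s for s, k in zip(sids, keep) if k]  # drop matching sids
```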

Regarding the algorithm hanging, I think that the issue is with:

all_sid_X = list((itertools.permutations(unranked_data)))  

As the length of unranked_data grows, the number of permutations becomes unmanageable, since it is equal to N-factorial, where N is the length of the list to be permuted. I should have realized that this scaling would be problematic...duh! So much for the brute-force approach...
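One way around the blow-up (a sketch, not the thread's code): since the ranking only needs each sid's similarity to the others, computing the N*(N-1)/2 pairwise NCDs and sorting by mean distance scales quadratically rather than factorially. The coded strings below are made-up stand-ins:

```python
import itertools
import zlib

def ncd(x, y):
    """Normalized Compression Distance via zlib (illustrative compressor)."""
    cx, cy = len(zlib.compress(x.encode())), len(zlib.compress(y.encode()))
    cxy = len(zlib.compress((x + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

# hypothetical coded strings per sid (stand-ins for the algo's output)
strings = {23497: "1234512345", 30666: "1234512344", 4218: "9876598765"}
n = len(strings)

# one NCD per unordered pair: N*(N-1)/2 comparisons instead of N!
pair_d = {frozenset(p): ncd(strings[p[0]], strings[p[1]])
          for p in itertools.combinations(strings, 2)}

# rank each sid by its mean distance to the others; outliers sort last
mean_d = {s: sum(d for pair, d in pair_d.items() if s in pair) / (n - 1)
          for s in strings}
ranked = sorted(strings, key=mean_d.get)
```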

Grant