I recently watched the "live risk model tearsheet review" (I wish I could have made the actual webcast once I saw my tearsheet was picked!), and it got me thinking about how to split data into training and testing sets.
Personally, I'd like to test over the whole time frame if possible and build my training/testing sets by holding out along some dimension other than time. I've heard of others training on US data and then testing on foreign markets, but we don't have foreign data, so that isn't an option here.
Instead, I'm trying to use the CIK number from pipeline to build a custom filter. I thought a simple and unbiased rule would be that companies with even CIKs are my training set and companies with odd CIKs are my testing set. Alternatively, you could take the CIK modulo any number, which gives you several disjoint buckets, so that each revision of your hypothesis can be tested on a fresh bucket without multiple comparison bias.
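To make the rule concrete, here is a minimal sketch in plain Python, outside of pipeline. The helper names `assign_bucket` and `split_by_cik` are hypothetical, the CIK values are made up for illustration, and I'm assuming each CIK is a string of digits.

```python
def assign_bucket(cik, n_buckets=2):
    """Return the bucket index (0 .. n_buckets - 1) for a CIK given as an int or digit string."""
    return int(cik) % n_buckets


def split_by_cik(ciks, n_buckets=2, train_bucket=0):
    """Split CIKs into (train, test): train is one bucket, test is everything else."""
    train, test = [], []
    for cik in ciks:
        if assign_bucket(cik, n_buckets) == train_bucket:
            train.append(cik)
        else:
            test.append(cik)
    return train, test


# Made-up CIK values, purely for illustration (assumed to be digit strings).
example_ciks = ["0000000102", "0000000205", "0000000318", "0000000421"]
train, test = split_by_cik(example_ciks)
```

With `n_buckets` larger than 2, one bucket could be kept as the training set and the remaining buckets used one at a time, so each revision is checked against companies that were never looked at before.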
Two things. First, does this make sense? I couldn't find anything that says how CIKs are assigned, but the ones digit seems sufficiently random. Second, I'm having trouble constructing a filter from this. The CIK is provided as a string (I think), which is what's tripping me up. Right now I have:
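from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data import Fundamentals

class trainingset(CustomFactor):
    inputs = [Fundamentals.cik]
    window_length = 1

    def compute(self, today, assets, out, cik):
        cik = cik.astype(int)  # CIK seems to arrive as a string
        out[:] = cik % 2  # can be any number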
and in my pipeline construction I have:
cik_filter = (trainingset() == 0) # training is 0, test is 1
but I keep getting this error:
ValueError: Bin edges must be unique: array([ nan, nan, nan, nan, nan, nan])
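For reference, this is the string-to-integer step I'm trying to get working, sketched in plain Python outside of pipeline. The helper name `cik_parity` is hypothetical, and I'm assuming that CIKs arrive as digit strings and that missing values show up as None, NaN, or an empty string. I don't know whether missing values are actually what causes the error above.

```python
import math


def cik_parity(cik_values, modulus=2):
    """Map CIK strings to float labels; missing or non-numeric entries become NaN."""
    labels = []
    for value in cik_values:
        # Assumption: a valid CIK is a string made up only of digits.
        text = "" if value is None else str(value).strip()
        if text.isdigit():
            labels.append(float(int(text) % modulus))
        else:
            labels.append(math.nan)
    return labels


# Made-up inputs: two valid CIKs followed by three kinds of missing value.
labels = cik_parity(["0000000102", "0000000205", None, float("nan"), ""])
```

The point of returning NaN instead of casting everything with `astype(int)` is that assets without a CIK stay unlabeled and never get silently dropped into one of the two sets.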