CustomFactor calculation - managing NaN's when doing zfactor

Hi,
I'm writing a thesis in which I investigate the factors from the "quality minus junk" paper by Asness et al. I want to consider international equities, and I'm running into trouble because these datasets have a lot of missing fundamental data in FactSet.Fundamentals.
For instance, their profitability factor is the z-score of the sum of six measures, each itself z-scored:

Profitability = z(z_gpoa + z_roe  + z_roa + z_cfoa + z_gmar + z_acc)  

The problem is that if one of the "sub-factors" is NaN, the whole profitability factor becomes NaN, so I want to mitigate this by simply ignoring the NaN sub-factors. I've unsuccessfully tried to handle it inside the CustomFactor itself by calculating the z-score of each "sub-factor" with:

zscores = lambda x: (x - nanmean(x, axis=0)) / nanstd(x, axis=0)

But it doesn't run. So instead I thought I would solve it in the pipeline by applying .zscore() to each CustomFactor and finding a way to make it ignore the NaN entries... but so far I haven't managed to. Can anyone see a good solution?
I also tried setting the NaN values to zero after first calculating the z-score of each individual sub-factor, but maybe backfill/forward-fill is more appropriate there.
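To illustrate the propagation problem with made-up numbers: a plain sum of z-scored sub-factors turns NaN as soon as one of them is missing, while a nan-aware sum simply skips the missing entry.

import numpy as np

z_subfactors = np.array([0.5, -1.2, np.nan, 0.8, 0.1, -0.3])

print(z_subfactors.sum())       # nan: one missing sub-factor poisons the whole sum
print(np.nansum(z_subfactors))  # approx. -0.1: the missing sub-factor is ignored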

8 responses

[solved]

Managed to perform the z-score calculation inside the CustomFactor using np.nanmean and np.nanstd, and by setting NaNs to zero after creating the z-factor. That way the NaNs don't affect the factor's tail values and I can combine several z-scored factors.
There is still some strange behaviour though, e.g. for GPOA it returns only 0 values, even though the raw data is a mix of numbers and NaNs...
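For reference, a rough sketch of what that could look like (the class name and inputs are placeholders; point inputs at the relevant FactSet fundamentals column for each sub-factor):

import numpy as np
from quantopian.pipeline import CustomFactor

class NanZScore(CustomFactor):
    # Cross-sectional z-score that ignores NaNs in the mean/std and then
    # replaces any remaining NaNs with zero.
    window_length = 1
    # inputs = [...]  # placeholder: set to the sub-factor's FactSet column

    def compute(self, today, assets, out, values):
        latest = values[-1]  # latest value per asset
        z = (latest - np.nanmean(latest)) / np.nanstd(latest)
        out[:] = np.where(np.isnan(z), 0.0, z)  # missing data contributes 0

Several such sub-factors can then be summed in the pipeline, since the zero-filled entries no longer propagate NaN.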

Hi Terje,

I guess GMAR should be (REVT - COGS) / SALE instead of (REVT - COGS) / AT.

Best,
Sergii

Terje,

Would you please be so kind as to post your solution code?

Hi Nathan,
I found a way around it, though not using a CustomFactor. My solution was to run the pipeline without calculating z-scores in the CustomFactor and then calculate the z-scores afterwards. The problem then becomes z-score normalizing data that contains NaN or Inf values:
Pseudo code:
1. Extract columns with factor to z-score
2. Replace +/-Inf values with NaN
3. Replace NaN values with mean value for the given date
4. Apply 95% winsorization on each factor for each date
5. Z-score normalize each factor for each day

The reason for changing NaN/Inf values to the mean is to make them zero after normalization. This is not a perfect solution. The ideal way to do the whole processing would be to ignore NaN, then winsorize, then normalize, and then put the NaN entries back into the distribution as zero (or drop them). But in my case the factor is a weighting of several sub-factors, so by filling the missing sub-factors with the mean they are at least effectively ignored when I winsorize the combination of several sub-factors (and I can allow for some missing data).

Code sample:

import numpy as np
from scipy import stats

df2 = df[['ACC', 'GPOA', 'dGPOA', 'ROE', 'dROE', 'ROA', 'dROA', 'CFOA', 'dCFOA',
          'GMAR', 'dGMAR', 'AltZ', 'EVOL', 'LEV', 'BAB']].copy()
columns = list(df2)

# Replace +/-Inf with NaN, then fill each NaN with the cross-sectional
# mean for its date (level 0 of the pipeline's (date, asset) MultiIndex)
for x in columns:
    df2[x] = df2[x].replace([np.inf, -np.inf], np.nan)
    df2[x] = df2[x].groupby(level=0).transform(lambda s: s.fillna(s.mean()))

# Use clip to apply 95% winsorization per date, i.e. outliers are replaced
# by the 2.5th or 97.5th percentile value
for x in columns:
    df2[x] = df2[x].groupby(level=0).transform(
        lambda s: s.clip(s.quantile(0.025), s.quantile(0.975)))

# Z-score each factor cross-sectionally for each day
df2 = df2.groupby(level=0).transform(lambda s: stats.zscore(s))

You could also apply NumPy's nan_to_num function to winsorized, z-scored factors in a CustomFactor. By default it replaces NaN with 0 and +/-Inf with very large positive/negative finite values.
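For reference, the default behaviour on a small made-up array:

import numpy as np

x = np.array([1.5, np.nan, np.inf, -np.inf])
print(np.nan_to_num(x))
# NaN becomes 0.0; +/-Inf become the largest/smallest finite floats (about +/-1.8e308)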

@Joakim: That is a very good suggestion, I've never tried the nan_to_num method (I think)!
In general, just be careful when replacing NaN with a number, because it will shift the data, and in some cases that matters for your factor. E.g. if the true mean value is 10, replacing many NaNs with 0 will shift the mean "to the left". In some cases it may be best to replace with the mean value, and in other cases zero is appropriate (or good enough).
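A small made-up example of that shift:

import numpy as np

raw = np.array([8.0, 12.0, np.nan, 10.0, np.nan])

print(np.nanmean(raw))            # 10.0, the mean of the observed values
print(np.nan_to_num(raw).mean())  # 6.0, filling the NaNs with 0 drags the mean down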

Absolutely. I would only fill NaNs with 0 for factors that have been zscored first. If passing in the raw data, I believe you could fill with the mean or median value instead.

> Absolutely. I would only fill NaNs with 0 for factors that have been zscored first. If passing in the raw data, I believe you could fill with the mean or median value instead.

Because zscore replaces NaN values with zero?

Using np.nan_to_num and stats.zscore(), I should be able to have my NaNs end up as mean values, even though they are 0, since the z-score "centers" the data around 0, correct?
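A quick sanity check of that, with made-up numbers and a nan-aware z-score done by hand:

import numpy as np

raw = np.array([8.0, 12.0, np.nan, 10.0])

z = (raw - np.nanmean(raw)) / np.nanstd(raw)  # z-score ignoring the NaN
print(np.nan_to_num(z))  # roughly [-1.22, 1.22, 0.0, 0.0]: the filled entry sits exactly at the mean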