Unable to reconcile Information Analysis of Alphalens

Hi all,

I was conducting an analysis on a simple factor and tried running it through both alphalens.tears.create_factor_tear_sheet and alphalens.utils.get_clean_factor_and_forward_returns, providing the same input to both APIs.

It came as a surprise to me that the results generated were vastly different. To be clear, I did set alphalens.utils.get_clean_factor_and_forward_returns to group by sector, which may explain some of the variation. I am not sure if this is a code error or whether I have stumbled upon something else. Can someone assist?

9 responses

Anthony,
The only difference I can see right away is that the actual quantile split itself is different.
The Old Quantile 1 count (163301) differs from the New Quantile 1 count (163327):

This might account for the difference, which will be most prevalent in the Period 1 stats.
It might have to do with the difference between using ".qcut()" versus ".cut()" in the new versions of alphalens... not sure,
but I ran into an "edges-are-the-same" error lately when using alphalens.

Also, I added the group parameters to the old create_factor_tear_sheet, just in case it mattered, which it didn't seem to.

Old:

alphalens.tears.create_factor_tear_sheet(factor=result['factor'],
                                         prices=pricing,
                                         groupby=result["Sector"],
                                         show_groupby_plots=False,
                                         periods=(1, 5, 10, 30),
                                         quantiles=2,
                                         bins=None,
                                         filter_zscore=10,
                                         groupby_labels=MORNINGSTAR_SECTOR_CODES,
                                         long_short=True,
                                         avgretplot=(5, 15),
                                         turnover_for_all_periods=False)

Quantiles Statistics

factor_quantile    min     max     mean         std         count   count %
1                  1.0     679.0   337.466439   194.728263  163301  50.016999
2                  670.0   1358.0  1012.172327  194.849422  163190  49.983001

New:

factor_data = alphalens.utils.get_clean_factor_and_forward_returns(result['factor'],
                                                                   pricing,
                                                                   groupby=result["Sector"],
                                                                   quantiles=2,
                                                                   periods=(1, 5, 10, 30),
                                                                   groupby_labels=MORNINGSTAR_SECTOR_CODES)
alphalens.tears.create_full_tear_sheet(factor_data,
                                       group_adjust=True,
                                       by_group=True)


Quantiles Statistics

factor_quantile    min     max     mean         std         count   count %
1                  1.0     679.0   337.460983   194.725785  163327  50.01669
2                  670.0   1358.0  1012.176568  194.847478  163218  49.98331

alan

@Anthony the difference is due to group_adjust=True in the second one. From the documentation:

group_adjust : bool
Demean forward returns by group before computing IC.

@Alan, the "edges-are-the-same" error comes from pandas.qcut (the function that calculates the quantiles), which cannot cope when identical factor values span more than one quantile (e.g. if you have too many 0s). A workaround is to use custom quantile ranges that group the identical values together (e.g. quantiles=[0, .10, .30, .70, .90, 1.]), or to use bins=5 and quantiles=None (internally pandas.cut will be used). pandas.cut chooses bins that are evenly spaced according to the values themselves, while quantiles=5 chooses the quantiles so that you have the same number of records in each one.
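A minimal pandas sketch of the failure mode and the two workarounds described above (toy data, not alphalens internals; the repeated-zeros series is hypothetical):

```python
import pandas as pd

# A factor with many repeated values: 60 zeros followed by 1..40.
factor = pd.Series([0.0] * 60 + list(range(1, 41)))

# qcut with 5 equal-count quantiles fails: several quantile edges
# land on the same value (0.0), so the bin edges are not unique.
try:
    pd.qcut(factor, 5)
except ValueError as err:
    print("qcut failed:", err)

# Workaround 1: custom quantile ranges that keep the repeated values
# together in a single bucket.
custom = pd.qcut(factor, [0, 0.70, 0.90, 1.0])
print(custom.value_counts())

# Workaround 2: evenly spaced *value* bins (what pandas.cut does);
# bin widths are equal, but counts per bin are not.
binned = pd.cut(factor, 5)
print(binned.value_counts())
```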

@Luca - thanks for the insights on "edges-are-the-same" error!

@Luca - If you are demeaning the forward returns by sector (group_adjust=True),
what do the numbers in the IC block of alphalens represent?
Shouldn't it be a vector of numbers, one for each sector,
where the Ann-IR is the information ratio with the sector as benchmark?

@Alan, imagine your factor is very good at predicting the relative performance of each stock compared to the other ones in the same sector, but very noisy when comparing stocks belonging to different sectors. In this scenario the IC analysis won't show any predictive ability, because IC is the correlation between the factor values and the forward returns of the full universe of stocks. What you actually want to see is the performance relative to the same sector.

If you demean the forward returns by sector, you are actually removing the average sector performance from each stock. Because you do this operation for each sector, you are left with positive returns for stocks that outperformed their sector average and negative returns for stocks that underperformed it. At this point you only need one IC analysis, because after the "sector neutralization" the factor's top quantile should match a positive return and the bottom quantile a negative return, independently of the sector.
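Here is a toy pandas sketch of that effect (hypothetical sectors, factor values, and returns; a Spearman rank correlation stands in for alphalens' IC, and groupby/transform stands in for what group_adjust=True does conceptually):

```python
import pandas as pd

# Hypothetical one-day cross-section: the factor ranks stocks correctly
# *within* each sector, but Tech as a whole sells off while Energy
# rallies, swamping the within-sector signal.
df = pd.DataFrame({
    "sector":  ["Tech", "Tech", "Tech", "Energy", "Energy", "Energy"],
    "factor":  [0.9, 0.8, 0.7, 0.3, 0.2, 0.1],
    "fwd_ret": [-0.020, -0.030, -0.037, 0.055, 0.051, 0.044],
})

# IC on raw returns: dominated by the sector moves.
raw_ic = df["factor"].corr(df["fwd_ret"], method="spearman")

# Remove each sector's average return, then recompute the IC.
demeaned = df["fwd_ret"] - df.groupby("sector")["fwd_ret"].transform("mean")
adj_ic = df["factor"].corr(demeaned, method="spearman")

print(f"raw IC: {raw_ic:.2f}, sector-demeaned IC: {adj_ic:.2f}")
# The raw IC comes out negative (sector moves dominate) while the
# demeaned IC is positive, reflecting the within-sector predictive power.
```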

As a side note, return demeaning is also used when you specify long_short=True (the default behaviour) to remove the market effect. Without the demeaning it is difficult to see a high correlation (IC), because the forward returns are influenced by the market. So long_short=True forces the IC to be calculated on relative returns (stock returns relative to other stocks, not absolute values), independently of the market performance.
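In pandas terms, the market-level demeaning behind long_short=True amounts to subtracting the universe mean (toy numbers, not alphalens code):

```python
import pandas as pd

# Hypothetical one-day returns for a four-stock universe.
returns = pd.Series([0.05, 0.03, 0.02, -0.01], index=["A", "B", "C", "D"])

# Subtracting the universe mean leaves only relative performance:
# positive for stocks that beat the market, negative for laggards.
relative = returns - returns.mean()   # universe mean is 0.0225
print(relative)
```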

Thanks @Luca and @Alan for trying to help. Whilst I understand the intention of demeaning, what I struggled with is that the results became significant only after applying it. In fact, for all the factors (half a dozen) that I tried, the results became significant using the second method (alphalens.tears.create_full_tear_sheet(factor_data, group_adjust=True, by_group=True)). That defies logic, hence my confusion. Do check out the notebook, change some factor inputs, and try it out. I think demeaning alone should not make that much of a difference.

As surprising as it might sound, demeaning alone has a huge impact on the IC analysis. If you set
group_adjust=False, you can see that the first and second tear sheets become identical.

Thanks @Luca. Much appreciate it. This solves the puzzle for me.

@Luca
Hi,

I was trying to find out what demean does in alphalens and came across your answer.
Can you explain how demeaning relates to long/short portfolio performance?

Thanks.