The clue here is the error message "Dropped 99.2% entries from factor data: 3.2% in forward returns computation and 96.0% in binning phase". 96% of the values were dropped in the 'binning phase'.
What does that mean? The default get_clean_factor_and_forward_returns
behavior is to attempt to create 5 quantile buckets (or bins) each day, each holding an equal number of values. One common problem arises when a factor produces many identical values. In this case, there are a lot of zeros. The binning places all the zeros into a single bin, but since there are so many zeros, there aren't enough other values to create equal-sized bins. The binning fails and generates an error message like the one above.
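To make that failure mode concrete, here is a small sketch using pandas directly (not Alphalens itself, though equal-quantity binning is the same idea): when most values are identical, the quantile edges collapse into duplicates and the binning raises an error.

```python
import pandas as pd

# A toy factor series where 96% of the values are zero,
# similar to the situation described above
factor = pd.Series([0.0] * 96 + [0.1, 0.2, 0.3, 0.4])

# Equal-quantity binning (the idea behind quantiles=5): pandas' qcut
# computes quantile edges, but with this many ties the edges are all
# zero and the call raises a ValueError about non-unique bin edges
try:
    pd.qcut(factor, 5)
except ValueError as err:
    print("qcut failed:", err)
```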
So, one of the first things I typically do when using Alphalens, and when analyzing factors in general, is to look at the distribution of factor values. First, ensure the data is meaningful. The BIGGEST issue with factors is often having a lot of meaningless data which obscures the good data. In this case there are a lot of zeros which don't add much information. Second, ensure the data is reasonably well distributed. That's a bit arbitrary, but if values are 'bunched up' with a few stragglers at the extremes, maybe one would want to filter out the extremes. A couple of methods I use for a quick look at a factor are these:
# Plot a histogram of the factor values - looking for a somewhat uniform distribution
factor_data.my_factor.hist()
# List the quantity of each value - ensure there aren't too many of a single value
factor_data.my_factor.value_counts()
Another approach is to not use the Alphalens defaults of quantiles=5, bins=None. Turn it around and use quantiles=None, bins=5. That will try to make 'equal width' bins rather than 'equal quantity' bins. The binning won't typically fail this way. When doing this, take note of the bins it produces and perhaps filter based upon those bins. The get_clean_factor_and_forward_returns call would look like this:
merged_data = get_clean_factor_and_forward_returns(
    factor=factor_data.my_factor,
    prices=pricing_data,
    periods=range(1, 252, 10),
    bins=5,
    quantiles=None,
)
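As a sketch of what equal-width binning does with this sort of data (using pandas' cut directly rather than Alphalens), note how it succeeds where equal-quantity binning would fail, and how you can inspect the bins produced:

```python
import pandas as pd

# Toy stand-in for one day's factor values, mostly zeros
day_factor = pd.Series([0.0] * 8 + [0.5, 1.0, 1.5, 2.0])

# Equal-width bins: the value range is split into 5 evenly spaced
# intervals, so binning succeeds even with many tied values
binned = pd.cut(day_factor, 5)

# Inspect the bins produced - most rows land in the lowest interval
print(binned.value_counts().sort_index())
```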
So, in this particular case, either drop all the zeros or use fixed-width bins, and you'll eliminate the error. Dropping all the zeros can be done by filtering in the original pipeline or afterward by dropping those rows.
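Dropping the zeros after the fact can be as simple as a boolean filter on the factor series (a sketch; the hypothetical my_factor below stands in for your actual factor column):

```python
import pandas as pd

# Hypothetical factor column standing in for factor_data.my_factor
my_factor = pd.Series([0.0, 1.5, 0.0, -0.7, 0.0, 2.1])

# Keep only the informative nonzero rows; the same filter could
# instead be applied in the original pipeline
nonzero_factor = my_factor[my_factor != 0]
print(nonzero_factor)
```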
I've made some of these changes to the notebook to see how it works.