@James Hoang I'll try to answer your questions.
First, you asked: "Why did you choose window_length 200 in the second custom factor 'Consecutive_Days_Above_Zero' - is that just an arbitrary max length if its all above 0?" Yes, that was just an arbitrary lookback length. If the maximum number of consecutive days above zero you ever want to consider is smaller, you could shorten it.
The second question "why the original method failed" is a little more involved.
It is correct that the compute function in a custom factor is executed each day. Let's walk through an example of how that works.
# Assume the pipeline is run from 2020-02-04 thru 2020-02-07
# The following code would be executed 4 times with new data each day
sma50cf = np.nanmean(last_close_price[0:-1], axis=0)
# sma50cf would have the following values (using pseudocode) each day pipeline is run
2020-02-04 mean(2020-2-3 - 50 days : 2020-2-3)
2020-02-05 mean(2020-2-4 - 50 days : 2020-2-4)
2020-02-06 mean(2020-2-5 - 50 days : 2020-2-5)
2020-02-07 mean(2020-2-6 - 50 days : 2020-2-6)
Every day the pipeline runs, it fetches a new window of data. Also note that the pipeline runs before the market opens each day, so it fetches data through the previous day's close. That is why the dates above are shifted by one day. Within any single day's run, the value of 'sma50cf' is static: it's the mean of the last 50 days of close prices as of the pipeline date. The next day, it will be the mean of the last 50 days as of that next pipeline date.
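The daily re-windowing can be simulated outside of pipeline with plain numpy. This is just an illustrative sketch (the price data and the 4-day loop are made up, not from the original factor): each simulated "pipeline day" slices a fresh trailing window and produces one static mean per asset.

```python
import numpy as np

# Hypothetical history: 60 days of close prices (rows) for 3 assets (columns).
rng = np.random.default_rng(0)
prices = 100 + rng.standard_normal((60, 3)).cumsum(axis=0)

window_length = 50
daily_means = []
# Simulate the pipeline running on 4 consecutive days. Each day the
# compute function receives a fresh trailing window of window_length rows.
for day in range(4):
    last_close_price = prices[day : day + window_length]  # that day's window
    sma50cf = np.nanmean(last_close_price, axis=0)        # one value per asset
    daily_means.append(sma50cf)

# Each day's result is 1D: a single, static mean per asset for that date.
assert all(m.shape == (3,) for m in daily_means)
```

The key point the sketch shows: within one day's call, `sma50cf` is a single 1D value per asset; only across successive days does it "move".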
Now, let's look at the code for checking whether a day's close is above the 50-day SMA.
# Again, use the dates above. Let's look at the results of the following code
bad_day = last_close_price < sma50cf
# The 'last_close_price' is a 2D numpy array.
# The number of rows is specified by window_length (50 in this case)
# The number of columns is equal to the number of assets
# 'sma50cf' is a 1D numpy array.
# It holds the mean of the 50 days of prices,
# one value for each asset
# 'bad_day' will also be a 2D numpy array like 'last_close_price'
# What do the rows of 'bad_day' look like?
bad_day[2020-2-3] = last_close_price[2020-2-3] < mean(2020-2-3 - 50 days : 2020-2-3)
bad_day[2020-1-31] = last_close_price[2020-1-31] < mean(2020-2-3 - 50 days : 2020-2-3)
bad_day[2020-1-30] = last_close_price[2020-1-30] < mean(2020-2-3 - 50 days : 2020-2-3)
bad_day[2020-1-29] = last_close_price[2020-1-29] < mean(2020-2-3 - 50 days : 2020-2-3)
etc...
The dates in the rows above are what the compute function sees when the pipeline is executed on 2020-2-4. The first comparison makes sense: it compares the close on 2020-2-3 to mean(2020-2-3 - 50 days : 2020-2-3). However, the other comparisons do not. They compare each earlier close to that same, most recent, 50-day mean. That can be a valid thing to do, but I'm assuming it's not what was intended. I'm assuming what was intended is the following
# 'bad_day' should look like this
bad_day[2020-2-3] = last_close_price[2020-2-3] < mean(2020-2-3 - 50 days : 2020-2-3)
bad_day[2020-1-31] = last_close_price[2020-1-31] < mean(2020-1-31 - 50 days : 2020-1-31)
bad_day[2020-1-30] = last_close_price[2020-1-30] < mean(2020-1-30 - 50 days : 2020-1-30)
bad_day[2020-1-29] = last_close_price[2020-1-29] < mean(2020-1-29 - 50 days : 2020-1-29)
etc...
Notice the mean values 'roll' depending upon which row of 'bad_day' is being evaluated. Each run of the pipeline does pass a new window of data to the compute function. However, while that function is executing, the data is fixed for that day.
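As a standalone sanity check (outside of pipeline, with the full history already in hand), the intended "rolled" means can be reproduced with numpy's sliding_window_view (available in numpy 1.20+; earlier versions lack it). The data and variable names here are illustrative, and note the catch at the end:

```python
import numpy as np

w = 50
rng = np.random.default_rng(1)
# Hypothetical history: 60 days of closes for 2 assets (60 = w + 10 extra days).
prices = 100 + rng.standard_normal((60, 2)).cumsum(axis=0)

# Row i of rolling_means is the mean of prices[i : i + w],
# i.e. the w-day mean ENDING on day i + w - 1.
windows = np.lib.stride_tricks.sliding_window_view(prices, w, axis=0)
rolling_means = windows.mean(axis=-1)        # shape (11, 2)

# Each close compared to the mean of the w days ending on that close:
bad_day = prices[w - 1:] < rolling_means     # shape (11, 2)
```

The catch: to do this inside a single compute call you'd need a longer window (window_length = 50 plus however many comparison days you want), since a 50-row window doesn't contain enough history to build the 50-day means for the earlier rows. That extra bookkeeping is part of what the two-factor approach below sidesteps.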
There isn't a good 'roll' function in numpy, which is why I didn't pursue implementing the above 'roll' inside the compute method. However, there is no need to 'roll' the data explicitly. The approach I suggested gets around that. It uses the fact that the pipeline effectively 'rolls' a new window of data and presents it to the compute function each day. The approach creates a factor which outputs the last (most recent) value each day. Then, by passing that factor to a second 'Consecutive_Days_Above_Zero' factor, the pipeline mechanism does the 'roll' work for us. The pipeline calculates the value of our factor each day and presents these 'rolled' values to the 'Consecutive_Days_Above_Zero' factor.
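The counting step of that second factor can be sketched in plain numpy. This is a standalone illustration of the logic, not the actual pipeline CustomFactor code: the function name and the toy daily outputs are mine, and `values` stands in for the 2D (days x assets) window of first-factor outputs that pipeline would hand to compute.

```python
import numpy as np

def consecutive_days_above_zero(values):
    """Count, per asset (column), how many consecutive most-recent
    rows of `values` are above zero."""
    above = values > 0
    counts = np.zeros(values.shape[1], dtype=int)
    running = np.ones(values.shape[1], dtype=bool)
    # Walk backward from the most recent row; per asset, stop counting
    # at the first row that is not above zero.
    for row in above[::-1]:
        running &= row
        counts += running
    return counts

# Hypothetical daily 'close minus SMA' outputs (oldest first), two assets:
daily_outputs = np.array([
    [ 1.0, -0.5],
    [-2.0,  0.3],
    [ 0.7,  0.1],
    [ 1.2,  0.4],
])
print(consecutive_days_above_zero(daily_outputs))  # [2 3]
```

Asset one's streak is 2 (the -2.0 three rows back breaks it); asset two's is 3 (broken only by the oldest -0.5).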
Hope that makes sense?