Stefan,
First, about your concern that the pipeline 'generates an outlook array multiple times in a single minute which is redundant': what you are seeing is the asynchronous pre-fetch optimization at work. That's normal, and the pipeline isn't doing any more work than it needs to. It is designed to pre-fetch and pre-compute data in batches for efficiency.
Specifically,
When you run a Pipeline that uses a 90-day trailing window, we don't
just query the last 90 days every day, since that would be glacially
slow because you'd spend almost all of your backtest time doing
disk/network IO. Instead, what we do is load all the data we think
we'll need to run your pipeline for the next ~252 days (the number of
trading days in a year). We use that cache to pre-compute your entire
pipeline for a year and then feed the results to your backtest on the
dates when it would become available.
Take a look at this post: https://www.quantopian.com/posts/introducing-the-pipeline-api. Also, if you really want to get into the details, the pipeline thinks in terms of a 'directed acyclic graph' to optimize the computations (see https://en.wikipedia.org/wiki/Directed_acyclic_graph).
To see firsthand how this works, I've attached your algorithm with one addition: a counter that increments each time the compute function in your custom factor is called. The counter is printed before trading each day, when pipeline_output is called. The backtest covers a single year (1/4/2011 - 1/4/2012), or 253 trading days. Look at the logs and you will see:
2011-01-04 07:45 PRINT iterations through output_calc: 6
2011-01-05 07:45 PRINT iterations through output_calc: 6
2011-01-06 07:45 PRINT iterations through output_calc: 6
2011-01-07 07:45 PRINT iterations through output_calc: 6
2011-01-10 07:45 PRINT iterations through output_calc: 6
2011-01-11 07:45 PRINT iterations through output_calc: 6
2011-01-12 07:45 PRINT iterations through output_calc: 133
2011-01-13 07:45 PRINT iterations through output_calc: 133
...
2012-01-03 07:45 PRINT iterations through output_calc: 253
2012-01-04 07:45 PRINT iterations through output_calc: 253
The first 6 days' worth of calculations were done on day one. By day 7 the pipeline had calculated 133 days' worth of data. It ends up going through the compute function 253 times in total, which is exactly the number of calls needed to get the results for 1/4/2011 - 1/4/2012: one per trading day, with nothing computed twice.
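If it helps to picture it, here is a small plain-Python sketch (not Quantopian code) that reproduces the counter values in the log. The function name and the chunk sizes are my own illustration: 6 and 127 are read off the log, and the final chunk of 120 is an assumption, since the log only shows that the total reaches 253.

```python
def counter_seen_each_day(chunk_sizes):
    """Return the compute-call counter printed before each trading day.

    chunk_sizes is a hypothetical list of how many trading days the
    pipeline pre-computes in each batch.
    """
    seen = []
    total_calls = 0
    for size in chunk_sizes:
        # The whole chunk is computed up front, on the first day it is needed,
        # so the counter jumps by the chunk size all at once.
        total_calls += size
        # Every day served from this chunk then reports the same running total.
        seen.extend([total_calls] * size)
    return seen


# Chunk sizes inferred from the log: 6 days, then 127 more (133 total),
# then an assumed final chunk covering the remaining 120 days.
log = counter_seen_each_day([6, 127, 120])

# Counter on day 1, day 6, day 7, and the last day of the backtest.
print(log[0], log[5], log[6], log[-1])
```

The counter only jumps when a new chunk is loaded, and the final total equals the number of trading days, so no day is computed more than once.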
Also, about the memory errors: one thing you can do to reduce memory use is to cut down the number of factors and filters. I commented out the following
'''
#standard filters
primary_share = IsPrimaryShare()
common_stock = morningstar.share_class_reference.security_type.latest.eq('ST00000001')
not_depositary = ~morningstar.share_class_reference.is_depositary_receipt.latest
not_otc = ~morningstar.share_class_reference.exchange_id.latest.startswith('OTC')
not_wi = ~morningstar.share_class_reference.symbol.latest.endswith('.WI')
not_lp_name = ~morningstar.company_reference.standard_name.latest.matches('.* L[. ]?P.?$')
not_lp_balance_sheet = morningstar.balance_sheet.limited_partnership.latest.isnull()
have_market_cap = morningstar.valuation.market_cap.latest.notnull()
'''
and it runs now. Those filters are redundant because they are already included in the Q500US filter. Another way to reduce memory use is to reduce the window_length of your factors. In your case you may need just 252 days, which is a typical trading year. The window_length is in trading days, not calendar days, so 365 of them reach back roughly 17 months rather than one year. See https://en.wikipedia.org/wiki/Trading_day for more info.
Also, I changed
out[:] = (
    (((net_income[-1] - net_income[-365]) > 0).astype(int)*0.5)
    + (((total_revenue[-1] - total_revenue[-365]) > 0).astype(int)*0.5)
)
to
out[:] = (
    (((net_income[-1] - net_income[0]) > 0).astype(int)*0.5)
    + (((total_revenue[-1] - total_revenue[0]) > 0).astype(int)*0.5)
)
Notice the index of 0 (not -365). This way you can set the window_length when instantiating the factor and the compute logic can remain the same. It always compares the first (0) and last (-1) rows of the window, whatever its length.
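To check the indexing outside of a backtest, here is a minimal NumPy sketch of the same comparison. The function name growth_score and the numbers are made up for illustration. It assumes the inputs look like what compute receives: arrays of shape (window_length, number of assets), with the oldest row first.

```python
import numpy as np


def growth_score(net_income, total_revenue):
    """Score 0.5 for each metric that is higher on the last row than the first.

    Both inputs have shape (window_length, n_assets), oldest row first.
    """
    out = np.empty(net_income.shape[1])
    out[:] = (
        ((net_income[-1] - net_income[0]) > 0).astype(int) * 0.5
        + ((total_revenue[-1] - total_revenue[0]) > 0).astype(int) * 0.5
    )
    return out


# Made-up data: 3 rows (a window_length of 3) for 3 assets.
net_income = np.array([[1.0, 5.0, 2.0],
                       [2.0, 4.0, 2.0],
                       [3.0, 3.0, 1.0]])
total_revenue = np.array([[10.0, 10.0, 10.0],
                          [11.0, 12.0, 9.0],
                          [12.0, 11.0, 8.0]])

# Full window: compares row 0 against row -1.
scores = growth_score(net_income, total_revenue)

# Shorter window (last 2 rows only): same code, no index changes needed.
short_scores = growth_score(net_income[-2:], total_revenue[-2:])

print(scores.tolist(), short_scores.tolist())
```

With this data the full window gives scores of 1.0, 0.5 and 0.0, and the two-row window gives 1.0, 0.0 and 0.0. The function body is identical in both cases; only the length of the input changes, which is what window_length controls in the real factor.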