Forward filling nans in pipeline custom factors

Blue Seahawk

edited

Tools and Tips Pipeline

[Edit 7/2019, there's an issue with this, see messages below]

A way to forward fill NaNs in Pipeline custom factors, adapted from a Stack Overflow answer.

Example:

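import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data import Fundamentals

class Quality(CustomFactor):
    inputs = [Fundamentals.total_revenue]
    window_length = 24
    def compute(self, today, assets, out, total_revenue):
        # total_revenue has shape (window_length, number of assets)
        total_revenue = nanfill(total_revenue)
        # out holds one value per asset, so take the latest row
        out[:] = total_revenue[-1]

def nanfill(arr):
    mask = np.isnan(arr)
    # column index where a value is present, 0 where it is NaN
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    # running maximum along axis=1, i.e. within each row
    np.maximum.accumulate(idx, axis=1, out=idx)
    arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]
    return arr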

16 responses

Grant Kiehne

Why would Fundamentals.total_revenue require forward filling? Are there companies for which total_revenue is not reported by the company? Or are these errors in the Fundamentals database?

Blue Seahawk

Fundamental reporting is not consistent across all companies.

This version also counts the NaNs and, if there are any, logs the count along with the array before and after filling:

def nanfill(arr):  
    nan_num = np.count_nonzero(np.isnan(arr))  
    if nan_num:  
        log.info(nan_num)  
        log.info(str(arr))  
    mask = np.isnan(arr)  
    idx  = np.where(~mask,np.arange(mask.shape[1]),0)  
    np.maximum.accumulate(idx,axis=1, out=idx)  
    arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]  
    if nan_num:  
        log.info(str(arr))  
    return arr

In my experience, backtests with NaNs forward filled this way have shown somewhat improved performance.
Try it, for example, with the factors in the Notebook at https://www.quantopian.com/posts/faster-fundamental-data

Evans Head

Very helpful, thank you

Stefan

@Blue Seahawk:
Your implementation forward fills values from one row to the next (one stock to the next), not from one timestamp to the next. The unfortunate outcome is that you are forward filling values from other stocks with the same timestamp, not from the same stock at earlier timestamps.

I am also looking for an implementation but could not find any yet... Does anybody have a correct implementation?
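
To make the point above concrete, here is a small sketch that runs the `nanfill()` from the first post on a toy array laid out the way `compute` receives its inputs, with rows as days and columns as assets. The array values and the name `window` are made up for illustration. The NaN for the second asset on the last day is replaced with the first asset's value from the same day (3.0), not with the second asset's own earlier value (20.0).

```python
import numpy as np

def nanfill(arr):
    # Copied from the first post: the fill runs along axis=1.
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]
    return arr

# Hypothetical 3-day window for 3 assets: rows are days, columns are assets.
window = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, np.nan, 300.0],   # asset 1 has no value on the last day
])

filled = nanfill(window.copy())
print(filled[-1])   # the middle entry is now 3.0, taken from asset 0
```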

Blue Seahawk

I've been using it and always thought it was right. Can you post an algo to demonstrate? If so, try axis=0 and see what happens then eh? Thanks.

Stefan

Hi,
I have made a quick pipeline calculating Capex to Cashflows. One time without forward fill, one time with your forward fill implementation. If you scroll through the selections you will see that most of the time the nan value will be replaced with the value above it from another stock...unfortunately.

Or do you see anything wrong in the implementation?

When I try switching to axis=0, all null values suddenly have the same value, so this might not be the solution either.

Blue Seahawk

Thanks for the content to work on. I see flaws in your usage, but that doesn't mean you aren't onto something worth looking into. I think nanfill() normally behaves OK, but my hunch is that it could be improved for extreme cases. I'll explain at the end.

Quite a few points to make:
1. In the custom factor, feed it a good window_length so it has something to forward fill from, rather than 1 for window_length.
2. Inputs to the custom factors shouldn't be .latest. Instead like below.
3. Do nanfill() first thing in compute, before any calculations.
4. Then after nanfill(), can do the calculations using [-1] for example, or [-3:].mean() or whatever, and if the window length is long enough to go back far enough to the point where there is a value, then all nans beyond it to the end should become that value.

Something along this line ...

    class Capex_To_Cashflows(CustomFactor):
        inputs=[cfs.capital_expenditure, cfs.free_cash_flow]
        window_length=10
        def compute(self, today, assets, out, capital_expenditure, free_cash_flow):
            out[:] = (capital_expenditure[-1] * 4.) / (free_cash_flow[-1] * 4.)

    class Capex_To_Cashflows_forwardfill(CustomFactor):
        inputs=[cfs.capital_expenditure, cfs.free_cash_flow]
        window_length=10
        def compute(self, today, assets, out, capital_expenditure, free_cash_flow):
            capital_expenditure = nanfill(capital_expenditure)
            free_cash_flow      = nanfill(free_cash_flow)
            out[:] = (capital_expenditure[-1] * 4.) / (free_cash_flow[-1] * 4.)
            #out[:] = nanfill((capital_expenditure * 4.) / (free_cash_flow * 4.))

With that change made, the default window_length in both factors is 10. In this case, though, I'm overriding it when the factors are called, with window_length set to 120. The outputs seemed reasonable the first time I ran that, all different. But there's a fly in the ointment. When I had tried window_length=30, two stocks (BCE and BNS) had, not the value of the stock before them, but the same value as each other, although they are separated from each other in the output. It is as if, when there is no value to forward fill with, it has some information saved and just uses that instead, in mask or something, I don't know. Then alongside the fly there's an elephant. On rerunning with 120, it was back to the same problem as with 30. Odd. This is a job for a Dan Whitnable or something.

Stefan

Thanks for your explanation, that makes it a bit clearer for me. But you know what I find interesting? When I clone your posted notebook and run all cells with your updated code, I don't get the same forward-filled values you got. I again get the value from the stock in the line above...

Funny that we don't get the same results with the same code.

Blue Seahawk

Right, there's inconsistency going on, holy smokes. And on the main issue ...

Yep, your point is valid, glad you took the time to bring that to light.

I can see from your imports Stefan that it must have taken some effort to trim back for the example provided, thanks again, just that I had a bit of trouble wrapping my head around its complexity even so.

Here it is, simplified further, in case the IDE debugger might help someone looking into this.

The output shows that even with a large window length the problem is sometimes there.

                       Capex_To_Cashflows  Capex_To_Cashflows_filled  
    Equity(755 [BC])             0.525970                   0.525970  
    Equity(766 [BCE])                 NaN                   0.525970  
    Equity(794 [BDX])           -0.332198                  -0.332198  
    2019-07-03 05:45 before_trading_start:46 INFO .  
                        Capex_To_Cashflows  Capex_To_Cashflows_filled  
    Equity(980 [BMY])            -0.172007                  -0.172007  
    Equity(1010 [BNS])                 NaN                  -0.172007  
    Equity(1023 [BOH])           -2.092683                  -2.092683  
    2019-07-03 05:45 before_trading_start:46 INFO .  
                         Capex_To_Cashflows  Capex_To_Cashflows_filled  
    Equity(1385 [CDNS])            -0.08978                  -0.089780  
    Equity(1402 [CEF])                  NaN                   3.365413  
    Equity(1419 [CERN])            -1.77185                  -1.771850  
    2019-07-03 05:45 before_trading_start:46 INFO .  
                           Capex_To_Cashflows  Capex_To_Cashflows_filled  
    Equity(1637 [CMCS_A])           -0.574695                  -0.574695  
    Equity(1655 [CMO])                    NaN                 -55.563744  
    Equity(1665 [CMS])              -4.660550                  -4.660550  
    2019-07-03 05:45 before_trading_start:46 INFO .  
                        Capex_To_Cashflows  Capex_To_Cashflows_filled  
    Equity(1789 [COT])           10.440000                  10.440000  
    Equity(1792 [CP])                  NaN                  10.440000  
    Equity(1795 [CPB])           -0.336283                  -0.336283

    [ ............... ]

It was something I had copied from stackoverflow.

Blue Seahawk

[Edit: Their challenge would be, going back in time for each nan stock individually until a latest value is found].

Can Q provide us with a forward fill option for custom factors?
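
The per-stock lookback described in the edit above can be sketched by running the same index trick down the time axis (axis=0) instead of across assets. This is a minimal sketch, not something tested on the platform. The name `nanfill_by_column` is made up here, and it assumes the input is a 2-D float array with rows as days and columns as assets, as in `compute`.

```python
import numpy as np

def nanfill_by_column(arr):
    """Forward fill NaNs down each column (hypothetical helper)."""
    mask = np.isnan(arr)
    # Row index where a value is present, 0 where it is NaN.
    rows = np.arange(arr.shape[0])[:, None]
    idx = np.where(~mask, rows, 0)
    # Running maximum down each column: the row of the latest non-NaN so far.
    np.maximum.accumulate(idx, axis=0, out=idx)
    # For every cell, pick the value from that row of the same column.
    return arr[idx, np.arange(arr.shape[1])]

# Made-up window: rows are days, columns are assets.
window = np.array([
    [np.nan, 10.0, 100.0],
    [2.0, np.nan, 200.0],
    [np.nan, np.nan, 300.0],
])
filled = nanfill_by_column(window)
print(filled[-1])   # latest row: each asset carries its own last known value
```

In a custom factor this would be called at the top of `compute`, with `[-1]` taken afterwards. A stock with no value before a given day in the window still comes out as NaN there, as in the first row of the first column.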

Stefan

Yes, that would be great. Just to explain to Q why forward filling is an essential tool to have:

If you are doing machine learning on a time series, you need the NaN values changed into numbers. For data that is NOT a time series, the mean, mode or similar would typically suffice, but time series are heavily autocorrelated. Because of this autocorrelation, the only logical choice for replacing nulls is the last known value, which is why forward filling is essential for machine learning.
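
A small illustration of that argument, using pandas on a made-up series (the numbers and the name `revenue` are hypothetical). Filling the gap with the mean puts 30.0 between 40.0 and 50.0, and that mean also uses the later value 50.0, which would not have been known at the time. Forward filling carries the last known value, 40.0.

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly values for one stock, with one missing report.
revenue = pd.Series([10.0, 20.0, 30.0, 40.0, np.nan, 50.0])

mean_filled = revenue.fillna(revenue.mean())   # mean of the known values
forward_filled = revenue.ffill()               # last known value

print(mean_filled[4], forward_filled[4])
```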

Mark Ivanov

Hi! You can use this code for backfilling:

        arr = arr[::-1].T  
        mask = np.isnan(arr)  
        idx = np.where(~mask,np.arange(mask.shape[1]),0)  
        np.maximum.accumulate(idx,axis=1, out=idx)  
        arr = arr[np.arange(idx.shape[0])[:,None], idx]  
        arr = arr.T[::-1]

Regards,
Mark
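
As a rough check of what the snippet above does, here it is wrapped in a function and run on a toy window. The name `fill_window`, the `reverse_time` flag, and the data are made up for illustration; rows are days and columns are assets. With the rows reversed before filling, as in the snippet, each NaN takes the next later value in the window, and a NaN on the last day stays NaN. Without the two `[::-1]` reversals, the same code fills forward in time instead.

```python
import numpy as np

def fill_window(arr, reverse_time=True):
    """reverse_time=True reproduces the snippet above (backfill);
    reverse_time=False skips the reversals (forward fill)."""
    if reverse_time:
        arr = arr[::-1]
    arr = arr.T                      # now one row per asset
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    arr = arr[np.arange(idx.shape[0])[:, None], idx]
    arr = arr.T                      # back to one row per day
    if reverse_time:
        arr = arr[::-1]
    return arr

window = np.array([
    [1.0, np.nan],
    [np.nan, 20.0],
    [3.0, np.nan],
])
back = fill_window(window)                          # as posted
forward = fill_window(window, reverse_time=False)   # reversals dropped
print(back[-1], forward[-1])
```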

Stefan

Has anybody from Quantopian had time to look into the wrong forward filling of NaNs described above by me and Blue Seahawk?

Joakim Arvidsson (Cream Mongoose)

Bump.

J. Womack

Bump...again. This is standard capability in most platforms providing access to fundamental data. Not sure why it's not just built in to Pipeline so we don't have to be concerned about data integrity.

Rohit Behl

I agree. This is an essential feature.
