Struggling to replace NaNs with 0 in Pipeline in Research notebook

I'm consuming analyst consensus estimate ratings from Zacks and I'm trying to build an algorithm that weighs the different ratings. It's currently in the research phase, so I'm trying to implement it in Research, but I'm stuck on technical issues.
This is what I'm trying to do:

from quantopian.pipeline.data.zacks import broker_ratings  
import numpy as np

def make_pipeline():  
    #Get analyst estimations  
    rating_strong_buys = broker_ratings.rating_cnt_strong_buys.latest  
    rating_buys = broker_ratings.rating_cnt_mod_buys.latest  
    rating_holds = broker_ratings.rating_cnt_holds.latest  
    rating_sells = broker_ratings.rating_cnt_mod_sells.latest  
    rating_strong_sells = broker_ratings.rating_cnt_strong_sells.latest

    # Calculate rating coefficient  
    ratings = np.array([rating_strong_buys, rating_buys, rating_holds, rating_sells, rating_strong_sells], dtype=float)  
    if np.isnan(ratings).any():  
        ratings = 0  
    rating_weights = np.array([0.025, 0.015, 0.005, -0.005, -0.015])  
    rating_coefficient = (ratings*rating_weights).sum()  

I want to replace the ratings array with 0 if any of the ratings is NaN for further calculation purposes, but there are 2 problems:
When a rating is not defined it seems to be a sequence

my_pipe = make_pipeline()

---------------------------------------------------------------------------  
ValueError                                Traceback (most recent call last)  
<ipython-input-76-dcf250cf5490> in <module>()  
----> 1 my_pipe = make_pipeline()

<ipython-input-75-3e5260d3bc43> in make_pipeline()  
     24  
     25  
---> 26     ratings = np.array([rating_strong_buys, rating_buys, rating_holds, rating_sells, rating_strong_sells], dtype=float)  
     27     rating_weights = np.array([0.025, 0.015, 0.005, -0.005, -0.015])  
     28     rating_coefficient = (ratings*rating_weights).sum()

ValueError: setting an array element with a sequence.  

However, when I comment out calculations and run make_pipeline to see missing ratings, they are just NaNs.

How can I successfully replace these NaNs in Pipeline?

5 responses

I believe the problem isn't with NaNs but rather with putting factors into a NumPy array. You specify dtype=float. The array elements, however, are factors, not floats, and not even the actual data. Factors are objects which represent how to get data, not the data itself. See https://www.quantopian.com/help#building-computations.

So, if you eliminate the dtype parameter it (surprisingly to me) works. I've attached a notebook with an example. Not sure this is the most efficient way to do this. May also want to use a mask for the factors which could speed things up?
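To illustrate why dropping dtype makes a difference, here is a minimal sketch in plain NumPy. LazyTerm is a hypothetical toy class, not Quantopian's actual factor implementation; it only mimics the idea that a factor records a computation rather than holding numbers. The exact exception raised by a real factor may differ from the one this toy raises.

```python
import numpy as np


class LazyTerm(object):
    """Toy stand-in for a Pipeline factor: it describes a computation, it holds no data."""

    def __init__(self, expr):
        self.expr = expr

    def __mul__(self, other):
        # Multiplying by a weight yields a new description, not a number.
        return LazyTerm("({} * {})".format(self.expr, other))

    __rmul__ = __mul__

    def __add__(self, other):
        other_expr = other.expr if isinstance(other, LazyTerm) else other
        return LazyTerm("({} + {})".format(self.expr, other_expr))

    __radd__ = __add__


terms = [LazyTerm("strong_buys"), LazyTerm("buys")]
weights = np.array([0.025, 0.015])

# Forcing dtype=float asks NumPy to turn each object into a number, which fails.
try:
    np.array(terms, dtype=float)
    float_conversion_failed = False
except (TypeError, ValueError):
    float_conversion_failed = True

# Without dtype, NumPy falls back to an object array and simply defers to the
# objects' own * and + operators, so the result is still a lazy expression.
as_objects = np.array(terms)
combined = (as_objects * weights).sum()
print(as_objects.dtype, combined.expr)
```

The point of the sketch is that the "working" version never touches any data: np.isnan cannot inspect it, because nothing has been computed yet.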

Good luck!

Hi Dan! Thanks a lot for the necessary education on factors. I did the tutorial on Pipeline, but now it really hit me.

There's one problem though: I need to have the NaNs replaced before I run the Pipeline, because the sum of weighted ratings is going to feed into a new variable, potential, together with a couple of others. I'm adding the Notebook here so you can see the logic. I want to restrict the results to the top 100 stocks ranked by potential, but this needs to be done before the Pipeline runs to optimise calculations.

Do you have any idea how to do that before the Pipeline runs?

Siim,

I wouldn't try to do all of this in the pipeline calculations. I would break the problem in two. First, use the pipeline to define all the raw data (and maybe some simple calculated columns). Second, run the pipeline to return a dataframe, then perform all your calculations, sorting, filtering, etc. on that dataframe. That way you have all the power of Pandas working with actual data and aren't constrained by the factor construct.

This is actually the approach I generally take. Set up the pipeline to simply retrieve and aggregate all the data. In the "before_trading" function, run the pipeline, get the pipeline output as a dataframe with all the data, then perform all the selection logic.

Define the data in the pipeline setup. Retrieve and manipulate the data in the "before_trading" function.

See the attached notebook. One added benefit of this method, when using notebooks, is the visibility of the data after each calculation.
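As a rough sketch of the second step, the snippet below works on a small hand-made dataframe standing in for the pipeline output. The column names, the asset labels, and the helper names rating_coefficient and top_n are assumptions for illustration; in practice the frame would come from running the pipeline. The weights are the ones from the original question, and a row with any missing rating gets a coefficient of 0, as requested there.

```python
import numpy as np
import pandas as pd

# Weights from the original question, keyed by (assumed) pipeline column names.
RATING_WEIGHTS = pd.Series({
    "strong_buys": 0.025,
    "buys": 0.015,
    "holds": 0.005,
    "sells": -0.005,
    "strong_sells": -0.015,
})


def rating_coefficient(output):
    """Weighted sum of rating counts; 0 for any row with a missing rating."""
    ratings = output[RATING_WEIGHTS.index]
    complete = ratings.notnull().all(axis=1)
    weighted = ratings.mul(RATING_WEIGHTS, axis=1).sum(axis=1)
    return weighted.where(complete, 0.0)


def top_n(output, n=100):
    """Keep the n rows with the largest rating coefficient."""
    scored = output.assign(rating_coefficient=rating_coefficient(output))
    return scored.nlargest(n, "rating_coefficient")


# Hand-made stand-in for the dataframe a pipeline run would return.
output = pd.DataFrame(
    {
        "strong_buys": [2.0, np.nan, 0.0],
        "buys": [1.0, 1.0, 0.0],
        "holds": [0.0, 1.0, 1.0],
        "sells": [0.0, 1.0, 2.0],
        "strong_sells": [0.0, 1.0, 3.0],
    },
    index=["AAA", "BBB", "CCC"],
)

coefficients = rating_coefficient(output)
best = top_n(output, n=2)
print(best)
```

Because everything here is ordinary Pandas, each intermediate result can be inspected in the notebook before moving on.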

Also, you may want to use the new(ish) builtin universe filters (like Q1500US) instead of rolling your own. See https://www.quantopian.com/help#built-in-filters.

Good luck.

Thanks, Dan! It's much appreciated. When doing the Pipeline tutorial, I was under the impression that you'd want to do as much filtering and computation as possible in Pipeline in order not to retrieve the entire universe all the time. I haven't run the algorithm in the IDE yet, so I have no idea how long it would take, but I'll try your recommended approach as it seems to be much easier and better structured.

When trying to solve this problem, I wondered: if one really wanted to, couldn't it be solved with custom factors? Or would that lead to the same result?

Siim, custom factors are something to look into (one more tool in the tool box). They are pretty straightforward once you get the hang of them. One can also pass initialization parameters when instantiating the class, which can make these factors more general (see https://www.quantopian.com/posts/adding-my-own-parameters-to-customfactor). One constraint, however, is that currently the inputs to a factor must be a BoundColumn. One cannot use the results of a factor as an input (but one can do some interesting things even with that limitation; see https://www.quantopian.com/posts/composite-inputs-for-customfactors).
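For what it's worth, the numeric core of such a custom factor could look roughly like the sketch below. It is written as a standalone NumPy function so it can be tried outside the platform; weighted_ratings is a hypothetical helper, and the assumption is that it would be called from a CustomFactor's compute method, which receives one (window_length, number of assets) array per input column plus an output array to fill. Inside compute the values are real arrays, so np.isnan works there.

```python
import numpy as np

# Weights from the original question, in the order the inputs are passed.
RATING_WEIGHTS = np.array([0.025, 0.015, 0.005, -0.005, -0.015])


def weighted_ratings(out, *rating_windows):
    """Fill `out` with the weighted rating sum per asset.

    Each element of rating_windows is assumed to be a 2D array of shape
    (window_length, n_assets). Only the most recent row is used. Assets
    with any missing rating get a result of 0.
    """
    latest = np.stack([w[-1] for w in rating_windows], axis=0).astype(float)
    has_nan = np.isnan(latest).any(axis=0)       # one flag per asset
    latest = np.where(has_nan, 0.0, latest)      # zero out incomplete assets
    out[:] = RATING_WEIGHTS.dot(latest)


# Hand-made inputs: window_length of 1, three assets.
strong_buys = np.array([[2.0, np.nan, 0.0]])
buys = np.array([[1.0, 1.0, 0.0]])
holds = np.array([[0.0, 1.0, 1.0]])
sells = np.array([[0.0, 1.0, 2.0]])
strong_sells = np.array([[0.0, 1.0, 3.0]])

out = np.empty(3)
weighted_ratings(out, strong_buys, buys, holds, sells, strong_sells)
print(out)
```

Since the five rating counts are plain dataset columns, they would satisfy the BoundColumn constraint on inputs mentioned above.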