Does a Custom Factor always have to return an array of length N?

posted Jul 25, 2018

Hi all,

I am trying to calculate the mean of the top 80% quantile of the earnings yield from the fundamental data. So for each day, I would like to see one number only.
I am doing this through a pipeline.
The final result that I want to see is just the last number at the end of the notebook, one value per day.
However, the result from the pipeline contains a lot of NaNs, along with the number I need repeated across an array of length N.
After looking through the API docs, I see it is documented as: "The job of the compute function is to produce a one-dimensional array of length N as an output."
Is that the default behavior? Can it be changed?

Thanks,
L.T.

2 responses

Dan Whitnable

Jul 25, 2018

By definition factors return columns of n results for n securities. I like to think of the pipeline as just a database call, or maybe even more simply, populating a spreadsheet with data (since it's always a 2D table). Don't expect it to return single values or anything which isn't in a format with securities for the rows and data values for the columns. Now, one CAN force a single value into all rows of a factor but just because one can doesn't mean that one should. There may be times this is appropriate, but to do simple column calculations maybe don't.
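For anyone curious what "forcing a single value into all rows" looks like, here is a rough sketch in plain NumPy. It only mimics the shape of a custom factor's `compute(self, today, assets, out, *inputs)` method; the function name `compute_broadcast_mean`, the sample numbers, and the reading of "top 80% quantile" as "values at or above the 80th percentile" are all assumptions made for illustration, not anything taken from the original notebook.

```python
import numpy as np

def compute_broadcast_mean(today, assets, out, earning_yield):
    # Hypothetical stand-in for a custom factor's compute body.
    # earning_yield has shape (window_length, N); use the latest row.
    latest = earning_yield[-1]
    valid = latest[~np.isnan(latest)]       # ignore securities with no data
    cutoff = np.percentile(valid, 80)       # assumed meaning of "top 80% quantile"
    # Assigning a scalar to out[:] writes the same number into all N slots,
    # which is why the pipeline output shows one repeated value.
    out[:] = valid[valid >= cutoff].mean()

# Made-up data: one day, ten securities, one of them missing.
window = np.array([[0.02, 0.05, np.nan, 0.08, 0.11, 0.03, 0.07, 0.10, 0.01, 0.04]])
out = np.empty(window.shape[1])
compute_broadcast_mean(None, None, out, window)
```

The output is still an array of length N; the scalar is just copied into every row, so nothing is gained over computing it once outside the pipeline.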

I would do exactly as in your spreadsheet and calculate the mean (or other values) outside of the pipeline, probably right after the pipeline_output call in the "before_trading_start" function. This has always seemed the cleanest to me. It is especially powerful and intuitive since the pipeline returns a pandas DataFrame, so one can easily use any of the pandas methods. I think of it as a three-step process: 1) get the raw data from the pipeline, 2) do any post-processing on the data to complete everything the algo needs, 3) execute logic using this data to guide one's algo. I find this a clean approach, and it separates the data from the logic.
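A minimal sketch of step 2, using only pandas. The DataFrame below is a made-up stand-in for what pipeline_output would return (one row per security); the column name `earning_yield`, the helper `top_quantile_mean`, and the interpretation of "top 80% quantile" as "values at or above the 80th percentile" are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by pipeline_output:
# securities as rows, pipeline columns as columns.
output = pd.DataFrame(
    {"earning_yield": [0.02, 0.05, np.nan, 0.08, 0.11, 0.03, 0.07, 0.10, 0.01, 0.04]},
    index=["S%d" % i for i in range(10)],
)

def top_quantile_mean(values, q=0.8):
    # Drop missing values, find the q-th quantile, and average
    # everything at or above that cutoff.
    values = values.dropna()
    cutoff = values.quantile(q)
    return values[values >= cutoff].mean()

# One number for the day, computed after the pipeline has run.
daily_value = top_quantile_mean(output["earning_yield"])
```

Inside an algorithm, the same helper could be called on the real pipeline output each morning and the result stored for later use.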

Just my opinion...

Good luck.

Long Tat

Jul 25, 2018

Thanks Dan.
That was a really constructive and detailed explanation.
I do have a second notebook that did something similar to what you suggested.
I was under the impression (after searching through the forum questions and answers) that pipeline is a much more powerful tool than just a way of querying data. In the past, there was a get_fundamental_data function, and pipeline was stated to be a better replacement for it.
If pipeline could just calculate the value and return it, then with a reschedule function there would be no need for the "before_trading_start" function.
Also, I think the majority of cross-sectional calculations should result in just a few numbers. After all, the reason to do cross-sectional analysis is the hope that the law of large numbers will mask out some of the noise. If that is the case, why would returning a single value per cross-sectional date not make sense? I guess I can see that in certain situations, like calculating rolling returns or a moving average, it makes sense to return an array of length N securities.
In summary, I see a potential benefit in dynamically adjusting the data structure that a custom factor returns. This behavior is widely adopted in R programming. I am a new user of Quantopian, so I am guessing this might be due to some limitation that I am unable to see at the moment.
