Arithmetic on pipeline factors

I'm trying to do the following with 3 pipeline factors, to combine them into a single super-alpha:
- zscore() to scale them equally
- winsorize them (limit them to between -3 and +3)
- add them together

There is a method to do the z-scaling, and you can add factors, but I'm not sure how to do the winsorizing...

6 responses

Zipline (which Quantopian uses) does have a 'winsorize' method. It's just not documented in the Quantopian docs. Check out the zipline docs here https://www.zipline.io/appendix.html#zipline.pipeline.Factor.winsorize .

The documentation states

    Winsorizing changes values ranked less than the minimum percentile to  
    the value at the minimum percentile. Similarly, values ranking above  
    the maximum percentile are changed to the value at the maximum  
    percentile.

    Winsorizing is useful for limiting the impact of extreme data points  
    without completely removing those points.

    If ``mask`` is supplied, ignore values where ``mask`` returns False  
    when computing percentile cutoffs, and output NaN anywhere the mask is  
    False.

    If ``groupby`` is supplied, winsorization is applied separately  
    to each group defined by ``groupby``.

    Parameters  
    ----------  
    min_percentile: float, int  
        Entries with values at or below this percentile will be replaced  
        with the (len(input) * min_percentile)th lowest value. If low  
        values should not be clipped, use 0.  
    max_percentile: float, int  
        Entries with values at or above this percentile will be replaced  
        with the (len(input) * max_percentile)th lowest value. If high  
        values should not be clipped, use 1.  
    mask : zipline.pipeline.Filter, optional  
        A Filter defining values to ignore when winsorizing.  
    groupby : zipline.pipeline.Classifier, optional  
        A classifier defining partitions over which to winsorize.

    Returns  
    -------  
    winsorized : zipline.pipeline.Factor  
        A Factor producing a winsorized version of self.

    Examples  
    --------  
    .. code-block:: python

        price = USEquityPricing.close.latest  
        columns={  
            'PRICE': price,  
            'WINSOR_1': price.winsorize(  
                min_percentile=0.25, max_percentile=0.75  
            ),  
            'WINSOR_2': price.winsorize(  
                min_percentile=0.50, max_percentile=1.0  
            ),  
            'WINSOR_3': price.winsorize(  
                min_percentile=0.0, max_percentile=0.5  
            ),

        }

    Given a pipeline with columns, defined above, the result for a  
    given day could look like:

    ::

                'PRICE' 'WINSOR_1' 'WINSOR_2' 'WINSOR_3'  
        Asset_1    1        2          4          3  
        Asset_2    2        2          4          3  
        Asset_3    3        3          4          3  
        Asset_4    4        4          4          4  
        Asset_5    5        5          5          4  
        Asset_6    6        5          5          4

    See Also  
    --------  
    :func:`scipy.stats.mstats.winsorize`  
    :meth:`pandas.DataFrame.groupby`

Winsorizing caps values at percentile cutoffs of the data, not at absolute values, so this method can't be used to limit values to +/- 3, for instance.

But assuming one wants to limit values to between the 3rd and 97th percentiles, something like this should work.

    factor1 = SomeFactor1().zscore().winsorize(min_percentile=0.03, max_percentile=0.97)
    factor2 = SomeFactor2().zscore().winsorize(min_percentile=0.03, max_percentile=0.97)
    factor3 = SomeFactor3().zscore().winsorize(min_percentile=0.03, max_percentile=0.97)
    combined_factor = factor1 + factor2 + factor3
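For comparison, capping z-scores at an absolute bound such as +/- 3, which is what the original question asks for, is a clip rather than a percentile winsorize. The following is only a NumPy sketch of that arithmetic on plain arrays, run outside Pipeline; the helper names `zscore` and `clip_zscores` and the sample data are made up for illustration.

```python
import numpy as np


def zscore(values):
    """Demean and scale to unit standard deviation, ignoring NaNs."""
    values = np.asarray(values, dtype=float)
    return (values - np.nanmean(values)) / np.nanstd(values)


def clip_zscores(values, bound=3.0):
    """Z-score the values, then cap them at +/- bound (absolute limits)."""
    return np.clip(zscore(values), -bound, bound)


# Hypothetical factor values for 20 assets on one day.
factor_a = np.array([0.0] * 19 + [100.0])  # one extreme outlier
factor_b = np.arange(20.0)
factor_c = np.arange(20.0)[::-1]

# Scale and cap each factor, then add them into one combined score.
combined = clip_zscores(factor_a) + clip_zscores(factor_b) + clip_zscores(factor_c)
```

The outlier in `factor_a` has a raw z-score above 4, so it is the only value the clip changes; a percentile-based winsorize would instead alter a fixed share of the values whether or not they are extreme.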

Great, thanks Dan. Good to know I should check the zipline docs too for solutions.

Another way to do this sort of thing is to pass the output of each Pipeline factor to a global function (e.g. see https://www.quantopian.com/posts/long-short-multi-equity-algo) for preprocessing, prior to combination. Then, you can use:

https://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.mstats.winsorize.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html

To me, this architecture is preferred, since one does all of the factor preprocessing in one place, and it is extensible to any Python libraries (or custom code) that one would want to apply.

That said, perhaps there are performance advantages to using built-in Pipeline tools?

Also, I'm not sure how to pass the output of a built-in Pipeline factor to a global function. Presumably, it is just a matter of calling the built-in Pipeline function from within a Pipeline custom factor. If this can be demonstrated, then it is a matter of writing all factors as custom factors, and then passing the output to a global preprocess function.
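A rough sketch of what such a global preprocess function might look like, using the two SciPy functions linked above. The function name `preprocess`, the 10% limits, and the sample values are assumptions for illustration, and the input is a plain array rather than real Pipeline output.

```python
import numpy as np
from scipy.stats import zscore
from scipy.stats.mstats import winsorize


def preprocess(values, limits=(0.1, 0.1)):
    """Z-score the values, then winsorize the given fraction in each tail."""
    scaled = zscore(np.asarray(values, dtype=float))
    # winsorize returns a masked array; convert back to a plain ndarray.
    return np.asarray(winsorize(scaled, limits=limits))


# Hypothetical raw factor values for 10 assets, one of them extreme.
raw = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
processed = preprocess(raw)
```

With 10 values and 10% limits, the lowest value is replaced by the second lowest and the highest by the second highest. Any other library call or custom step could be added inside `preprocess` without touching the factor definitions.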

Grant, I tried to apply arbitrary functions to factors and ran into problems because they were converting the type from a Factor object to an int (say).

Addition works because I imagine the + operator is overloaded. More complex operations might require you to subclass the Factor class and modify the method. I assume zscore() is a method of Factor.

I don't know for sure because Python isn't my mother tongue.

I tried custom factors too, but the trouble is that you can only pass a single dataset to the compute method; extra named parameters were not allowed. I could easily be missing something though.

I would definitely prefer to use the method you mention because there are other more complex ways to combine alphas that I wanted to try...

@Nick: You are correct about the + operator being overloaded, along with -, /, and *. You're also correct that zscore is a method on the Factor class. You can find all of the methods on the Factor class in the API reference.
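To illustrate the idea of overloaded operators, here is a toy sketch. `ToyFactor` is a hypothetical class, not zipline's implementation: it only records the expression as a string, to show why arithmetic on factors yields another factor-like object instead of a number.

```python
class ToyFactor:
    """Hypothetical stand-in for a factor: records an expression, computes nothing."""

    def __init__(self, expr):
        self.expr = expr

    def _binary(self, op, other):
        # Accept either another ToyFactor or a plain number on the right.
        rhs = other.expr if isinstance(other, ToyFactor) else repr(other)
        return ToyFactor(f"({self.expr} {op} {rhs})")

    def __add__(self, other):
        return self._binary("+", other)

    def __sub__(self, other):
        return self._binary("-", other)

    def __mul__(self, other):
        return self._binary("*", other)

    def __truediv__(self, other):
        return self._binary("/", other)


a, b, c = ToyFactor("a"), ToyFactor("b"), ToyFactor("c")
combined = (a + b + c) / 3
print(combined.expr)
```

An arbitrary function such as `int()` or a NumPy call has no such overload to hook into, which is one plausible reason applying it directly to a factor object fails or changes its type.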

If it helps, CustomFactors can be passed multiple datasets (technically, BoundColumns are passed to compute). You just need to make sure that the number of BoundColumns included in the signature of compute is the same as the number being passed when it's called. You can find an example that uses two inputs in lesson 10 of the Pipeline tutorial (look at the TenDayMeanDifference example). Let me know if you have any questions.
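As a sketch of the signature rule, the function below mirrors a two-input `compute`. In zipline it would be a method on a `CustomFactor` subclass that declares two entries in `inputs` (for example close and open prices) and a `window_length`; here it is a plain function fed made-up arrays so it runs without zipline. The name `open_` and the sample data are assumptions for illustration.

```python
import numpy as np


def compute(today, assets, out, close, open_):
    """One positional argument per declared input, in the same order.

    Each input arrives as a (window_length, n_assets) array; the result
    for each asset is written into `out`.
    """
    out[:] = np.nanmean(close - open_, axis=0)


# Made-up data: a 10-day window for 3 assets.
window_length, n_assets = 10, 3
close = np.full((window_length, n_assets), 11.0)
open_ = np.full((window_length, n_assets), 10.0)
out = np.empty(n_assets)

compute(None, np.arange(n_assets), out, close, open_)
print(out)
```

Declaring a third input would mean adding a third array argument after `open_`; a mismatch between the two counts is what triggers the error.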

Thanks Jamie. I was actually able to do what I wanted just using the overloaded +, -, * and / operators.

I can see that passing a list of all the data sets I'm using to a single custom factor would enable any kind of processing though.