What does the make_factors function do?

Hey guys, I am creating a new algorithm and have been looking at code from other people. Some of them just use the line "factors = make_factors()". I would like to know how and when to use this function.

Dan Whitnable

The short answer to why some people create a make_factors function: to save typing. There is no standard (and certainly no built-in) function called make_factors, and one will see several implementations and approaches. The intention, though, is to return an iterable, often a dict, which maps a string name to each factor class one wants to use in a pipeline definition. If one wants to run many versions of a pipeline (either in an algo or a notebook), each with a different permutation of a number of factors, one could create a make_factors function. To test different sets of these factors, all one needs to do is comment out specific factors in the function. There is no need to edit the rest of the code.

Here is an example

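from quantopian.pipeline import CustomFactor, Pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.filters import QTradableStocksUS

def make_factors():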
    """  
    First code all the factor classes one wants to use.  
    """  
    class Returns_To_Last_Month(CustomFactor):  
        inputs = [USEquityPricing.close]  
        window_length = 200  
        def compute(self, today, assets, out, close):  
            out[:] = (close[-40] - close[0]) / close[0]

    class Returns_To_Current_Month(CustomFactor):  
        inputs = [USEquityPricing.close]  
        window_length = 200  
        def compute(self, today, assets, out, close):  
            out[:] = (close[-20] - close[0]) / close[0]

    class Returns_To_Current_Week(CustomFactor):  
        inputs = [USEquityPricing.close]  
        window_length = 20  
        def compute(self, today, assets, out, close):  
            out[:] = (close[-5] - close[0]) / close[0]

    # Finally, return a dict of factors and associated names  
    # One can easily modify desired dict of factors with comments  
    return { 'Returns_To_Last_Month': Returns_To_Last_Month,  
           # 'Returns_To_Current_Month': Returns_To_Current_Month,  
             'Returns_To_Current_Week': Returns_To_Current_Week,  
           }

Now that we have an iterable of factors, it can be used to build a pipeline definition. Maybe like this:

def make_pipeline():  
    # Define universe
    universe = QTradableStocksUS()

    # Define base pipeline  
    pipe = Pipeline(columns = {}, screen=universe)

    # Get a dict of the factors we want to use  
    factors = make_factors()

    # Add zscores to our pipeline output  
    # Create a score by summing zscores  
    score = 0  
    for factor_name, factor_class in factors.items():  
        zscore = factor_class(mask=universe).zscore(mask=universe)  
        name = factor_name + '_zscore'

        # Append all our factor zscores to the pipeline  
        pipe.add(zscore, name)

        # Sum up the zscores to get a total score  
        score += zscore

    # Append score to the pipeline  
    pipe.add(score, 'score')

    return pipe

If one has a large number of factors to test, iterating through a factor dict to create the pipeline definition can mean a lot less typing, and potentially far fewer errors, than entering and modifying each factor by hand. That said, it does sacrifice a bit of readability. This approach is typically only used when looking at many (more than 10) factors and applying consistent logic (e.g. summing zscores). Otherwise it may be just as easy to cut and paste. Everyone has their own coding approach and style. This is just another tool for one's toolbox.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Leonardo Cunha

Dan, thank you very much for your answer!! I'm new to the Quantopian world and your answer helped me a lot in reading code that I was struggling with!!
However, I am having serious difficulty with the part where you iterate over your CustomFactors, in these lines:

    for factor_name, factor_class in factors.items():
        zscore = factor_class(mask=universe).zscore(mask=universe)
        name = factor_name + '_zscore'

Why do you have to iterate over your Custom Factors before putting them in the Pipeline, while Standard Factors (like SimpleMovingAverage) can be put directly into the pipeline? And why do you use mask twice?
In the second notebook at this link:
https://www.quantopian.com/posts/new-video-learn-from-the-experts-ep-1-full-algorithm-creation-with-vedran-rusman
Instead of factor_class he uses (f, w). Can you explain why? Is he using a tuple (f, w) instead of a single value (factor_class) to iterate? What is happening?
I was thinking about making zscores relative to each sector, instead of relative to all the companies. What is the most effective way of doing this?

Thank you for everything and sorry for all these newbie questions, but I am very excited to make my first algorithm :D

Dan Whitnable

I'll try to answer the questions above...

1. Why do you have to iterate over your Custom Factors before putting them in the Pipeline, while Standard Factors (like SimpleMovingAverage) can be put directly into the pipeline? And why do you use mask twice?

The reason one iterates over the dict of custom factors is simply to save typing them out individually. Assume the factors in our factors dict are Returns_To_Last_Month, Returns_To_Current_Month, and Returns_To_Current_Week. Then the two chunks of code below will produce the EXACT same pipeline definitions.

    for factor_name, factor_class in factors.items():  
        zscore = factor_class(mask=universe).zscore(mask=universe)  
        name = factor_name + '_zscore'

        # Append all our factor zscores to the pipeline  
        pipe.add(zscore, name)

        # Sum up the zscores to get a total score  
        score += zscore

    zscore_1 = Returns_To_Last_Month(mask=universe).zscore(mask=universe)
    zscore_2 = Returns_To_Current_Month(mask=universe).zscore(mask=universe)
    zscore_3 = Returns_To_Current_Week(mask=universe).zscore(mask=universe)

    pipe.add(zscore_1, "Returns_To_Last_Month_zscore")
    pipe.add(zscore_2, "Returns_To_Current_Month_zscore")
    pipe.add(zscore_3, "Returns_To_Current_Week_zscore")

    score = zscore_1 + zscore_2 + zscore_3

The first chunk of code iterates through the dict and builds the pipeline procedurally. The second chunk of code does the same thing just explicitly types out each line. In this case, the first approach uses 5 lines of code (excluding comments). The second approach takes 7 lines. If one had 30 factors, the first approach would still use 5 lines, however the second approach would use 61 lines. If one has a lot of factors that's a lot of extra typing. Moreover, that's a lot of potential places to make a mistake. The first approach only requires a single change to the factors dict and one can be confident it will work the same with 3 factors or 30. The second approach is, arguably, more readable and 'self documenting'. So, I personally would always use the second approach unless one has a lot of factors and wishes to try many different permutations. Again, it's just to save typing and is personal preference.
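To make the equivalence concrete without needing the Quantopian environment, here is a minimal plain-Python sketch. StandInFactor and its zscore method are hypothetical stand-ins (they only return a label), not real pipeline objects. The point is just that the loop and the hand-typed version end up with the same names mapped to the same things.

```python
class StandInFactor:
    """Hypothetical stand-in for a CustomFactor; zscore() only returns a label."""
    def zscore(self):
        return "zscore of " + type(self).__name__

class Returns_To_Last_Month(StandInFactor):
    pass

class Returns_To_Current_Month(StandInFactor):
    pass

class Returns_To_Current_Week(StandInFactor):
    pass

factors = {
    "Returns_To_Last_Month": Returns_To_Last_Month,
    "Returns_To_Current_Month": Returns_To_Current_Month,
    "Returns_To_Current_Week": Returns_To_Current_Week,
}

# Approach 1: build the columns procedurally from the dict.
looped = {}
for factor_name, factor_class in factors.items():
    looped[factor_name + "_zscore"] = factor_class().zscore()

# Approach 2: type every column out by hand.
explicit = {
    "Returns_To_Last_Month_zscore": Returns_To_Last_Month().zscore(),
    "Returns_To_Current_Month_zscore": Returns_To_Current_Month().zscore(),
    "Returns_To_Current_Week_zscore": Returns_To_Current_Week().zscore(),
}

print(looped == explicit)
```

Adding a fourth factor to approach 1 only means adding one entry to the factors dict; approach 2 needs a new line everywhere the factor is used.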

Now, for the second part of the question: "why do you use mask twice?". Good observation! It is a bit redundant in this case. Masks limit the 'universe' of assets given to a factor. Without a mask, a factor will calculate a value for every asset in the Quantopian database. This is roughly 9000 assets including ETFs and ETNs. If one is only ever concerned with a smaller subset of all these assets, using a mask will reduce the calculation time. Most of the built-in factors are quite fast and one probably won't notice the difference. However, custom factors can sometimes be quite compute-intensive, and adding a mask can often speed things up. There isn't typically a downside to using a mask. That is why a mask was used for factor_class(mask=universe).

A second reason to use a mask is to explicitly define the universe for methods. Notice the first case limited the universe for a factor. This second case limits the universe for a method. The example above uses zscore(mask=universe). The zscore method calculates the mean and standard deviation of 'universe' to determine each individual zscore. Without a mask, it would use all the factor values as the 'universe'. Most often one wants to zscore against like assets and not every asset (like those ETFs and ETNs included in the entire Quantopian database). The mask in this case isn't about speed but will actually impact the calculations. In this case though, since we already masked the factor, the same mask isn't required for the method. I usually include it to simply be explicitly clear about what universe we are zscoring against.
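Here is a rough plain-Python sketch of why the mask on zscore changes the numbers. It is not the pipeline implementation: the zscore function, the tickers, and the values are all made up for illustration, and it assumes a population standard deviation. The same three stocks get different zscores depending on whether the outlying "ETF" value is part of the reference group.

```python
from statistics import mean, pstdev

def zscore(values, mask=None):
    """Z-score each value against the mean/std of the masked subset only.

    Hypothetical helper: entries outside the mask get None
    (a real pipeline would report NaN for them).
    """
    if mask is None:
        mask = {name: True for name in values}
    kept = [v for name, v in values.items() if mask[name]]
    mu, sigma = mean(kept), pstdev(kept)
    return {
        name: ((v - mu) / sigma if mask[name] else None)
        for name, v in values.items()
    }

# Made-up factor values; "ETF" is an outlier we do not want to score against.
values = {"AAA": 1.0, "BBB": 2.0, "CCC": 3.0, "ETF": 10.0}
universe = {"AAA": True, "BBB": True, "CCC": True, "ETF": False}

unmasked = zscore(values)               # mean/std taken over all four values
masked = zscore(values, mask=universe)  # mean/std taken over AAA, BBB, CCC only

print(unmasked["BBB"], masked["BBB"])
```

With the mask, BBB sits exactly at the mean of its peers; without it, the outlier drags the mean up and BBB looks below average.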

One word of caution while on the topic of masks for zscore. The zscore method handles nans gracefully, however, it does not handle infinite values well. If there is a chance that any factor values would ever be +inf or -inf then include a mask like this

my_factor_zscore = my_factor.zscore(mask=my_factor.isfinite())
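To illustrate one way an infinite value can spoil things, here is a small sketch in plain Python floats. The zscore helper below is hypothetical and only mimics the arithmetic (mean and population standard deviation); it is not the pipeline code. A single inf makes the mean infinite, and every resulting zscore becomes nan. Masking with a finiteness check avoids that.

```python
import math

def zscore(values, mask=None):
    """Hypothetical helper: z-score values against the masked subset only."""
    if mask is None:
        mask = [True] * len(values)
    kept = [v for v, keep in zip(values, mask) if keep]
    mu = sum(kept) / len(kept)
    sigma = math.sqrt(sum((v - mu) * (v - mu) for v in kept) / len(kept))
    # Entries outside the mask are reported as nan.
    return [(v - mu) / sigma if keep else math.nan
            for v, keep in zip(values, mask)]

values = [1.0, 2.0, 3.0, math.inf]

# No mask: the inf leaks into the mean and std, so every zscore is nan.
poisoned = zscore(values)

# Mask out non-finite values first, similar in spirit to mask=my_factor.isfinite().
cleaned = zscore(values, mask=[math.isfinite(v) for v in values])

print(poisoned)
print(cleaned)
```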

2. Instead of factor_class he uses (f, w). Can you explain why? Is he using a tuple (f, w) instead of a single value (factor_class) to iterate? What is happening?

Everyone has a little different approach to coding a make_factors function and what the output of that function is. My make_factors example above simply created a dict which held a factor name and a factor class. The make_factors function in the referenced notebook adds one more piece of information... a weight. It stores a factor name (as the key) and then associates it to a tuple which has the factor class and a weight. Using a tuple is a way to store two pieces of data in a dict instead of just one. Iterating over the factors dict will produce the name, the class, and a weight.
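As a sketch of that pattern (not the notebook's actual code), here is a plain-Python version where each dict value is a (factor, weight) tuple. The two factor functions, their return values, and the weights are all hypothetical placeholders. The thing to notice is the loop header: the tuple is unpacked into f and w right in the for statement.

```python
def returns_to_last_month():
    """Hypothetical stand-in factor: just returns a fixed number."""
    return 0.5

def returns_to_current_week():
    """Hypothetical stand-in factor: just returns a fixed number."""
    return -0.25

def make_factors():
    # Each name maps to a (factor, weight) tuple; the weights are made up.
    return {
        "Returns_To_Last_Month": (returns_to_last_month, 2.0),
        "Returns_To_Current_Week": (returns_to_current_week, 1.0),
    }

score = 0.0
for name, (f, w) in make_factors().items():
    # f is the factor, w is its weight; name is the dict key.
    score += w * f()

print(score)
```

Writing "for name, value in ..." and then "f, w = value" inside the loop would do exactly the same thing; unpacking in the loop header is just shorter.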

I want to point out something which I hope is apparent by now. Creating a make_factors function will make your code harder to read and understand. Period. Unless there is a compelling reason, simply define all your factors explicitly. The resulting pipelines will be identical.

3. I was thinking about making zscores relative to each sector, instead of relative to all the companies. What is the most effective way of doing this?

The zscore method has a parameter specifically for this purpose called groupby (see the docs). So something like the following will zscore relative to the sector a stock belongs to and not relative to all stocks in the universe.

        from quantopian.pipeline.classifiers.morningstar import Sector  
        zscore = factor_class(mask=universe).zscore(groupby=Sector())
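To show what grouping does to the numbers, here is a plain-Python sketch with made-up tickers, sectors, and values. The zscore_by_group helper is hypothetical and assumes a population standard deviation; it is only meant to illustrate the idea behind groupby, not to reproduce the pipeline's implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev

def zscore_by_group(values, groups):
    """Hypothetical helper: z-score each value against its own group's mean/std.

    Note: a group with a single member (std of zero) would divide by zero here.
    """
    members = defaultdict(list)
    for name, value in values.items():
        members[groups[name]].append(value)
    stats = {g: (mean(vs), pstdev(vs)) for g, vs in members.items()}
    result = {}
    for name, value in values.items():
        mu, sigma = stats[groups[name]]
        result[name] = (value - mu) / sigma
    return result

# Made-up factor values: tech names are on a much larger scale than utilities.
values = {"TECH_A": 10.0, "TECH_B": 20.0, "UTIL_A": 1.0, "UTIL_B": 3.0}
sectors = {"TECH_A": "tech", "TECH_B": "tech", "UTIL_A": "util", "UTIL_B": "util"}

by_sector = zscore_by_group(values, sectors)
print(by_sector)
```

Scored against everything at once, both utilities would look low simply because their sector's values are smaller. Scored within their sector, UTIL_B ranks as high as TECH_B does within tech.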

Hope those answers helped.

Leonardo Cunha

Dan, thank you so much for the answers, they are mind-blowing for me! Your answers are really helping me to improve quickly and I am making huge progress in developing my factors now. However, I am still having problems running the "make_pipeline()" function :(

After the "return pipe" inside the make_pipeline() function, I tried to use this code to run the pipeline:

run_pipeline(make_pipeline(), start_date = start, end_date = end)

But it produces this error:

--------------------------------------------------------------------------- TypeError
Traceback (most recent call last) in ()
----> 1 run_pipeline(make_pipeline(), start_date = start, end_date = end)

in make_pipeline()
10
11 for name, value in factors.items():
---> 12 zscore = value(mask=universe).zscore(mask=universe, groupby = Sector())
13 name_zscore = name + '_zscore'
14

TypeError: 'Latest' object is not callable

I even tried to make the function make_pipeline() return this:

return run_pipeline(pipe, start_date = start, end_date = end)

instead of "return pipe" as you did in your algorithms. But nothing happened in this case.

After defining the make_pipeline() function, how can I run this pipeline in the notebook? What am I doing wrong in the cases above? And why, if I try to execute this command:
run_pipeline(pipe, start_date = start, end_date = end) after the return, do I get an error saying that " 'pipe' is not defined"?

PS: Sorry for so many questions

Leonardo Cunha

Sorry, I forgot to attach my notebook
