Hey guys, I am creating a new algorithm and I was looking at some code from other people. Some of them just use the function "factors = make_factors". I wanted to know how and when to use this function.
The short answer to why some people create a make_factors function is to save typing. There isn't a standard (and certainly not a built-in) function called make_factors, and one will see several implementations and approaches. The intention, though, is to return an iterable, often a dict, which pairs each factor class with the string name one wants to use for it in a pipeline definition. If one wants to run many versions of a pipeline (either in an algo or a notebook), each with a different combination of factors, one could create a make_factors function. To test different sets of these factors, all one needs to do is comment out specific factors in the function. No need to edit the rest of the code.
Here is an example:
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing

def make_factors():
    """
    First code all the factor classes one wants to use.
    """
    class Returns_To_Last_Month(CustomFactor):
        inputs = [USEquityPricing.close]
        window_length = 200
        def compute(self, today, assets, out, close):
            # close[0] is the oldest row in the window, close[-40] is 40 rows from the end
            out[:] = (close[-40] - close[0]) / close[0]

    class Returns_To_Current_Month(CustomFactor):
        inputs = [USEquityPricing.close]
        window_length = 200
        def compute(self, today, assets, out, close):
            out[:] = (close[-20] - close[0]) / close[0]

    class Returns_To_Current_Week(CustomFactor):
        inputs = [USEquityPricing.close]
        window_length = 20
        def compute(self, today, assets, out, close):
            out[:] = (close[-5] - close[0]) / close[0]

    # Finally, return a dict of factors and associated names
    # One can easily modify desired dict of factors with comments
    return {
        'Returns_To_Last_Month': Returns_To_Last_Month,
        # 'Returns_To_Current_Month': Returns_To_Current_Month,
        'Returns_To_Current_Week': Returns_To_Current_Week,
    }
Now that we have an iterable of factors, it can be used to build a pipeline definition. Maybe like this:
from quantopian.pipeline import Pipeline
from quantopian.pipeline.filters import QTradableStocksUS

def make_pipeline():
    # Define universe
    universe = QTradableStocksUS()

    # Define base pipeline
    pipe = Pipeline(columns={}, screen=universe)

    # Get a dict of the factors we want to use
    factors = make_factors()

    # Add zscores to our pipeline output
    # Create a score by summing zscores
    score = 0
    for factor_name, factor_class in factors.items():
        zscore = factor_class(mask=universe).zscore(mask=universe)
        name = factor_name + '_zscore'
        # Append all our factor zscores to the pipeline
        pipe.add(zscore, name)
        # Sum up the zscores to get a total score
        score += zscore

    # Append score to the pipeline
    pipe.add(score, 'score')
    return pipe
If one has a large number of factors to test, iterating through a factor dict to create the pipeline definition can mean a lot less typing, and potentially far fewer errors, than entering and modifying each factor by hand. That said, it does sacrifice a bit of readability. This approach is typically only used when looking at many (more than 10) factors and applying consistent logic (e.g. summing zscores). Otherwise it may be just as easy to cut and paste. Everyone has their own coding approach and style. This is just another tool for one's toolbox.
The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.
Dean, thank you very much for your answer!! I'm new to the Quantopian world and your answer helped me a lot in reading code that I was struggling with!!
However, I am having serious difficulties with the part where you iterate over your CustomFactors, at these lines:
for factor_name, factor_class in factors.items():
    zscore = factor_class(mask=universe).zscore(mask=universe)
    name = factor_name + '_zscore'
Why do you have to iterate over your custom factors before putting them in the pipeline, while standard factors (like SimpleMovingAverage) can be put directly in the pipeline? And why do you use mask twice?
In the second notebook of this link:
https://www.quantopian.com/posts/new-video-learn-from-the-experts-ep-1-full-algorithm-creation-with-vedran-rusman
Instead of factor_class he uses (f,w). Can you explain to me why this happens? Is he using a tuple (f, w) instead of one value (factor_class) to iterate? What is happening?
I was thinking about making zscores relative to each sector, instead of making zscores relative to all the companies. What is the most effective way of doing this?
Thank you for everything and sorry for all these newbie questions, but I am very excited to make my first algorithm :D
I'll try to answer the questions above...
1. Why do you have to iterate over your custom factors before putting them in the pipeline, while standard factors (like SimpleMovingAverage) can be put directly in the pipeline? And why do you use mask twice?
The reason one iterates over the dict of custom factors is simply to save typing them out individually. Assume the factors in our factors dict are Returns_To_Last_Month, Returns_To_Current_Month, and Returns_To_Current_Week. Then the two chunks of code below will produce the exact same pipeline definition.
for factor_name, factor_class in factors.items():
    zscore = factor_class(mask=universe).zscore(mask=universe)
    name = factor_name + '_zscore'
    # Append all our factor zscores to the pipeline
    pipe.add(zscore, name)
    # Sum up the zscores to get a total score
    score += zscore

zscore_1 = Returns_To_Last_Month(mask=universe).zscore(mask=universe)
zscore_2 = Returns_To_Current_Month(mask=universe).zscore(mask=universe)
zscore_3 = Returns_To_Current_Week(mask=universe).zscore(mask=universe)
pipe.add(zscore_1, "Returns_To_Last_Month_zscore")
pipe.add(zscore_2, "Returns_To_Current_Month_zscore")
pipe.add(zscore_3, "Returns_To_Current_Week_zscore")
score = zscore_1 + zscore_2 + zscore_3
The first chunk of code iterates through the dict and builds the pipeline procedurally. The second chunk of code does the same thing but types out each line explicitly. In this case, the first approach uses 5 lines of code (excluding comments). The second approach takes 7 lines. If one had 30 factors, the first approach would still use 5 lines, whereas the second approach would use 61 lines (two per factor plus one for the sum). With a lot of factors that's a lot of extra typing. Moreover, that's a lot of potential places to make a mistake. The first approach only requires a single change to the factors dict, and one can be confident it will work the same with 3 factors or 30. The second approach is, arguably, more readable and 'self documenting'. So, I personally would always use the second approach unless one has a lot of factors and wishes to try many different permutations. Again, it's just to save typing and is a matter of personal preference.
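To make the equivalence concrete without needing Quantopian, here is a minimal sketch in plain Python. The "factors" are ordinary functions and the "pipeline" is just a dict of columns; both are stand-ins assumed for illustration, not the real Pipeline API.

```python
# Stand-in "factors": plain functions instead of CustomFactor classes
# (an assumption so the sketch runs without Quantopian).
def returns_a(prices):
    return prices[-1] / prices[0] - 1

def returns_b(prices):
    return prices[-1] / prices[-2] - 1

factors = {'returns_a': returns_a, 'returns_b': returns_b}
prices = [100.0, 110.0, 121.0]

# Approach 1: build the columns by iterating over the dict.
looped = {}
for factor_name, factor_func in factors.items():
    looped[factor_name + '_value'] = factor_func(prices)

# Approach 2: type every column out by hand.
explicit = {
    'returns_a_value': returns_a(prices),
    'returns_b_value': returns_b(prices),
}

print(looped == explicit)  # both approaches build the same columns
```

Adding a third factor means one new dict entry for the first approach, but one new line per step for the second.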
Now, for the second part of the question "why do you use mask twice?". Good observation! It is a bit redundant in this case. Masks limit the 'universe' of assets given to a factor. Without a mask, a factor will calculate a value for every asset in the Quantopian database. This is roughly 9000 assets including ETFs and ETNs. If one is only ever concerned with a smaller subset of all these assets, using a mask will reduce the calculation time. Most of the built-in factors are quite fast and one probably won't notice the difference. However, sometimes custom factors can be quite compute intensive and adding a mask can often speed things up. There isn't typically a downside to using a mask. That was why a mask was used for factor_class(mask=universe).
A second reason to use a mask is to explicitly define the universe for methods. Notice the first case limited the universe for a factor. This second case limits the universe for a method. The example above uses zscore(mask=universe). The zscore method calculates the mean and standard deviation of 'universe' to determine each individual zscore. Without a mask, it would use all the factor values as the 'universe'. Most often one wants to zscore against like assets and not every asset (like those ETFs and ETNs included in the entire Quantopian database). The mask in this case isn't about speed but will actually impact the calculations. In this case though, since we already masked the factor, the same mask isn't required for the method. I usually include it simply to be explicit about what universe we are zscoring against.
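A small plain-Python sketch of why the mask changes the result and not just the speed. The zscore function here is a hypothetical stand-in using the population standard deviation, and the values are made up; it is not Quantopian's implementation.

```python
import math

def zscore(values, mask):
    """Z-score the values where mask is True; return None elsewhere."""
    kept = [v for v, keep in zip(values, mask) if keep]
    mean = sum(kept) / len(kept)
    std = math.sqrt(sum((v - mean) ** 2 for v in kept) / len(kept))
    return [(v - mean) / std if keep else None for v, keep in zip(values, mask)]

# Hypothetical factor values: three stocks, then two ETFs with much larger values.
values = [1.0, 2.0, 3.0, 50.0, 60.0]
everything = [True] * len(values)
stocks_only = [True, True, True, False, False]

z_all = zscore(values, everything)      # mean and std include the ETFs
z_stocks = zscore(values, stocks_only)  # mean and std use the stocks only

print(z_all[1], z_stocks[1])
```

The middle stock is exactly average among the stocks (zscore 0.0) but well below average once the two ETFs are included in the mean and standard deviation.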
One word of caution while on the topic of masks for zscore. The zscore method handles nans gracefully; however, it does not handle infinite values well. If there is a chance that any factor values would ever be +inf or -inf, then include a mask like this:
my_factor_zscore = my_factor.zscore(mask=my_factor.isfinite())
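To see why a single infinite value is a problem, here is a rough plain-Python sketch. The zscore function is again a hypothetical stand-in (population standard deviation, no nan handling), not the Quantopian method, but the arithmetic issue is the same: one inf makes the mean infinite, and everything computed from it turns into nan.

```python
import math

def zscore(values, mask):
    """Z-score the values where mask is True; return nan elsewhere."""
    kept = [v for v, keep in zip(values, mask) if keep]
    mean = sum(kept) / len(kept)
    std = math.sqrt(sum((v - mean) * (v - mean) for v in kept) / len(kept))
    return [(v - mean) / std if keep else float('nan')
            for v, keep in zip(values, mask)]

values = [1.0, 2.0, 3.0, float('inf')]

# No effective mask: the inf poisons the mean and standard deviation.
unmasked = zscore(values, [True] * len(values))

# Mask out non-finite values, analogous to mask=my_factor.isfinite().
finite_only = zscore(values, [math.isfinite(v) for v in values])

print(unmasked)
print(finite_only)
```

Without the mask every zscore comes out as nan; with it, the three finite values get sensible zscores and only the infinite one is dropped.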
2. Instead of factor_class he uses (f,w). Can you explain to me why this happens? Is he using a tuple (f, w) instead of one value (factor_class) to iterate? What is happening?
Everyone has a slightly different approach to coding a make_factors function and to what that function returns. My make_factors example above simply created a dict which held a factor name and a factor class. The make_factors function in the referenced notebook adds one more piece of information: a weight. It stores a factor name (as the key) and associates it with a tuple holding the factor class and a weight. Using a tuple is a way to store two pieces of data per dict entry instead of just one. Iterating over the factors dict then produces the name, the class, and the weight.
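The unpacking pattern looks roughly like the sketch below. This is not the notebook's code: the factors are plain stand-in functions and the weights are made up, purely to show how a dict of name -> (factor, weight) tuples is iterated.

```python
# Stand-in factors: plain functions instead of CustomFactor classes (assumption).
def momentum(prices):
    return prices[-1] / prices[0] - 1

def reversal(prices):
    return -(prices[-1] / prices[-2] - 1)

def make_factors():
    # name -> (factor, weight); the weights are invented for illustration
    return {
        'momentum': (momentum, 2.0),
        'reversal': (reversal, 1.0),
    }

prices = [100.0, 120.0, 110.0]

score = 0.0
# Each dict value is a 2-tuple, so it can be unpacked directly in the for statement.
for name, (f, w) in make_factors().items():
    score += w * f(prices)

print(score)
```

Writing `for name, (f, w) in ...` is just shorthand for taking the tuple out of the dict and then splitting it into its two parts.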
I want to point out something which I hope is apparent by now. Creating a make_factors function will make your code harder to read and understand. Period. Unless there is a compelling reason, simply define all your factors explicitly. The resulting pipelines will be identical.
3. I was thinking about making zscores relative to each sector, instead of making zscore relative to all the companies. What is the most effective way of doing this?
The zscore method has a parameter specifically for this purpose called groupby (see the docs). So something like the following will zscore relative to the sector a stock belongs to and not relative to all stocks in the universe.
from quantopian.pipeline.classifiers.morningstar import Sector
zscore = factor_class(mask=universe).zscore(groupby=Sector())
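Conceptually, grouping means the mean and standard deviation are computed separately within each sector. A minimal plain-Python sketch of that idea follows; the function name, the sector labels, and the values are all hypothetical, and this is not how Quantopian implements groupby.

```python
import math
from collections import defaultdict

def zscore_by_group(values, groups):
    """Z-score each value against the other values sharing its group label."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)

    # Per-group mean and population standard deviation.
    # Note: a group with a single member (or identical values) has std 0
    # and would raise ZeroDivisionError below.
    stats = {}
    for g, vs in by_group.items():
        mean = sum(vs) / len(vs)
        std = math.sqrt(sum((v - mean) ** 2 for v in vs) / len(vs))
        stats[g] = (mean, std)

    return [(v - stats[g][0]) / stats[g][1] for v, g in zip(values, groups)]

values = [1.0, 3.0, 10.0, 30.0]
sectors = ['tech', 'tech', 'energy', 'energy']

z = zscore_by_group(values, sectors)
print(z)  # [-1.0, 1.0, -1.0, 1.0]
```

Even though the energy values are ten times larger, each stock is scored only against its own sector, so both sectors end up with the same spread of zscores.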
Hope those answers helped.
Dan, thank you so much for the answers, they are mind-blowing for me! Your answers are really helping me to improve quickly and I am making huge progress in developing my factors now. However, I am still having problems running the "make_pipeline()" function :(
After the
return pipe
inside the make_pipeline() function, I tried to use this code to run the pipeline:
run_pipeline(make_pipeline(), start_date = start, end_date = end)
But it produces this error:
--------------------------------------------------------------------------- TypeError
Traceback (most recent call last) in ()
----> 1 run_pipeline(make_pipeline(), start_date = start, end_date = end)
in make_pipeline()
10
11 for name, value in factors.items():
---> 12 zscore = value(mask=universe).zscore(mask=universe, groupby = Sector())
13 name_zscore = name + '_zscore'
14
TypeError: 'Latest' object is not callable
I even tried to make the make_pipeline() function return this:
return run_pipeline(pipe, start_date = start, end_date = end)
instead of
return pipe
as you did in your algorithms. But nothing happened in this case.
After defining the make_pipeline() function, how can I run this pipeline in the notebook? What am I doing wrong in the cases above? And why, if I try to execute this command:
run_pipeline(pipe, start_date = start, end_date = end)
after the return, do I get an error saying that 'pipe' is not defined?
PS: Sorry for so many questions