algo share

Here's an algo I've been working on, mainly to develop a framework and to learn. Comments, questions, improvements welcome.

I did not conceive of most of the factors (and some of them may be duds). Credit to the original authors.


What an impressive effort! Looks nice during the crash too. What, if any, are your reservations?

Nice one, Grant! Try decreasing your position size (increase the number of stocks traded) and see if your Sharpe increases and portfolio volatility decreases. If it does, then it is a good sign that the algo is able to adapt under a more diversified positioning. Best!

Thanks Zenothestoic & James -

I'm mainly interested in the overall architecture at this point, and isolating a handful of sensible "control" parameters that would apply to any algo, and then figuring out how to set them in a systematic way (without "over-fitting").

One thing I think I uncovered is that Pipeline has a limit on the number of custom factors it will support. This is why I have two pipelines for the factors--really ugly, so any insight in this area would be helpful.

For parameters, I have:

#############################  
# algo settings

WIN_LIMIT = 0 # winsorize limit in factor preprocess function  
FACTOR_AVG_WINDOW = 3 # window length of SimpleMovingAverage applied to each factor

# Optimize API constraints  
MAX_POSITION_SIZE = 0.01 # set to 0.01 for ~100 positions  
BETA_EXPOSURE = 0  
USE_MaxTurnover = True # set to True to use Optimize API MaxTurnover constraint  
MIN_TURN = 0.15 # Optimize API MaxTurnover constraint (if optimize fails, incrementally higher constraints will be attempted)

#############################  

I'm thinking I should probably just set a fixed Optimize API beta constraint of zero, and not be tempted to fiddle with it (e.g. make "beta bets" for contest entries).
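For concreteness, here's a minimal sketch of how settings like these might feed the Optimize API in rebalance (not the exact code in the attached backtest; context.combined_alpha and the beta Pipeline column are assumptions):

import quantopian.optimize as opt
from quantopian.algorithm import order_optimal_portfolio

def rebalance(context, data):
    alpha = context.combined_alpha    # assumed: combined factor scores, indexed by asset

    objective = opt.MaximizeAlpha(alpha)

    constraints = [
        opt.MaxGrossExposure(1.0),
        opt.DollarNeutral(),
        opt.PositionConcentration.with_equal_bounds(-MAX_POSITION_SIZE, MAX_POSITION_SIZE),
        # assumed: context.beta is a one-column ('beta') DataFrame from Pipeline;
        # the +/- 0.05 band around BETA_EXPOSURE is an arbitrary choice here
        opt.FactorExposure(context.beta,
                           min_exposures={'beta': BETA_EXPOSURE - 0.05},
                           max_exposures={'beta': BETA_EXPOSURE + 0.05}),
    ]
    if USE_MaxTurnover:
        # the actual algo retries with incrementally looser limits if the optimizer fails
        constraints.append(opt.MaxTurnover(MIN_TURN))

    order_optimal_portfolio(objective=objective, constraints=constraints)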

I wonder whether Quantopian would say that, given the number of factors used, it is not possible to construct a fundamental argument as to why this works? And thus reject the algorithm for opacity?

I have no idea. Perhaps they could comment.

How many staff do they have now and what do they all do? Given the in house skills they have one is bound to wonder whether they are using mostly in house algos for their fund?

@ Zenothestoic -

Regarding your concern about "opacity" the relevant requirement from https://www.quantopian.com/get-funded is:

Strategic Intent

One of the most common ways to overfit is to just try a ton of
variants until one works. We’re looking for algorithms which have a
clear intent in the idea as opposed to algorithms which just happen to
work with no explanation. We’d like you to be able to explain the
original idea and intent of the strategy to us before we license it.

In this case, I think the idea would be for each factor to have some sense behind it, along with a decent level of stand-alone predictive power (e.g. analyzed with Alphalens). The combination technique (sum of z-score normalized factors) is simple, and so I wouldn't think it would require any justification beyond describing it.
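To make the combination technique concrete, it amounts to something like this (a sketch, assuming the factors come out of Pipeline as columns of a DataFrame, here called factor_df):

import pandas as pd

def combine_factors(factor_df):
    # z-score each factor cross-sectionally, then take the equal-weight sum
    z = (factor_df - factor_df.mean()) / factor_df.std()
    combined = z.sum(axis=1)
    # demean and scale so the result is a unit-leverage alpha vector
    combined = combined - combined.mean()
    return combined / combined.abs().sum()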

The large number of factors is more of an experiment in cobbling together factors. I have seen, however, a benefit in smoothing out returns, and extending to a larger number of stocks with added factors, although I have not gone about this systematically.

Hi Grant,

One thing I think I uncovered is that Pipeline has a limit on the number of custom factors it will support. This is why I have two pipelines for the factors--really ugly, so any insight in this area would be helpful.

To overcome this limitation and perhaps be able to run just one pipeline, one idea is to group factors with similar "themes" by say, adding or averaging them into one factor. For example, I see a lot of growth factors, you can combine them into one. Doing this will also make explaining strategic intent easier by saying for example that you have discovered that when you combine growth factors with value factors and momentum indicators, it creates a consistent, predictive alpha.

One of the pitfalls of too many factors is that, more often than not, most of them are just noise cancelling each other out, and only those with "true" predictive power churn out the signal.

@Grant,

I'm very impressed as well, and grateful you're sharing your work! Quite generous of you I think.

I'm curious to hear your view on 'model complexity'. To me, the benefit of adding a few uncorrelated factors is pretty clear - higher IR, higher risk-adjusted IC, lower volatility, etc. - but I wonder if, at some point, there are negative effects when the model gets overly complex (to me, it's very difficult to understand what the heck is going on if I add too many factors).

In essence, if we assume that ALL models are wrong, and at least slightly fitted on noise (or certain market regimes), then if we combine them into a 'mega-factor' (e.g. as with the 101 Alpha project), then wouldn't the 'noise fitting' increase as well in the combined model? I hope I'm wrong here...

Also, do you reckon that the trick to use factors that have different ideal holding periods (e.g. 1D for one factor, 3D for another, and 5D for yet another) in the same strategy, is to use the SMA for each factor with an ideal holding period longer than 1D (i.e. SMA5 for factors that are best held for 5D)?

Or am I thinking about this incorrectly?

Interesting strategy, couple of thoughts:

  1. The large set of factors would help smooth returns as you found, but can also conceal the signal and introduce noise. For example, if we added a factor that simply chooses a random number between 1 and 100, that would "smooth" returns, but it does so by weakening the (possibly inaccurate) signals.
  2. Some of these factors may be highly correlated, which introduces a problem of multicollinearity and inaccurate estimates for out-of-sample data. It would be interesting to see a heatmap of the correlations between these factors.
  3. Some of these factors likely have competing signals. For example, there are both 2-yr and 8-yr growth ratios. Even if they generated alpha by themselves, they would have different prediction horizons (i.e. the 8-yr growth might do better with less frequent rebalancing). Instead of combining them into one strategy, I think having separate strategies with a common theme, or finding some function of the two (such as 2-yr growth / 8-yr growth), would be more robust.
  4. Assuming not all the factors are useful, a straight z-score combination might weaken the signal from the actual predictive ones. I've fed most of the factors in your set into a machine learning classifier to assess their accuracy in predicting long/short signals during the time period:

    A very strong signal from peg_ratio for example, may weaken the predictive signal from growthscore if using a simple combination.
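For anyone wanting to reproduce something along these lines, here is a rough sketch of the kind of classification I mean (factor_df and fwd_returns are placeholders for a DataFrame of factor values and aligned forward returns; not the exact code I ran):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# factor_df: rows = (date, asset) observations, columns = factor values
# fwd_returns: forward returns aligned to the same rows
X = factor_df.fillna(0).values
y = (fwd_returns > 0).astype(int).values   # 1 = long signal, 0 = short signal

clf = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
clf.fit(X, y)

importances = pd.Series(clf.feature_importances_, index=factor_df.columns)
print(importances.sort_values(ascending=False))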

Does the machine learning algorithm produce anything better or different than the simple approach used in Alphalens? Is this PCA? I assume the linear regression in Alphalens could be adapted to assess the contribution of each factor individually, the same way it currently does with the guideline factors it uses to assess alpha?

Assuming 2007 and 2008 are truly "out of sample", as opposed to merely not being shown in the above test, then out-of-sample performance looks pretty impressive when you run the algo through this period of turmoil. What does 2018 look like? Perhaps Quantopian will be kind enough to tell us?

Interesting to see "growth", "debt/equity" and "momentum" scoring highest. Unsurprisingly.

All -

Thanks for the stimulating feedback. I need to mull it over a bit before responding.

Overall, I'm thinking that, along the lines of the 101 Alphas there should be some systematic way of dealing with the "proliferation of alphas – albeit mostly faint and ephemeral." I figure the probability of my discovering some miracle factor, and getting paid by Q for it is pretty slim; they are looking for full-up, fancy, multi-factor stand-alone algos.

@ James - Thanks for the idea of amalgamating similar factors to reduce the number of factors, as a work-around to the apparent Pipeline limitation (I'd thought of this, but then it really mucks up the idea of having individual alpha factors described on https://blog.quantopian.com/a-professional-quant-equity-workflow/) . I've pinged Q support for guidance, since there seems to be some undocumented limit to the number of custom factors supported by Pipeline.

I figure the probability of my discovering some miracle factor, and getting paid by Q for it is pretty slim; they are looking for full-up, fancy, multi-factor stand-alone algos.

I think you are right. Whatever the rights and wrongs of what you have done with this algo its performance seems excellent in good times and bad. "Can it last" is I suppose the main question which comes to mind? I hope so!

My suggestion would be to get rid of the highly correlated factors by either combining them into one factor or by picking the best performing factor of the highly correlated group. Without testing your factors, I think ROA growth 8yr and ROA growth 2yr are highly correlated and probably ROIC growth is too. Also, when looking at Adam W's graph of feature importances the last three weakest features probably should get dropped. I doubt those features would hold up if you tested each of them in Alphalens.

So is there a standard "recipe" for dealing with correlations? As an example, say I have three factors, A, B, & C. If A & B are perfectly correlated (essentially the same factor), and C is uncorrelated to both A and B, then it is simple: drop either A or B, and make the portfolio A & C (or B & C). But how would one handle things in the non-ideal case of some degree of pairwise correlation, A-to-B, A-to-C, and B-to-C? How should the portfolio be constructed?

I don't think there is a standard way of approaching this, but here are two ideas:

  1. Keep the factors that provide the most signal (measured by IC, alpha, and other metrics you deem important) with the minimal correlation to other factors, while dropping the rest.
  2. Keep all factors, but use a different weighting strategy when combining the factors. A simple equal-weight combination would overweight the correlated factors, so doing a machine learning classification, PCA, factor analysis, decision trees, etc would be very useful.

The second approach would likely have better predictability, since there is still some information even in highly correlated factors (assuming not perfectly correlated). However the first approach would probably be more robust with less risk of overfitting, thus doing better in different market conditions. It is ultimately up to your best judgment whether to keep correlated factors, though if you are backtesting over such a long horizon, market regimes will likely change and so robustness may be more important. If you were to backtest over shorter horizons (i.e. last 2 years such as for the contest), I think predictability may be better at the potential cost of long-term profitability.
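As a rough sketch of approach 1 above (including the correlation heatmap mentioned earlier) - factor_df is a DataFrame of factor values and ic a Series of per-factor information coefficients computed elsewhere, e.g. in Alphalens:

import seaborn as sns
import matplotlib.pyplot as plt

corr = factor_df.corr()
sns.heatmap(corr, cmap='coolwarm', center=0)   # pairwise factor correlations
plt.show()

# Greedy selection: walk factors from strongest to weakest IC, keeping a factor
# only if its correlation to everything already kept is modest.
MAX_CORR = 0.7
kept = []
for name in ic.abs().sort_values(ascending=False).index:
    if all(abs(corr.loc[name, other]) < MAX_CORR for other in kept):
        kept.append(name)
print(kept)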

Since your algo was trained until November 2017, we can treat the last year as essentially out-of-sample data. (A few factors had to be removed as they referenced Factset data which apparently is no longer updated, but this shouldn't have too much of an impact given the large set of factors.)

The algo still performs very well, though note that the out-of-sample 12-month Sharpe ratio (1.5) has dropped from the in-sample 2017 Sharpe (2.3), and annual returns have dropped slightly as well (7.6% vs. 12.7% in-sample). The further out-of-sample it gets, the more the variability of expected returns would increase. Addressing the correlated factors could help significantly and make the algo perform even better.

Here's a modified version of the algo weighted by decision trees. The returns seem to be more robust than the base algo, for example the Sharpe is more stable and reverts around the mean rather than following a trend. It could be interesting to see the results of directly addressing the correlations of factors and dropping and/or combining them.
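In rough outline (a simplified sketch, not the exact code; factor_df and fwd_returns are placeholder names), the weighting is along these lines:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# factor_df: factor values per (date, asset); fwd_returns: aligned forward returns
tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(factor_df.fillna(0), fwd_returns)

weights = pd.Series(tree.feature_importances_, index=factor_df.columns)
weights = weights / weights.sum()                 # normalize weights to sum to 1

# importance-weighted sum of z-scored factors instead of an equal-weight sum
z = (factor_df - factor_df.mean()) / factor_df.std()
combined_alpha = (z * weights).sum(axis=1)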

The pre-preprocessing: is there a simple way to forward-fill all of the inputs (everything after out in the compute signature) before computations are done on them? Example:

    def compute(self, today, assets, out, cost_of_revenue, sales):

I've had some good results with nanfill, although in this case it would be so much work that I haven't tried it.
Maybe Q could offer a switch for forward-filled fundamentals, replacing NaN with the previous value.
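One way it might be done inside compute - a sketch of a column-wise forward fill on the 2D (days x assets) input arrays, not the nanfill utility itself:

import numpy as np

def ffill_2d(arr):
    """Forward-fill NaNs down each column of a (days x assets) array."""
    arr = arr.copy()
    idx = np.where(np.isnan(arr), 0, np.arange(arr.shape[0])[:, None])
    np.maximum.accumulate(idx, axis=0, out=idx)
    return arr[idx, np.arange(arr.shape[1])[None, :]]

# usage inside a CustomFactor:
#     def compute(self, today, assets, out, cost_of_revenue, sales):
#         cost_of_revenue = ffill_2d(cost_of_revenue)
#         sales = ffill_2d(sales)
#         out[:] = cost_of_revenue[-1] / sales[-1]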

The further out-of-sample it gets, the more the variability of expected returns would increase.

How very paradoxical and typical of systematic financial market trading. A bunch of intelligent people go to extraordinary lengths to produce Alphalens, Zipline and Pyfolio.

The stated aim is to produce market neutral algos which fare well (smoothly) in up and down markets.

To that end various in/out of sample routines are suggested to make sure that will continue to be the case. It is proposed that fundamental factors should be used as predictors. What more sensible sounding scheme than that? Over the long terms earnings growth and a strong balance sheet are all that counts. Without those a stock will eventually wither and die.

And yet it is still suggested that "the further out-of-sample it gets, the more the variability of expected returns would increase."

Are there no constants in financial markets? Nothing we can count on? Are we eternally doomed to design complex systems to fail? It would seem so. A system will have its time in the sun and must then be consigned to the dustbin. That is not at all what was envisaged. The whole idea of market neutral is that it should survive and prosper. Equal amounts of long and short. Equal or at least limited exposure to any one sector and trading style. A carefully curated universe of stocks (well that at least has to change day by day).

Do we care? Should we care ?

Well we humans are such short term animals probably not. As long as our algo lasts long enough for us to fill our pockets on fees we should be more than happy.

Should we care to ponder immortality in the context of trading and investment we might be forced to adopt a different and less fancy approach. Who is to say that Morningstar sector definitions have much validity today let alone tomorrow? Remember how the staid telephone utility companies suddenly morphed into debt crippled monsters back in the early 2000s. And the coded definitions of momentum, value or other styles. Do they hold? Are they universal?

Am I a deep cynic? A Jeremiah? If so, do I have reason to be or am I just inadequate at finding the golden key to systematic immortality?

@ Zenothestoic -

I was a bit surprised by the push by Q for users to do a deep dive into fundamentals. I figured that all of the data are in the public domain (or at least readily accessible to the financial industry), and are lagging indicators. Also, the data are relatively low-frequency (quarterly/annual). So one would really want something like a ~50 year backtest horizon, versus < 20 years (and without the longer time horizon, there could be a risk of over-fitting, due to the paucity of data).

The other question in my mind is how accurately fundamentals express the true state of a business. Executives probably know better than hedge funds how fundamentals impact share price and do everything they can to present a pretty picture. If there are risks that aren't required by law to be captured in the metrics, the risks are likely to be obscured (and whatever risks are obvious in the fundamentals have already been priced by the market).

Perhaps there are still scraps of inefficiency left in fundamentals, but it isn't intuitive. At a high level, what is the "strategic intent" for using fundamentals in the first place?

Even Bachelier, in his 1900 thesis, showed that the dispersion of prices grows with the square root of time. So, we should not be surprised if variability does increase with time.

Grant
A mistake I have always made is not to give the customer what he wants. The question here at Quantopian is can any of us provide them with the all weather miracle they and Stevie Wonder desire?

If so (and if you can put up with working at the pace of a constipated snail in the research environment) then do it.

I have never learnt that lesson. I have never striven to give any customer what they seek. My mistake. At the end of the day that is all that is necessary here. Whether I personally believe in the goodness and rightness of their approach should take a back seat to my desire to sell them an algorithm.

If Alphalens actually ran properly I might try for the first time in many years to do the sensible thing, shut up, and give them what they have asked for.

@grant - Your algo share has really thrown the cat amongst the pigeons. Thank you. I have been struggling to find a way to manage a large number of custom factors (in my case multiple instances of same factor) . I've now relocated my custom factor into your framework!

Being a complete novice, I'm just repeating 30 copies of my custom factor (with the different parameters hard coded in each one) in place of your alpha factors.

I just wonder if either you or one of the other proper developers here might know a more graceful way of achieving this?

I attach a notebook with some code provided by @cal showing how to create multiple instances of same custom factor (the custom factor is just a dummy).

Thanks again.

@ Adam W. -

Thanks for the example of weight by decision trees (https://www.quantopian.com/posts/algo-share#5beb8c774aada043d452edec).

Would you be willing to share the code? Or at least provide an outline of how you do it?

@ Joakim -

Also, do you reckon that the trick to use factors that have different ideal holding periods (e.g. 1D for one factor, 3D for another, and 5D for yet another) in the same strategy, is to use the SMA for each factor with an ideal holding period longer than 1D (i.e. SMA5 for factors that are best held for 5D)?

Or am I thinking about this incorrectly?

I got the idea for using the SMA from Thomas W.'s recent post (https://www.quantopian.com/posts/an-updated-method-to-analyze-alpha-factors). I think you are correct that each factor could have its own SMA window length (or more generally low-pass filter frequency cut-off). There's also the possibility of smoothing the combined alpha, and the output of the Optimize API.

I think the idea of doing daily portfolio updates is the way to go, but I'm concerned that the Optimize API is adding unproductive noise. It seems that one would want to smooth the final output of the Optimize API. The problem is that the optimizer is outside of Pipeline, and so there is no access to a trailing window; one has to be accumulated in context.
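As a concrete (untested) sketch of per-factor smoothing inside Pipeline - assuming a dict mapping factor name to factor class, and that each factor sets window_safe = True so it can be fed into SimpleMovingAverage:

from quantopian.pipeline.factors import SimpleMovingAverage

# hypothetical per-factor smoothing windows, e.g. tied to ideal holding periods
SMOOTH_WINDOWS = {'Factor1': 1, 'Factor2': 3, 'Factor3': 5}

def make_smoothed_factors(factors, universe):
    smoothed = {}
    for name, factor_cls in factors.items():
        f = factor_cls(mask=universe)
        win = SMOOTH_WINDOWS.get(name, 1)
        if win > 1:
            f = SimpleMovingAverage(inputs=[f], window_length=win, mask=universe)
        smoothed[name] = f
    return smoothed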

Did you encounter out-of-memory errors? I did, with my many fundamental factors.

@ Leo c - I have not hit memory limitations.

@ All - if anyone has insight into this error please let me know:

https://www.quantopian.com/posts/pipeline-error-valueerror-too-many-inputs-limit-on-number-of-custom-factors

Breaking up the code into separate Pipelines just to support a relatively large number of factors does not seem like it should be necessary. Perhaps I have a mistake in my implementation?

@ Zenothestoic -

How many staff do they have now and what do they all do? Given the in house skills they have one is bound to wonder whether they are using mostly in house algos for their fund?

I'm not sure it matters. The Q contest is still running and paying out, and reportedly, algos are getting funded (see https://www.quantopian.com/posts/15-contest-entrants-invited-to-license-their-strategies-to-quantopian). Eventually one might expect a saturation point, but we aren't there yet.

I think more importantly, there is a question of what is already in the fund, and how users can evaluate their algos so as not to be rejected due to a too strong correlation with existing fund algos. There used to be a "uniqueness" requirement on https://www.quantopian.com/get-funded, but it got dropped. However, it is reasonable to assume that one is implied.

Recently there was guidance on the need for more fundamentals-based algos. I'm not sure if the request was for algos based solely on fundamentals, or if it was just a prod to dive into the Factset data, and combine new fundamental factors with existing technical and alternative ones.

@Adam

May I know how you compute the feature importances or the correlation among the factors?

One potential weighting scheme would be to compare on a daily basis each alpha vector to the remaining naively combined alpha factors with a similarity metric (e.g. cosine similarity). Then the factors could be weighted by inverse similarity:

alpha_combined = alpha_1/abs(similarity_1) + alpha_2/abs(similarity_2) + ... + alpha_N/abs(similarity_N)

or

alpha_combined = alpha_1/similarity_1 + alpha_2/similarity_2 + ... + alpha_N/similarity_N

Computationally, this appears to be easy-peasy. However, it just occurred to me and I have no idea if it makes any sense whatsoever.
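In numpy terms, the first variant would be something like this (a sketch; note the small epsilon to guard against near-zero similarities):

import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def inverse_similarity_combine(alphas):
    """alphas: 2D array, one row per z-scored alpha vector (same assets in each row)."""
    combined = alphas.sum(axis=0)
    # similarity of each alpha vector to the naive sum of the remaining ones
    sims = np.array([cosine_sim(a, combined - a) for a in alphas])
    weights = 1.0 / (np.abs(sims) + 1e-6)
    return (alphas * weights[:, None]).sum(axis=0)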

I see that it is equal factor weighting. May I know how to add weights to certain factors, i.e. 30% ROA, 20% STA, and 50% on the rest of the factors?

I am thinking it goes in the "rebalance" function?

Thanks

@ Leo c , I believe you can assign specific weights to each factor as shown below within "def make_factors()". For additional information, Grant had posted this thread.

        return {  
            'ROA':              (ROA,0.30),  
            'STA':              (STA,0.20),  
        }  

FYI, I morphed my basic architecture to do alpha combination in before_trading_start (see https://www.quantopian.com/posts/alpha-factor-combination-in-pipeline-how-to-fancify-it#5bf525d4ab895f004b7011ab). My sense is that this is the place to do it, versus within Pipeline (the main reason is that one gets a full 5 minutes per trading day to do computations, should there be a need to do some computational fanciness; the other reason is that one can use normal Python versus having to use the Q proprietary Pipeline API--learning the nuances of the latter has limited universal applicability, and support is limited to the Q platform and its users).

@ Leo

The simplest way is to just do something like this:

def make_pipeline():

    # Dictionary of factors
    factors = make_factors()

    alpha = B1*factors['Factor1'] + B2*factors['Factor2'] + ...

    return Pipeline(columns={'alpha': alpha}, **kwargs)

where (Bn) are the weights. I agree with Grant's post and also do more complicated computations in before_trading_start, but for simple algos I've found that doing the alpha combination within Pipeline and returning only the alpha column (instead of every factor) speeds up the backtest a bit.

Been fairly busy these days, will share some code on ML weighting when I get the chance.

@Grant @Adam @Daniel Many thanks

Amended post. I have one question on my custom factor testing. I tried to implement it in Grant's code but I get an error. How do I make it right?

def make_factors(rng):  
    class LastTwoQuarters(CustomFactor):  
        # Get the last 2 reported values of the given asof_date + field.  
        outputs = ['q2', 'q1']

        window_length = 130

        def compute(self, today, assets, out, asof_date, values):  
            for column_ix in range(asof_date.shape[1]):  
                _, unique_indices = np.unique(asof_date[:, column_ix], return_index=True)  
                quarterly_values = values[unique_indices, column_ix]  
                if len(quarterly_values) < 2:  
                    quarterly_values = np.hstack([  
                            np.repeat([np.nan], 2 - len(quarterly_values)),  
                            quarterly_values,  
                        ])  
                quarterly_values = quarterly_values[-2:]
                out[column_ix] = quarterly_values

    class LastThreeQuarters(CustomFactor):  
        # Get the last 3 reported values of the given asof_date + field.  
        outputs = ['q3', 'q2', 'q1']

        window_length = 130 + 65

        def compute(self, today, assets, out, asof_date, values):  
            for column_ix in range(asof_date.shape[1]):  
                _, unique_indices = np.unique(asof_date[:, column_ix], return_index=True)  
                quarterly_values = values[unique_indices, column_ix]  
                if len(quarterly_values) < 3:  
                    quarterly_values = np.hstack([  
                            np.repeat([np.nan], 3 - len(quarterly_values)),  
                            quarterly_values,  
                        ])  
                quarterly_values = quarterly_values[-3:]
                out[column_ix] = quarterly_values

    FCFYLD = LastTwoQuarters(inputs = [Fundamentals.fcf_yield_asof_date, Fundamentals.fcf_yield], mask=universe)  
    LTDa = LastThreeQuarters(inputs = [Fundamentals.long_term_debt_asof_date, Fundamentals.long_term_debt], mask=universe)

    class TEST(CustomFactor):  
        FCF_YLD = FCFYLD.q1 + FCFYLD.q2  
        LTDaa = LTDa.q1 + LTDa.q2 + LTDa.q3  
        TEST_FACTOR = FCF_YLD/LTDaa  
        inputs = [TEST_FACTOR]  
        window_length = 252  
        window_safe = True  
        def compute(self, today, assets, out, TEST_FACTOR):  
            out[:] = preprocess(np.nan_to_num(TEST_FACTOR))  
    class TEST1(CustomFactor):  
        FCF_YLD = FCFYLD.q1 / FCFYLD.q2  
        LTDaa = LTDa.q1 / LTDa.q2 / LTDa.q3  
        TEST_FACTOR = FCF_YLD/LTDaa  
        inputs = [TEST_FACTOR]  
        window_length = 252  
        window_safe = True  
        def compute(self, today, assets, out, TEST_FACTOR):  
            out[:] = preprocess(np.nan_to_num(TEST_FACTOR)) 


    factors = [  
        TEST,  
        TEST1  
        ]  
    return factors[rng[0]:rng[1]]  

NonWindowSafeInput: Can't compute windowed expression TEST([NumExprFactor(...)], 252) with windowed input NumExprFactor(expr='(x_2 + x_0) / ((x_4 + x_1) + (x_3))', bindings={'x_4': RecarrayField([LastThreeQuarters(...)], 0), 'x_2': RecarrayField([LastTwoQuarters(...)], 0), 'x_3': RecarrayField([LastThreeQuarters(...)], 0), 'x_0': RecarrayField([LastTwoQuarters(...)], 0), 'x_1': RecarrayField([LastThreeQuarters(...)], 0)}).

@Grant, you simply use too many factors. You are going under the following equation taken from Lecture 32:

$$R_{a,t} = a_t + b_{a,F_1}F_1 + b_{a, F_2}F_2 + \dots + b_{a, F_K}F_K$$

and the factors of that equation are subject to a power law as the number of factors increases. Each factor you add contributes less and less to the overall result. To such an extent that your first 4 or 5 factors could represent some 90% of the data. Adding a sixth or seventh factor would generate just a marginal increase at best. It would be a small percentage compared to the 90% already covered. Imagine going for 10 or more factors. You do not want your trading system to depend on the fumes of the 11th factor.

See the following post for example.
https://www.quantopian.com/posts/the-capital-asset-pricing-model-revisited#5bf5966d3f88ef0a5ae55e38

or refer to Wikipedia: https://en.wikipedia.org/wiki/Power_law

My suggestion to improve on your strategy design would be to first reduce the number of factors. Even before that, I would question the relevance of the selected factors in the trading strategy you might want to use.

Hi Guy -

So do you have a specific recommendation on how to reduce the number of factors? Effectively, I think you are saying that the weights should not be equal (and that some of them should be zero). What's the recipe?

@Grant, as you know, I look for mathematical, statistical, and probabilistic solutions. I want an equal sign in my answers. And those are hard to come by in a trading environment.

Nonetheless, you have an old standby that could help: the payoff matrix.

$$F(t) = F_0 + \sum_i (H \cdot \Delta P) = F_0 + n \cdot u \cdot \overline{PT}$$

So, the numbers you need to impact are: n, u, and PT. Your trading strategy rebalances most of its stocks every day. Your program trades some 100 stocks some 84,921 times, with a starting trade unit of 100k, on which you make, on average, 0.10414%. That is $104.14 per trade. And since frictional costs are included, this is great.

Which recipe to use is not the main question. It is what can you do that can have an impact on those 3 numbers. Can you increase the number of trades over the same time period? Can you raise your trading unit higher? Currently, it is growing at about a 3.04% rate. Can you do better? Also, can you increase your average profit per trade? Those are the numbers you have to take care of.

Your (round_trip=True) settings show that you have a 2:1 profit on your shorts, but with a 33% hit rate. Can you keep the edge and raise the hit rate? On your longs, you get a nice hit rate, but you lose more on your losses than you gain on your average win. The ratio is 0.89:1. You make money on your longs because the hit rate is providing the edge. A lower hit rate and you would lose money on your longs. I did not bother about the break-even point, but you can estimate that yourself.

You traded at least 8.5B to make your 8.8M profit. There is nothing wrong with that. And really, that is not important.

You could raise profits 10 times just by switching the stats of your shorts (my numbers say 97M). Like getting the 0.58 win rate on the shorts and keeping the same average profit-to-loss figures. It would be sufficient to exceed market averages, and win the contest with your eyes closed. Find ways to make it happen. You could increase your profits even further by giving your longs the same treatment.

Thanks Guy - You provided some good pointers. --Grant

Grant, thanks for sharing. I've used lasso regularization to reduce the number of factors but I'm not sure it's possible in the quantopian framework.

@Grant, something else that could help you in your strategy would be to identify which of the factors you intend to use have value. That is easy to determine. It goes under the notion of all other things being equal.

For instance, I have said this often, I consider the stocktwits as having no value. Yet, it is part of your library arsenal.

If you decrease its importance by reweighting it to one hundredth of the original, it should reduce the strategy's payoff matrix. The same goes the other way: if you reweight it to 100 times, you should see some improvement. If neither results in any change to the payoff matrix, you have your answer.

Go plus or minus a hundred times the initial value and you will see no significant change in the payoff matrix, confirming that the stocktwits factor added absolutely no value to your game. BTW, even if you went a million times lower, you would still be in a no-change scenario. In a way this corroborates my argument on too many factors, since the ones added beyond 5, 6 or 7 might have low marginal impact.

You could do this with any of the other factors you intend to use, reweigh them up or down one after another.

$$F(t) = F_0 (1 + \xi_1 b_{a,F_1}F_1 + \xi_2 b_{a, F_2}F_2 + \dots + \xi_K b_{a, F_K}F_K)^t$$

If each of the factors can have an impact, it will show. As you modify each of the \( \xi \) values in turn, you will see which factor might have some value. That is a lot of tests, but from it should come out which factors have any merit. From this, you could study the differential impact each factor can have. You could build more on those factors having a positive impact on the resulting payoff matrix since the others would have failed to raise the outcome.

Should you try to increase FCF 100-fold or reduce it by as much, in both cases you will get a lower payoff matrix. You would make more if you reduced it than if you increased it. Yet, a factor like free cash flow should have a positive linear relationship to the well-being of your payoff matrix. Either that or FCF is drowning in a sea of factors.

You might have to test every one of your factors in the same way in order to remove all that is either redundant or totally useless. It is up to you.

Thanks Guy -

In my mind, it is a question of how one should set the factor weights optimally (either as a "set it and forget it" approach or dynamically, computed with an online algo), given that there will be some turds in the bunch, deserving of zero weight, but other gems that justify non-zero weights.

Is it possible to calculate a trailing Sharpe ratio for each factor (say for the last X days), and then use a Sharpe ratio threshold of Y for including the factor in the latest alpha summation?

You could then optimise performance for X and Y and hopefully find a stable area where small changes in X and Y were not critical.

BTW - Thanks again! I used your framework to get my first contest entry accepted. I replaced all your factors with my own single factor.

I would very much like to use the above process in my own algo, but do not yet have the python ability.


Update: I just tested this approach in Excel, optimising the Sharpe ratio for 20 versions of the same custom factor (each with different custom factor parameters). All 20 had positive Sharpe ratios over a 27-year backtest on a single instrument (SPY).

When I used a trailing Sharpe Ratio on each custom factor to select which to include for the current day, I was surprised to find that the most successful strategy is to select factors which have a poor trailing Sharpe ratio.

On reflection, this makes sense as the Sharpe ratio for each custom factor will be mean reverting.

I can do all this in Excel very easily - it would be great if somebody knows how to do it in Python!

@ Ben -

I think the concern with a Sharpe Ratio (SR) weighting on a relatively short trailing window would be overfitting and/or churn (SR is very volatile). There's some discussion of this in Robert Carver's book, Systematic Trading; you need a very large trailing window of data to have any confidence that the SR is what you think it is. Especially for factors based on fundamental data, the number of data points is pretty small, due to the quarterly/annual frequency of fundamental data.
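That said, the mechanics Ben describes are simple enough in pandas. A sketch, assuming a daily return series per factor variant (factor_returns, X, and Y are placeholders):

import numpy as np
import pandas as pd

# factor_returns: DataFrame of daily returns, one column per factor variant
X = 60     # trailing window, in days
Y = 0.0    # Sharpe threshold

rolling_sharpe = (factor_returns.rolling(X).mean() /
                  factor_returns.rolling(X).std()) * np.sqrt(252)

# Ben's finding was that selecting the *poor* trailing-Sharpe factors worked better,
# so keep factors whose trailing Sharpe (as of yesterday, to avoid look-ahead) is below Y.
selected = rolling_sharpe.shift(1) < Y
n_selected = selected.sum(axis=1).replace(0, np.nan)
strategy_returns = (factor_returns * selected).sum(axis=1) / n_selected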

@Grant, the presented equation would still prevail.

$$F(t) = F_0 (1 + \xi_1 b_{a,F_1}F_1 + \xi_2 b_{a, F_2}F_2 + \dots + \xi_K b_{a, F_K}F_K)^t$$

You have \(K=40\) factors! You have no way to easily determine which are the turds and which are the gems, except by analyzing their contribution one at a time. And if you do that, one at a time that is, your results will not be conclusive. This has been demonstrated before and is what most have found over the years.

Your trading strategy is imposing 0.025 weights for each factor, thereby allowing each factor to represent 2.5% of total return which itself is less than 10%.

If one of the 40 weights changed by 10%, then its impact on the portfolio would be 0.0025. How will you be able to distinguish it from the whole with your current trading strategy? All it will do, all other things remaining equal, is force a rebalancing of the entire portfolio with no fundamental reason as to why. All I see is churning on market noise, where a single factor can force rebalancing, even on a 1% variation.

Whether you go dynamic or not, there are simply too many factors. You have thousands and thousands of tests to perform just to find the few that matter, and when you have found them, you will be, to put it mildly, right up there in over-fit territory. This would imply that your trading strategy will break down going forward. Totally counterproductive. And probably a total waste of time. I did elaborate on this in my book.

Take at most 5 to 7 factors. There are 658,008 combinations should you take 5 factors at a time. And 18,643,560 different combinations should you go to 7 factors. Just in case you wondered. It would take some 4,000 years to do all the tests should you be up to it (considering the time it takes to do just one).

Already, Fama-French limited themselves, first to 3, then to 4 factors. They did not have better results going higher. At least, they did not show or comment much on that part of their research.

However, with your expertise, you can do better than they ever could with the added trading. Plan for what you want to see, and then make it happen. You are playing a math game and it has long-term exponential properties.

Though I've mentioned this before, I agree with Guy's statement that there are too many factors; however, that doesn't necessarily mean dropping them to some arbitrary 5-7 number. It does mean it is necessary to find a better way to weight each factor, for precisely the reason Guy gave: the factors are essentially competing against each other with noise.

While individually stress-testing each factor is a good idea, it is not very practical even with only 5-7 factors. The bias-variance trade-off is well studied in data science, and a potential solution here is something like k-fold cross validation (i.e. dividing the data into k subsets, where one is held out as the test set and the remaining k-1 are used as the training set; the process repeats k times and the average mean squared error across all folds is returned). I would suggest testing this with 5 or 10 folds initially, then incrementally adjusting. This is a much more practical method for testing the model while typically optimizing the bias-variance trade-off.

Also if the set of factors have a high degree of multi-collinearity (as is likely the case, since some of the factors are the same but with different timespans), the results from individually testing factors would not necessarily be reliable as this would impose the parametric assumption that all other factors are kept constant. Two or more highly correlated factors would violate this assumption, and make the results of such a test unreliable.
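A bare-bones sketch of the cross-validation step (factor_df and fwd_returns are placeholders; on older sklearn versions the import lives in sklearn.cross_validation, and with serially dependent financial data a time-aware split such as TimeSeriesSplit is arguably more appropriate than plain k-fold):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = factor_df.fillna(0).values    # (observations x factors)
y = fwd_returns.values            # aligned forward returns

model = LinearRegression()
mse = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
print('mean test MSE: %.6f' % mse.mean())

# Drop one factor at a time and see whether the mean test MSE improves.
for i, name in enumerate(factor_df.columns):
    mse_i = -cross_val_score(model, np.delete(X, i, axis=1), y, cv=5,
                             scoring='neg_mean_squared_error').mean()
    print('%s dropped: %.6f' % (name, mse_i))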

Thanks Adam -

For the technique you suggest, would trailing windows of each factor (alpha vector) be required? Or would it be done on the point-in-time alpha vectors? It seems like as a first-go, one just needs the point-in-time vectors. One reason I'm asking is that I want to do the alpha combination in before_trading_start and it is relatively easy to access the vectors there. However, I don't yet have a framework for accessing trailing windows of vectors in before_trading_start.

@Grant

It's more of a tool for model validation, since we were discussing the possibility of overfitting and potential issue of too many factors. So it'll actually be something you do in Research, rather than deploy it into the actual algorithm.

If the model is misspecified, the mean MSE across the test sets would significantly exceed the expected value of (n-p-1)/(n+p+1), where n is your number of periods and p is the number of factors. In which case, we may drop factors one by one and see if the mean test MSE improves. Then take the factors that remain and put into your alpha combination framework. This would spare you from the 18,000,000+ combinations of individual stress-testing that Guy suggested, and narrow down the large list of factors to the most predictive ones.

Of course, the question remains as to the actual combination of the alphas. You could take a fundamental-reasoning approach, or machine learning, or a multitude of others, so I think this question is very open to individual judgment. Though based on the different timespans of the factors, I think a non-linear combination could work well.

Regarding the possible number of factors, it could be quite large. Consider the degrees of freedom: thousands of stocks with OHLCV data and hundreds of fundamentals, tens of sectors, time of day/week/month/year, etc. (e.g. assuming one were privy to sufficient details, there could be one factor per stock). The idea that all of the degrees of freedom could be distilled into a handful of factors doesn't sound right. The concept perhaps applies to an investment portfolio of mutual funds/ETFs, but given the number of degrees of freedom in the data, it is not clear that the same rule-of-thumb applies.

The other problem, I suspect, is that all factors don't work at all times--there is transience. But if one ignores the transience, and just picks factors that appear to be all-season, then a lot is left on the table.

I tried a machine learning algorithm to build a classifier of those alphas... but it looks like a linear combination is the best choice...

@Adam, Nomura quants acknowledged 314 factors in the document you referenced. Take their equation on page 17, \(r_{i,t} = a_0 + \beta_1 \cdot V_{i,t} + \beta_2 \cdot M_{i,t} + \gamma \cdot \Im (V_{i,t} \cdot M_{i,t})\), where they considered only 2 factors at a time with their possible partial correlations. That would be 49,141 possible combinations. If they wanted to select the best 2, they would know which ones were best only after all those 49,141 tests. Just to point out that it would still be a lot of tests to be done.

Since the factors obey a power law, each factor that is added will increase the coefficient of determination and as a consequence increase the number of tests to be performed. See the following post which elaborated on this point.

Should Nomura quants have tried to combine 7 factors using the same principles, in an exhaustive search, they would have had to try 55,825,075,869,992 combinations. Even if they considered using dynamic functions, they concluded to limit themselves to simple linear regressions on 2 factors without specifying how many tests were performed on how many different factors.

On the same basis as their equation, you could write: \(r_{j,t} = a_{0_{j,t}} (1+ g_{j,t})^t \) where the average secular market trend g(t) might outperform the Nomura factors or account for much of it. Are the Nomura factors detecting a major part of the secular trend? Because on that one, you might have had it all just buying low-cost index funds. I was not impressed by that document. One should go for more in a long-term active trading strategy.

@Guy

Great point. What I meant to say was that rather than individually weighting/testing each factor (similar to a sensitivity analysis in traditional finance), there are other methodologies to test for model misspecification and/or overfitting that may be more practical in this case. My previous post mentioned k-fold cross validation as an alternative method, since the large number of tests necessary and the potential multi-collinearity in the factors are a concern. Repeating the cross validation test by removing factors would require at most 39 tests.

The Nomura document used a "learning process" (presumably the tree method introduced in the earlier slides) to determine optimal weights for each of their 23 factors. Since this process allows for 0 weights, it can also serve the function of removing the non-predictive factors - at least by sector. Thus the lengthy process of individually testing each factor with the trillions of combinations is bypassed.

@Grant

I completely agree. The idea that the entire cross-section of thousands of stock returns can be explained by a handful of factors seems very unlikely to me as well. Not to mention that even with the hundreds of possible factors already identified, we should also consider the countless possible linear and non-linear interactions between factors.

This brings up the topic of the performance of an all-season algorithm vs an algorithm that adapts its strategies to different market conditions. The Nomura model uses different factors and weights for each sector, and performs relatively well against their benchmark. Personally, I think rather than trying to create a general model that works across the entire cross-section (as is done by Sharpe, Fama-French, etc) which tends to have more profound academic results on market efficiency, it may be more profitable to try to create a model that works for a specific subset and to choose factors with this goal.

@ lifan -

I tried a machine learning algorithm to build a classifier of those alphas... but it looks like a linear combination is the best choice...

That's interesting. Not sure what it says. One consideration is whether there is enough data, given the time scales of the factors (particularly the fundamental ones). We are dealing with < 10 years of data, when ~50-200 years would be preferred. I have to wonder if the algo I posted above is grossly over-fit, given both the shortness of the backtest, and the fact that the 2010-present economic context is historically unique (immediately following a huge global debt crisis and governmental intervention). So, an ML approach, given insufficient data, will more than likely produce inferior results (even though the ML approach might be superior given sufficient data).

@Grant, by the way, as for the debtToTotalAssets factor, I think it should be reversed - the less the better - is that correct?
As for the machine learning part, I figured it out: the training dataset I used was too small, and I updated the model too often...

Here's an update. As I mentioned in the original post, I'm mostly concerned with the algo architecture. Note that I output the individual daily alpha factors/vectors to before_trading_start and do the combination there (with some additional weighting, based on cosine similarity). Also, note that I included the possibility of non-zero static weights in Pipeline (by returning tuples, (factor, weight), with the weights presently all set to 1).

Eventually, I'd like to output trailing windows of alpha vectors to before_trading_start so if anyone has done this, I'd be interested. One approach would be to assign each factor its own Pipeline, but my understanding is that this might really bog things down.

Seems like the new method is holding up well relative to your first post.

Grant,

I believe initialize() is called once at the very beginning of the backtest/algorithm start, and perhaps you could schedule a recording function that runs at the end of each trading day. Something like:

def initialize(context):
    # Empty list to store the daily combined alpha vectors
    context.trailing_alpha = []
    schedule_function(alpha_record,
                      date_rule=date_rules.every_day(),
                      time_rule=time_rules.market_close())

def before_trading_start(context, data):
    # Do your alpha combination here, stored as context.combined_alpha
    pass

def alpha_record(context, data):
    """ Combined alpha is appended to the list at the end of each day """
    context.trailing_alpha.append(context.combined_alpha)

def rebalance(context, data):
    # Let n be the trailing window. If len(context.trailing_alpha) >= n,
    # do something with context.trailing_alpha.
    pass

@ Adam W -

Yes, recording (accumulating) the alpha factor data as the algo runs is a possibility; however, then one has to deal with the dead/transient period until enough data have accumulated. Also, I don't know that it'll do much good if only the combined alpha is available in a trailing window. I think one wants N trailing windows - one for each of the N individual factors.
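For reference, a bare-bones sketch of the kind of accumulation I have in mind - one fixed-length buffer per factor (the warm-up period at the start still applies; 'factor_pipeline' is a placeholder name):

from collections import deque
from quantopian.algorithm import pipeline_output

TRAILING_DAYS = 5    # length of each factor's trailing window

def initialize(context):
    context.factor_history = {}    # factor name -> deque of daily alpha vectors

def before_trading_start(context, data):
    results = pipeline_output('factor_pipeline')    # one column per individual factor
    for name in results.columns:
        buf = context.factor_history.setdefault(name, deque(maxlen=TRAILING_DAYS))
        buf.append(results[name])                   # today's alpha vector (a Series)

    # only combine once every buffer is full (the dead/warm-up period)
    if all(len(b) == TRAILING_DAYS for b in context.factor_history.values()):
        pass    # do the trailing-window alpha combination here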

@ lifan guo -

by the way, as for the debtToTotalAssets factor, I think it should be reversed - the less the better - is that correct?

Not sure. You might try Alphalens to see what it does. I don't have a business/finance background, but my hunch is that debt may have more of a sweet spot for a given business. The other thing to consider is that all of the financials are public information, so the question really is to what extent has the market accurately incorporated debt relative to total assets into the price of a given company. There may be a tendency for companies with high debt levels to have their stock prices beaten down to inaccurate levels, and so when the debt doesn't lead to disaster, the stock price rises. For example, this article talks about Sears as having debt problems:

https://www.investopedia.com/terms/t/totaldebttototalassets.asp

So, as a hypothesis, maybe investors are avoiding Sears like the plague, when in actuality, it is more valuable than is reflected by the stock price.

For folks interested in dinking around with lots of factors:

https://www.quantopian.com/posts/alpha-compiler
http://alphacompiler.com/

Oh I see. Just an idea, perhaps something like this? Haven't tested it.

def make_factors():

    def alpha_combination():

        class Factor1(CustomFactor):
            window_length = n1
            def compute(self, today, assets, out, *inputs):
                out[:] = ...  # Do something

        class Factor2(CustomFactor):
            window_length = n2
            def compute(self, today, assets, out, *inputs):
                out[:] = ...  # Do something

        return ...  # Combine Factor1 and Factor2

    return {'alpha_combination': alpha_combination}

Then call make_factors within Pipeline to return alpha for each day based on N factors with (n1,n2,...) trailing windows.

@ Adam W -

One limitation with Pipeline is that it will only output a Pandas DataFrame (indices are stocks and columns are data, e.g. alphas). I've been muddling my way through how to get data out to do alpha combination outside of Pipeline (see https://blog.quantopian.com/a-professional-quant-equity-workflow/ for an overall flow). I discuss some ideas on https://www.quantopian.com/posts/alpha-factor-combination-in-pipeline-how-to-fancify-it.

Within memory constraints, one has a full 5 minutes per trading day to do computations within before_trading_start (versus 10 minutes per chunk in Pipeline, which results in much less than 5 minutes per day of available compute time).

So, my thinking is that any dynamic alpha combination process should be done in before_trading_start by operating on the alpha factors (and any other data output by Pipeline for the combination step).

Is it possible to compute X hours before the actual trading? Let all the computation finish and store the chosen assets and their weights in the system; when trading comes, just load them and trade?

Hi Leo c -

Is it possible to compute X hours before the actual trading? Let all the computation finish and store the chosen assets and their weights in the system; when trading comes, just load them and trade?

In a nutshell, yes. When the algo starts, the weights should be based on t <= 0, where the algo starts at t = 0. It is possible, and is the typical approach.

@Grant,

If so, we won't face any execution timeouts or memory overflows? The system can take its own sweet time to compute?

@ Leo c -

The limits are 10 minutes for all Pipelines (chunked, not per day) and 5 minutes/day for before_trading_start (and 50 seconds for anything run during the day, during a trading minute). However, I just encountered an unexpected, mysterious initialize timeout of roughly 2 minutes or less, which seems to trump the Pipeline and before_trading_start timeouts for the use case I'm trying.

Not sure about memory...there is a limit.

@ Grant

That's very interesting. I haven't experienced any time-outs on my algorithms as I've been using static weights (trained in the Research API) but I can see how dynamic weights may require optimizing the architecture.

Could you elaborate more on what you mean by Pipeline chunks? All I could find in the documentation was the chunksize argument, which is for run_pipeline() in the Research API.

Edit: Q's FAQ mentions the module Queue is supported. Perhaps some sort of multi-threading may be useful for complex computations within the time window?

Hi Adam W -

There's some discussion about Pipeline and chunking here:

https://www.quantopian.com/posts/pipeline-timeoutexception-any-hope-for-a-fix
https://www.quantopian.com/posts/before-trading-start-timeout-fix

There's chunking in notebooks (where it can be controlled) and in the backtester (where I think it is fixed at 126 trading days).

In the backtester, there was a fairly recent change that provides 10 minutes for Pipeline to execute a chunk, and a separate 5 minutes for any computations in before_trading_start. So, I've been tinkering around with computing trailing windows of factors in Pipeline and then exporting the results to before_trading_start for the alpha combination step (see https://www.quantopian.com/posts/alpha-factor-combination-in-pipeline-how-to-fancify-it).

Grant
Initial standardization of alpha factors: I can see that you center each alpha factor and then add all the alpha factors together. I am working on looking at the alpha factors in the research environment, but thought to ask you a question: are the alpha factors (prepared in this form) sufficiently homogeneous to be able to add together in this way?

I can see that ranks can be added but I am not so sure about this wide spread of alpha results.

Hi Zeno (Anthony), Grant,

Been awhile. Standardization and normalization of data (factors) through rank and zscore are pre-processing transformations that align the data with a Bayesian (stochastic) framework (a Gaussian distribution). This is a dominant assumption in financial modeling within the linear, classical-statistics analytical framework. At the center of the problem is some kind of weighting scheme for factors and/or combined factors. There are various techniques for weighting factors, as Grant has been attempting of late through spectral clustering, and others use ML and other non-linear methods. Other schemes involve weighting relative to some other threshold such as risk parity, mean-variance, Sharpe, etc. So the challenge is pre-processing raw input data into something that is meaningful to the weighting scheme/process, whatever you choose under whatever assumptions you have, to try to achieve an optimal alpha (or objective function) without overfitting! This is where the bulk of the work is: finding the right combination that is sustainable, and that takes a lot of trial and error.

Happy NewYear!

James, well said.

Hi Zenothestoic and James -

I've been kinda "meh" on Q of late and have taken a break. That said, I'm glad to comment if folks have questions on material I've posted.

I think the question is, what is the justification for applying sklearn.preprocessing.scale to each alpha vector and then summing as a means of alpha combination? Frankly, I haven't given it much thought and have forged ahead with what seems to be common practice, based on various postings to the Q forum. I've played around with summing ranked alpha factors, but the results don't seem as good.

I'd note that z-scored data can either conform or not to a Gaussian/normal distribution. If the distribution is skewed or multimodal or whatever, z-scoring doesn't patch things up into a tidy normal distribution. This suggests that perhaps the distribution of data of each alpha vector could be analyzed first, prior to normalization, to determine how to normalize it.

It is also worth noting that "there is nothing new under the sun" as evidenced by the variety of tools available here:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

There is a comparison of methods here:

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py
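To make the comparison concrete, here's a small sketch of a few alternatives applied to a single raw alpha vector (the alpha array here is a synthetic stand-in; robust_scale needs a reasonably recent sklearn):

import numpy as np
from scipy import stats
from sklearn import preprocessing

alpha = np.random.randn(2000) ** 3    # stand-in for one raw, skewed alpha vector

z      = preprocessing.scale(alpha)                                 # z-score (what the algo does now)
ranked = stats.rankdata(alpha) / len(alpha) - 0.5                   # rank, uniform in [-0.5, 0.5]
robust = preprocessing.robust_scale(alpha.reshape(-1, 1)).ravel()   # median/IQR, outlier-resistant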

@Guy, thanks!

@Grant,

I'd note that z-scored data can either conform or not to a Gaussian/normal distribution. If the distribution is skewed or multimodal or whatever, z-scoring doesn't patch things up into a tidy normal distribution. This suggests that perhaps the distribution of data of each alpha vector could be analyzed first, prior to normalization, to determine how to normalize it

Yes, this is called trying to fit a "square peg in a round hole", a common pitfall when the problem is postulated in a stochastic Bayesian context. Your suggestion is a step in the right direction, given that the data distributions of the various factors have different spreads/ranges/frequencies of occurrence.

I am not sure then that we appear to be a single inch further forward. The whole of quantitative finance (or the vast majority of it) is based on the misconception that stock returns are normal.

Ranking makes much more intuitive sense to me, whether or not subsequent z-scores of those ranks are appropriate. Initially ranking the alpha factors seems a more factor-neutral method than an initial centering of alphas that have (potentially) vast spreads around the mean in some factors and much narrower spreads in others. But perhaps Grant's subsequent processing ameliorates that.

It did occur to me that this is (potentially) why this method appears superior to the ranking method used elsewhere but I have done no proper investigation as yet. Perhaps this method gives hidden favor to more widely dispersed measures of alpha?

That aside, I feel "dubious" about the entire optimization process. Or rather about the vast list of constraints which actually leaves very, very little wriggle room for alpha at all.

In particular I am amused or perhaps amazed at the sheer size of the constraints list we are all using.

risk_model_loadings alone has what, 32 parameters?

Add to that constraints for:

MaxGrossExposure
NetExposure
DollarNeutral
PositionConcentration

and you rather wonder what we are left with.

A Total Neutral Portfolio perhaps? Sometimes I wonder if it might be best to stay out of the market altogether and merely put the funds in the money markets since it seems that every element of risk is diversified away.

Or at least the attempt to hedge all risk is made.
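
For concreteness, the stack we are all feeding to order_optimal_portfolio looks roughly like this (sketched from memory, so treat the exact names, signatures and bounds as approximate rather than gospel):

import quantopian.optimize as opt
from quantopian.algorithm import order_optimal_portfolio

def construct_portfolio(context, alpha, risk_model_loadings):
    # alpha: combined factor values; risk_model_loadings: risk pipeline output
    objective = opt.MaximizeAlpha(alpha)
    constraints = [
        opt.MaxGrossExposure(1.0),
        opt.DollarNeutral(),  # opt.NetExposure(min, max) is the looser variant, as I recall
        opt.PositionConcentration.with_equal_bounds(min=-0.01, max=0.01),
        opt.experimental.RiskModelExposure(risk_model_loadings, version=opt.Newest),
        opt.MaxTurnover(0.15),
    ]
    order_optimal_portfolio(objective=objective, constraints=constraints)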

Anyway, the journey is an interesting one and, like life, it may be that the journey is the whole point and the inevitable end is actually not relevant.

Grant

Like you, I go "meh" on Quantopian sometimes. I do not think I am likely to be able to produce what Big Steve Cohen wants: a vast risk-free return achieved by leveraging potentially non-existent "alpha". What I do enjoy here is the quality of the research, even if it leads to no good end.

As I have stated many times, I believe the consistent, successful prediction of financial instrument returns over the long term is not an achievable goal. But it's fun making the attempt. Even if it's profoundly unprofitable.

The profitable aspect is in the fees. However mediocre the investment performance, if you are good at marketing, you will win handsomely by managing other people's money.

Unless you are trading off the back of something which greatly increases the probability of correct prediction (inside information, the bid/offer spread, whatever), in the long term hedge fund clients will suffer. They will be paying absurdly high fees for very average performance. Or, in some cases, worse, as with Jabre's recent collapse.

But as I have already said, for me the fun is the intellectual puzzle and grappling with new concepts.

@Zeno,

That aside, I feel "dubious" about the entire optimization process. Or rather about the vast list of constraints which actually leaves very, very little wriggle room for alpha at all.

I believe this is by design and as intended, given the specific trading/portfolio strategy Q is looking for: a low-volatility, dollar/market/sector-neutral, long/short implementation with common risks mitigated (by their definition), preferably highly diversified! So yes, given these constraints there is very little wiggle room for alpha, but the niche is that at this level of tradable funds, which is considered to provide liquidity to the market, transaction and borrowing costs are very minimal, and, coupled with the power of leverage, returns can be magnified while risks are contained at manageable levels. This is very consistent with how Steve Cohen operates his hedge funds.

This suggests that perhaps the distribution of data of each alpha vector could be analyzed first, prior to normalization, to determine how to normalize it.

Yes, quite. But assuming each of the many alpha vectors has a very different distribution, is it possible to normalize each of them in a different way so as to make them comparable? Use a different method to normalize each to between -1 and 1, for instance? Are there that many ways? Is this even possible?

It did occur to me that this is (potentially) why this method appears superior to the ranking method used elsewhere but I have done no proper investigation as yet. Perhaps this method gives hidden favor to more widely dispersed measures of alpha?

Just a hunch, but yes, I think something like this is going on. Ranking wipes out any distribution in alpha values altogether and creates a new distribution (that is the same for every alpha factor). For a large number of stocks, one ends up with uniform itsy-bitsy percentage-wise differences in ranked values, so distribution tails aren't really driving returns as they would for un-ranked alpha factors (where there is no limit on the extremes of the z-scores).

The idea of combining z-scoring and ranking doesn't really make sense, by the way (Q has done it in some of their examples, and it always seemed goofy). If ranking is the chosen method, then for a common stock universe across all alpha factors, the ranks will already be normalized.
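
A quick numpy illustration of why the extra z-score buys nothing over a common universe: the z-score of the ranks is the same affine transform for every factor, so the combined signal is unchanged (the inputs below are just synthetic placeholders).

import numpy as np
from scipy.stats import rankdata
from sklearn import preprocessing

np.random.seed(0)
N = 2000                            # common stock universe
f1 = np.random.lognormal(size=N)    # skewed, fat-tailed "alpha"
f2 = np.random.normal(size=N)       # well-behaved "alpha"

sum_of_ranks = rankdata(f1) + rankdata(f2)
sum_of_zscored_ranks = preprocessing.scale(rankdata(f1)) + preprocessing.scale(rankdata(f2))

# correlation is 1.0: the two combined signals order the universe identically
print(np.corrcoef(sum_of_ranks, sum_of_zscored_ranks)[0, 1])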

If ranking is the chosen method, then for a common stock universe across all alpha factors, the ranks will already be normalized.

Ah! Yes of course. Silly of me.

As I have stated many times, I believe the consistent, successful prediction of financial instrument returns over the long term is not an achievable goal. But it's fun making the attempt. Even if it's profoundly unprofitable.

Not so sure. It's probably like any other business. No one player will be successful forever, but there's a lot of money sloshing around, and at any given point in time, there will be opportunities. Nobody's selling buggy whips anymore, but smart phones are still going strong (for now).

Btw Grant, in one of your custom factors:

class Gross_Income_Margin(CustomFactor):  
    inputs = [Fundamentals.cost_of_revenue, Fundamentals.total_revenue]  
    window_length = 1  
    window_safe = True  
    def compute(self, today, assets, out, cost_of_revenue, sales):  
        gross_income_margin = sales[-1]/sales[-1] - cost_of_revenue[-1]/sales[-1]  
        out[:] = preprocess(-gross_income_margin)  

Is the definition of gross_income_margin the same as gross profit margin?

Thanks

Hi Karl -

Is the definition of gross_income_margin the same as gross profit margin?

It would seem so, but you might want to dig into the definitions of sales and cost_of_revenue just to make sure.

Thanks, Grant - the equation seems to be at odds:

gross_income_margin = sales[-1]/sales[-1] - cost_of_revenue[-1]/sales[-1]  

Edit: As that would be = 1 - cost_of_revenue[-1]/sales[-1]. Or should it be:

gross_income_margin = (sales[-1] - cost_of_revenue[-1]) / sales[-1]  

There was some wacky reason I coded things as I did. I vaguely recall that it helped with managing NaNs/Infs. If sales = 0 but cost_of_revenue > 0, then gross_income_margin is NaN rather than -Inf with the way I wrote the equation (since NaN - Inf = NaN, I believe).
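
A quick sanity check of that recollection (plain numpy semantics, nothing Quantopian-specific):

import numpy as np

sales = np.array([0.0])
cogs = np.array([5.0])

with np.errstate(divide='ignore', invalid='ignore'):
    v1 = sales/sales - cogs/sales   # 0/0 - 5/0  ->  nan - inf  ->  nan
    v2 = (sales - cogs) / sales     # -5/0       ->  -inf

print(v1, v2)  # [nan] [-inf]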

This may be unnecessary, since I take care of both NaNs and Infs here:

import numpy as np
from scipy.stats.mstats import winsorize
from sklearn import preprocessing

def preprocess(a):
    a = a.astype(np.float64)
    a[np.isinf(a)] = np.nan                 # treat +/-Inf as missing
    a = np.nan_to_num(a - np.nanmean(a))    # demean, then replace NaNs with 0
    a = winsorize(a, limits=[WIN_LIMIT, WIN_LIMIT])  # clip extremes (WIN_LIMIT from algo settings)
    return preprocessing.scale(a)           # z-score

Now that I think about it, this approach may be mucked up. Gotta run...I'll sleep on it and get back to you.

@ Karl -

Just let me know if you are stuck on anything. I've been more-or-less tuning out of Q for a variety of reasons, but have a commitment to support others (so long as it doesn't involve an extensive time commitment).

Thanks for getting back, Grant - ditto, I was meaning to help in case gross_income_margin might be a dud - noting your earlier comment.

No worries :)