alpha factor combination in Pipeline - how to fancify it?

posted Sep 12, 2018

Seeking Help Tools and Tips Pipeline

Does anyone have guidance on how to combine a set of Pipeline factors using something fancier than the typical sum-of-z-scores method? For example, say I wanted to weight each factor by the inverse of the variance of its returns. Or compute the minimum-variance portfolio of factors? Or perhaps compute the information coefficient for each factor, and use it to drop/add factors?
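To make the first two ideas concrete, here is a minimal sketch outside of Pipeline. It assumes you already have a (days x factors) array of per-factor returns; the function names and the synthetic data are hypothetical, and the minimum-variance weights are the unconstrained textbook version (they can go negative), not anything specific to Quantopian.

```python
import numpy as np

def inverse_variance_weights(factor_returns):
    """Weight each factor by 1/variance of its return series, normalized to sum to 1."""
    var = factor_returns.var(axis=0, ddof=1)
    w = 1.0 / var
    return w / w.sum()

def minimum_variance_weights(factor_returns):
    """Unconstrained minimum-variance weights: proportional to inv(cov) @ ones."""
    cov = np.cov(factor_returns, rowvar=False)
    ones = np.ones(cov.shape[0])
    raw = np.linalg.solve(cov, ones)
    return raw / raw.sum()

# Synthetic daily returns for 3 hypothetical factors over 250 days,
# with volatilities of 1%, 2% and 4% (illustration only).
rng = np.random.default_rng(0)
factor_returns = rng.normal(0.0, [0.01, 0.02, 0.04], size=(250, 3))

w_iv = inverse_variance_weights(factor_returns)
w_mv = minimum_variance_weights(factor_returns)
```

The least volatile factor gets the largest inverse-variance weight. The minimum-variance version additionally accounts for correlations between factors through the covariance matrix.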

28 responses

Just found this in the help:

https://www.quantopian.com/help#quantopian_pipeline_CustomFactor

class MultipleOutputs(CustomFactor):  
    inputs = [USEquityPricing.close]  
    outputs = ['alpha', 'beta']  
    window_length = N

    def compute(self, today, assets, out, close):  
        computed_alpha, computed_beta = some_function(close)  
        out.alpha[:] = computed_alpha  
        out.beta[:] = computed_beta

# Each output is returned as its own Factor upon instantiation.  
alpha, beta = MultipleOutputs()

# Equivalently, we can create a single factor instance and access each  
# output as an attribute of that instance.  
multiple_outputs = MultipleOutputs()  
alpha = multiple_outputs.alpha  
beta = multiple_outputs.beta

So, perhaps for each factor I can output the factor itself and some statistics that could be used to combine it with other factors?

@Grant,
There are two ways...both of which you've participated in before...so perhaps this is not what you want.

A. Use the expression compiler (Pipeline) to combine factors that run across the data.
The best example I know of that does this is the variants of the ML pipeline.
I use this, and until it gets passed over the data in before_trading_start, it's just a big expression of Pipeline code.
Here is the business end, which combines a bunch of factors by applying a function to the array of factors.
https://www.quantopian.com/posts/machine-learning-on-quantopian-part-3-building-an-algorithm

def make_ml_pipeline(universe, window_length=21, n_forward_days=5):  
    pipeline_columns = OrderedDict()

    # ensure that returns is the first input  
    pipeline_columns['Returns'] = Returns(  
        inputs=(USEquityPricing.open,),  
        mask=universe, window_length=n_forward_days + 1,  
    )

    # rank all the factors and put them after returns  
    pipeline_columns.update({  
        k: v.rank(mask=universe) for k, v in features.items()  
    })

    # Create our ML pipeline factor. The window_length will control how much  
    # lookback the passed in data will have.  
    pipeline_columns['ML'] = ML(  
        inputs=pipeline_columns.values(),  
        window_length=window_length + 1,  
        mask=universe,  
    )

    pipeline_columns['Sector'] = Sector()

    return Pipeline(screen=universe, columns=pipeline_columns)

B. Take the output of
context.output = algo.pipeline_output("pipe") as an output dataframe, and combine the (alpha factor) columns using any methods you want (e.g. dataframe methods) to create an alpha factor combination as a new column. This is usually done in something like a rebalance method.
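As a rough sketch of option B, the snippet below combines alpha columns of a DataFrame with a (optionally weighted) sum of z-scores. The DataFrame is a hand-made stand-in for the pipeline output, and combine_zscores, the column names, and the tickers are hypothetical.

```python
import pandas as pd

def combine_zscores(output, alpha_cols, weights=None):
    """Z-score each alpha column across assets, then take a weighted sum per asset."""
    alphas = output[alpha_cols]
    z = (alphas - alphas.mean()) / alphas.std(ddof=0)
    if weights is None:
        weights = pd.Series(1.0, index=alpha_cols)
    return z.mul(weights, axis=1).sum(axis=1)

# Stand-in for the DataFrame returned by the pipeline:
# one row per asset, one column per alpha factor.
output = pd.DataFrame(
    {"alpha_0": [1.0, 2.0, 3.0, 4.0], "alpha_1": [4.0, 1.0, 3.0, 2.0]},
    index=["AAA", "BBB", "CCC", "DDD"],
)
output["combined"] = combine_zscores(output, ["alpha_0", "alpha_1"])
```

Passing a weights Series indexed by column name is where a fancier scheme would plug in.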

alan

Thanks Alan -

Your "B" above might be the preferred approach, but my understanding is that Pipeline does not neatly support output of a trailing window of values (if I'm wrong, someone please correct me). One can only output a Pandas DataFrame with row labels corresponding to the current universe, and column labels applied within Pipeline. I suppose one could create a column for each trailing day. For example, for 2 alpha factors and 5 days, I would have columns with labels:

alpha00, alpha01, alpha02, alpha03, alpha04,  
alpha10, alpha11, alpha12, alpha13, alpha14

In a similar fashion, I would add columns for returns:

ret00, ret01, ret02, ret03, ret04,  
ret10, ret11, ret12, ret13, ret14

Then, in before_trading_start I would do the alpha combination. Alternatively, the alpha combination step could be postponed until during the trading day, if it could be done within ~50 seconds, and it were advantageous (e.g. generally, the alpha combination computation could include mid-day minutely OHLCV data).
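A minimal sketch of how the column-per-trailing-day layout could be unpacked again, assuming the labels follow the alpha<factor><day> pattern above. The DataFrame here is a synthetic stand-in for the pipeline output, and the tickers are made up.

```python
import numpy as np
import pandas as pd

N_FACTORS, N_DAYS = 2, 5  # sizes matching the labels above

# Column labels in the order alpha00..alpha04, alpha10..alpha14.
cols = ["alpha%d%d" % (f, d) for f in range(N_FACTORS) for d in range(N_DAYS)]

# Stand-in for the pipeline output: 3 assets (rows) by 10 columns.
assets = ["AAA", "BBB", "CCC"]
values = np.arange(len(assets) * len(cols), dtype=float)
output = pd.DataFrame(values.reshape(len(assets), len(cols)),
                      index=assets, columns=cols)

# (assets, factors*days) -> (assets, factors, days) -> (factors, days, assets)
window = (output[cols].values
          .reshape(len(assets), N_FACTORS, N_DAYS)
          .transpose(1, 2, 0))
```

After this, window[f, d] is the cross-section of factor f on trailing day d, which is a convenient shape for per-day statistics.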

This seems workable (and would seem to be the only data structure supported by Pandas anyway, since Pandas Panel has been deprecated...although I see that the non-Pandas xarray has developed an alternative).

Doing the potentially computationally expensive alpha combination within before_trading_start (with a full 5 minutes allocated per day) would seem to be the way to go, versus the chunked computations within Pipeline, over a 10 minute limit (where, for back testing, one gets nowhere near 5 minutes per trading day for computations).

Another potential approach would be to create a separate Pipeline for each trailing day, but this seems very awkward, and my understanding is that one could bump into overhead issues, since each Pipeline runs independently with respect to fetching chunks of data (I think).

@Grant,
Good idea for B!...I'll have to try that!

For A., it's a bit of magic to me, as it seems like Pipeline is a kind of dataflow compiler, and I could never figure out why inputs=pipeline_columns.values() works in ML above, in that the other columns are computed and ready for the ML method to be run. How does that happen?

alan

@ Alan -

I don't think I'll try to sort out how to do the alpha combination within Pipeline, since it would seem advantageous to do it in before_trading_start. Generally, the more I can do in common Python, I think the better off I'll be. I'm sure that the Q Pipeline API is pure wonderfulness, but if it is not used widely, and I can't Google, etc. for help, I lose patience.

Thanks for your feedback, by the way. Somehow, it made me realize that I could get the data required out of Pipeline.

Grant, perhaps try a routine I use in Pipeline to append data by columns:

steps = np.arange(start, end, interval)  
for s, step in enumerate(steps):  
    a = myCustomFactor(inputs=[myInputs], window_length=step, mask=myUniverse)  
    myPipeline.add(a, 'alpha' + str(s))

Thanks Karl -

I'll need something like your code to automatically label the alphas and anything else to be spit out by Pipeline.

An example of a possible data structure mapping is given in the "Deprecate Panel" section of https://pandas.pydata.org/pandas-docs/stable/dsintro.html, in the form of a MultiIndex DataFrame:

                     ItemA     ItemB     ItemC  
major      minor  
2000-01-03 A     -0.390201 -1.624062 -0.605044  
           B      1.562443  0.483103  0.583129  
           C     -1.085663  0.768159 -0.273458  
           D      0.136235 -0.021763 -0.700648  
2000-01-04 A      1.207122 -0.758514  0.878404  
           B      0.763264  0.061495 -0.876690  
           C     -1.114738  0.225441 -0.335117  
           D      0.886313 -0.047152 -1.166607  
2000-01-05 A      0.178690 -0.560859 -0.921485  
           B      0.162027  0.240767 -1.919354  
           C     -0.058216  0.543294 -0.476268  
           D     -1.350722  0.088472 -0.367236  
2000-01-06 A     -1.004168 -0.589005 -0.200312  
           B     -0.902704  0.782413 -0.572707  
           C     -0.486768  0.771931 -1.765602  
           D     -0.886348 -0.857435  1.296674

Using this example, the columns would be SIDs, the major axis would be some sort of datetime stamp, and the minor axis would be the alpha factors along with any additional data required to combine them (N alpha factors, followed by M data sets).

I've never bothered with it, but presumably Pipeline has datetime stamps? I'll have to dig into that one.

See the Pandas API Reference for MultiIndex and Advanced Indexing methods.

Thanks Karl -

That's quite a reference, with lots of examples! I'll have to search around for examples of how to do this:

DataFrame --> MultiIndex

Then I can back out how to construct the DataFrame in Pipeline so that I can create the MultiIndex in before_trading_start without too much fuss.

One possible set of Pipeline columns would be (datetime stamp followed by two alpha factors and the returns of each SID):

dt0, alpha00, alpha01, r0, dt1, alpha10, alpha11, r1, dt2, alpha20, alpha21, r2

Then, this would be converted into a MultiIndex, with the major axis the datetime stamps, and the minor axis the alphas and the SID returns, with SIDs as the column labels.
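Here is one possible way to do that conversion, as a sketch only. It assumes the wide layout above with labels of the form alpha<day><factor>, and it builds a synthetic stand-in for the pipeline output; the tickers, dates, and values are made up.

```python
import pandas as pd

sids = ["AAA", "BBB"]
dates = pd.to_datetime(["2018-09-10", "2018-09-11", "2018-09-12"])
fields = ["alpha0", "alpha1", "r"]

# Stand-in for the wide pipeline output: one row per SID,
# columns dt0, alpha00, alpha01, r0, dt1, alpha10, ...
wide = pd.DataFrame(index=sids)
for d, dt in enumerate(dates):
    wide["dt%d" % d] = dt
    wide["alpha%d0" % d] = [0.1 * d, 0.2 * d]
    wide["alpha%d1" % d] = [1.0 + d, 2.0 + d]
    wide["r%d" % d] = [0.01 * d, -0.01 * d]

# One small (fields x SIDs) frame per day, keyed by that day's timestamp.
frames = {}
for d in range(len(dates)):
    day = wide[["alpha%d0" % d, "alpha%d1" % d, "r%d" % d]].copy()
    day.columns = fields
    stamp = wide["dt%d" % d].iloc[0]
    frames[stamp] = day.T

# Stack the per-day frames: major axis = timestamp, minor axis = field.
panel = pd.concat(frames, names=["major", "minor"])
```

A lookup such as panel.loc[(pd.Timestamp("2018-09-12"), "alpha1"), "BBB"] then returns a single value, and panel.loc[pd.Timestamp("2018-09-12")] returns that day's fields for all SIDs.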

Karl -

So are you exporting only point-in-time alpha factors from Pipeline? Or a trailing window of alpha factors?

Point-in-time alpha columns, Grant. I take them from Pipeline on the start date, then append each subsequent day's point-in-time columns to a DataFrame stored on context, so that it is globally accessible throughout the backtest period until the end date, and beyond for OOS point-in-time data where relevant.

It is quite simple to initialise a DataFrame in def initialize(context):

context.myAlpha = pd.DataFrame(np.nan, index=range(0, 1), columns=['alpha0', 'alpha1', 'alpha2', 'alpha3'])

Thanks Karl -

The thing is, when you start a backtest or live trading, you have to wait N days to fill whatever trailing window you need. If I wanted a year's worth of data, then I don't think building up the data set would be workable. The trailing window needs to come out of Pipeline from the get-go.

Another degree of freedom to crack this nut would be to have a separate pipeline for each alpha factor, to help with the bookkeeping. Then one could output lagged values of the factor along the columns of its pipeline output (e.g. see https://www.quantopian.com/posts/get-lagged-output-of-pipeline-custom-factor). Then, everything could be pieced together in before_trading_start.

That's creative thinking, Grant :) I may try that too!

The attached algo is an example of how to do simple alpha combination in before_trading_start. Each alpha is a daily vector stored in the pandas dataframe as alpha_N. This is accomplished with:

def factor_pipeline():
    factors = make_factors()
    pipeline_columns = {}
    for k, f in enumerate(factors):
        pipeline_columns['alpha_' + str(k)] = f(mask=QTradableStocksUS())
    pipe = Pipeline(columns=pipeline_columns,
                    screen=QTradableStocksUS())
    return pipe

If a trailing window of alpha values is needed in before_trading_start then one tidy approach would be to use a separate pipeline for each alpha factor. Then each column could correspond to a set of lagged alpha values. For example, for the Nth alpha factor (e.g. N = 2), the columns could be labelled (for a 5-day window):

alpha_2_0, alpha_2_1, alpha_2_2, alpha_2_3, alpha_2_4

The announcement for multiple pipelines is https://www.quantopian.com/posts/multiple-pipelines-available-in-algorithms. It is worth noting:

Multiple pipelines can easily lead to a slowdown in your algorithm, because the pipeline machinery can optimize your data fetching within a single pipeline, but does not optimize data fetching across separate pipelines. In general, it's better to use a single pipeline.

Note also this announcement: https://www.quantopian.com/posts/before-trading-start-timeout-fix. It basically says that for all pipelines combined, 10 minutes per trading day is allocated (however, large chunks of data are processed, as discussed here). The change also enables a full, predictable 5 minutes per trading day for before_trading_start (and so, my thinking is that alpha combination should be done there, and not in pipeline, even though it is awkward to get the data out).

On your thinking, Grant that:

for before_trading_start ..that alpha combination should be
done there, and not in pipeline

It works fine in before_trading_start when the schedule_function is daily (as in yours), though I noticed that before_trading_start keeps running every day even if the schedule_function is weekly.

In algorithms that are scheduled weekly, perhaps the alpha combination (if outside Pipeline) may have to move further down in the order of the workflow.

Hi Karl -

The function before_trading_start is special in that a full 5 minutes is allocated for it to complete every trading day. During the trading day, in any given minute, everything has to be done in ~ 50 secs.

You can still use before_trading_start for less frequent trading. For example, say you wanted to rebalance every Thursday. Schedule a function for Wednesday just to set a flag for the next day, context.run_alpha_combination = True. Then in before_trading_start, run the alpha combination only if context.run_alpha_combination is True, and set context.run_alpha_combination = False once the run is complete.
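The flag pattern can be sketched in plain Python like this. The Context class and the day loop are hypothetical stand-ins for the platform, which would call these functions itself; set_combination_flag is the function that would be registered with schedule_function for Wednesdays.

```python
class Context(object):
    """Hypothetical stand-in for the algorithm's context object."""
    pass

def initialize(context):
    context.run_alpha_combination = False
    context.n_combinations = 0  # illustration only: counts heavy runs

def set_combination_flag(context, data):
    # On the platform this would be scheduled to run on Wednesdays.
    context.run_alpha_combination = True

def before_trading_start(context, data):
    if not context.run_alpha_combination:
        return  # cheap exit on days with no rebalance pending
    context.n_combinations += 1  # placeholder for the expensive combination
    context.run_alpha_combination = False

# Simulate one week: before_trading_start runs every day, and the
# scheduled function fires during Wednesday's session.
context = Context()
initialize(context)
for day in ["Mon", "Tue", "Wed", "Thu", "Fri"]:
    before_trading_start(context, None)
    if day == "Wed":
        set_combination_flag(context, None)
```

In this simulation the expensive step runs once, on Thursday morning, and the flag is cleared again afterwards.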

Here's a first-cut at assigning each alpha factor its own pipeline, for the purpose of exporting a trailing window of each alpha factor to before_trading_start. For the example, I've limited the number of factors to 2, and the length of the trailing window to 5 days.

Here's a different approach that uses a single Pipeline for all of the factors (more in line with the Q guidance on scaling with the number of Pipelines). Oddly, if I scale up N_FACTOR_WINDOW, I get a timeout for initialize in less than 2 minutes. I didn't realize that there is a timeout for initialize.

Grant, interesting idea. Please note that attaching multiple pipelines to an algorithm can introduce inefficiencies if you aren't careful.

The thing you want to avoid is attaching multiple pipelines that use the same data fields. For example, if you have multiple alpha factors that make use of USEquityPricing.close.latest, you should perform all logic that needs that particular data in one pipeline (or in post processing in before_trading_start()).

The problem simplifies to this: If you attach two pipelines that use the same data field, the entire dataset will be loaded twice.

Thanks Cal -

My objective here is to get data from Pipeline into before_trading_start where it would be processed. I think I'll abandon the idea of using a Pipeline per factor, and output all of the factors in one Pipeline (each of M factors having a trailing window length of N). And there will be K stocks in the QTradableStocksUS. So, we are talking about a Pandas dataframe with MxN columns and K rows. So what is a reasonable limit on MxN (given that K is fixed as the number of stocks in the QTradableStocksUS)?
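For a rough sense of scale, the raw size of such a frame is M x N x K values at 8 bytes each (float64). The numbers below are hypothetical (K = 2000 is a round placeholder, not the actual universe size), and this ignores pandas overhead and whatever intermediate memory Pipeline itself needs, so it says nothing about the platform's actual limit.

```python
def frame_bytes(m_factors, n_window, k_stocks, bytes_per_value=8):
    """Raw float64 storage for an (M*N columns) x (K rows) frame."""
    return m_factors * n_window * k_stocks * bytes_per_value

# Hypothetical case: 20 factors, a 252-day window, 2000 stocks.
size_mb = frame_bytes(20, 252, 2000) / 1e6
```

That works out to roughly 80 MB of raw values for this hypothetical case.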

Also, is there no way to set the chunk size in Pipeline in the backtester? Presumably this would help with memory management (at the expense of more data loads, I suppose, which, if taken to an extreme, would cause a timeout of Pipeline?).

Hi Cal -

Another question - would multiple Pipelines help in memory management? My thinking is that the Pipelines must be processed serially. Data are read into a buffer, processed, and the results stored in an output buffer. Once all of the Pipelines have been processed, then the algo starts outputting the results to before_trading_start. So, if the input buffer memory space is recycled as the algo steps from one Pipeline to the next, then it should be easier to avoid memory limitations, right?

Here's an architecture I've been developing. The basic idea is to output trailing windows of each of N alpha factors computed in Pipeline, and do the alpha combination in before_trading_start. I've managed to get clustering to work, which is kinda cool. My thinking is that the clustering would be used in a factor weighting scheme. For example, say there are five factors, alpha_0, alpha_1, alpha_2, alpha_3, alpha_4, and they are clustered like this:

Cluster 0: alpha_0, alpha_3, alpha_4
Cluster 1: alpha_1, alpha_2

The factors would then be combined something like this:

alpha = wc_0*(w_0*alpha_0 + w_3*alpha_3  + w_4*alpha_4) + wc_1*(w_1*alpha_1 + w_2*alpha_2)

The individual factor weights (e.g. w_0, w_1, ...) would be static, and based on some global, long-term analysis of the value of each factor in the portfolio (e.g. from Alphalens or some such thing). The cluster weights (e.g. wc_0, wc_1, etc.) would be dynamic and based on the short-term clustering result.
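The combination formula above could be written generically as follows. This is only a sketch: combine_clustered is a hypothetical helper, the alphas are a (factors x assets) array, and the weights and labels in the example are made up to match the two-cluster case described.

```python
import numpy as np

def combine_clustered(alphas, factor_weights, labels, cluster_weights):
    """Sum over factors of wc[cluster of factor] * w[factor] * alpha[factor]."""
    alphas = np.asarray(alphas, dtype=float)        # shape (n_factors, n_assets)
    w = np.asarray(factor_weights, dtype=float)     # static per-factor weights
    wc = np.asarray(cluster_weights, dtype=float)[np.asarray(labels)]
    return (wc * w) @ alphas                        # shape (n_assets,)

# Five factors, two assets; cluster 0 = {0, 3, 4}, cluster 1 = {1, 2}.
alphas = np.arange(10).reshape(5, 2)
labels = [0, 1, 1, 0, 0]
combined = combine_clustered(alphas, [1, 1, 1, 1, 1], labels, [0.5, 0.25])
```

Only labels and cluster_weights would change from day to day; the per-factor weights stay fixed.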

EDIT: See post below for a minor update to the code.

Joakim Arvidsson (Cream Mongoose)

Man that is just super cool! Are you thinking a ‘Value Cluster’, a ‘Momentum Cluster’, etc?

What are you thinking of using to determine the dynamic cluster weights? And what trailing window would you use (or would this be dynamic as well, based on something, say market volatility)?

A few options I can think of: Absolute Returns, Risk-adjusted Returns, Volatility of Returns (inverse weight), or IC (mean or risk-adjusted).

Not really sure how to get any of that in Pipeline or anywhere in the IDE for that matter, so I’m not much help unfortunately. Maybe you or someone else know?
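For the IC option, one common definition is the Spearman rank correlation between a factor's values and the subsequent returns across assets. A minimal sketch outside of Pipeline, assuming you have both as pandas Series indexed by asset (the function name and the data are hypothetical):

```python
import pandas as pd

def information_coefficient(factor, forward_returns):
    """Spearman rank IC: Pearson correlation of the two rank vectors."""
    both = pd.concat([factor, forward_returns], axis=1, keys=["f", "r"]).dropna()
    return both["f"].rank().corr(both["r"].rank())

assets = ["AAA", "BBB", "CCC", "DDD", "EEE"]
factor = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=assets)
fwd = pd.Series([0.01, 0.02, 0.03, 0.04, 0.05], index=assets)

ic_same = information_coefficient(factor, fwd)       # perfectly aligned ranks
ic_flipped = information_coefficient(factor, -fwd)   # perfectly reversed ranks
```

Computing this per day over a trailing window and averaging it (or dividing the mean by its standard deviation) would give the "mean or risk-adjusted" IC mentioned above.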

Other things to consider may be ‘momentum weight’ or ‘reversal weight.’ If a cluster has done really well during the trailing window, will it continue to do well, or will it ‘reverse?’ Is the trend increasing or decreasing, from a high or low value? Maybe use some sort of smoothing (eg SMA or EWMA) of whatever you’re using? Maybe use both a ‘momentum weight’ (longer trailing window) and a ‘reversal weight’ (shorter trailing window)?

Pretty obvious stuff maybe, just sharing my thoughts. :)

Thanks Joakim -

One thought is that clustering may be a way of avoiding having too much weight on a given alpha source, should multiple factors be tapping into it. For example, have a look at https://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html. You can see that companies in similar industries are clustered together. The same should be true of alpha factors, I figure (even if the reason for the clustering may not be so easy to sort out). Basically, I think one could think of all of the factors that land in a given cluster to be from the same hidden sector.

Here's a minor update. I'm still getting up the learning curve, but I think the clustering should be done like this:

clustering = SpectralClustering(n_clusters=3,assign_labels="discretize",random_state=0).fit(alphas_flattened)

By default, SpectralClustering takes the raw data and builds an affinity matrix from it internally (whatever that is...); it does not expect a precomputed similarity matrix as input.
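For intuition, an affinity matrix is just a square matrix of non-negative pairwise similarities between the items being clustered. The sketch below builds one by hand from absolute correlations between flattened alpha vectors. This is an assumption for illustration, not what SpectralClustering computes by default, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
base_a = rng.normal(size=200)
base_b = rng.normal(size=200)

# Three synthetic flattened alpha vectors: rows 0 and 2 share a common
# source, row 1 comes from a different one.
alphas_flattened = np.vstack([
    base_a + 0.1 * rng.normal(size=200),
    base_b + 0.1 * rng.normal(size=200),
    base_a + 0.1 * rng.normal(size=200),
])

# Pairwise |correlation| as a symmetric, non-negative affinity matrix.
affinity = np.abs(np.corrcoef(alphas_flattened))
```

A matrix like this could be handed to SpectralClustering by constructing it with affinity="precomputed" and calling fit(affinity) instead of fit(alphas_flattened).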

I published a complete algo on https://www.quantopian.com/posts/alpha-combination-via-clustering. Thanks all for the helpful input.

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian.

In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.
