Introducing the Pipeline API

Today we want to introduce you to the Pipeline. This new API opens up a world of new algorithm opportunities. You can now dynamically select a portfolio from the entire universe of 8000+ securities, enabling larger portfolios and easy evaluation of both long and short positions.

The Pipeline enables you to evaluate 8000+ securities on a daily basis using price, volume, and fundamental data; soon it will also include all data purchased from the Quantopian Store. You can calculate factors (scalar values) and filters (boolean values) to rank and select the stocks you want to track in real time throughout the day. All of the data calculated in your pipeline can be used during the day's trading.

The full documentation is here. Some highlights that you should know:

  • Factors and filters are calculated using daily pricing, volume, and fundamentals data. In backtesting, they are calculated in bulk, once per year. Although the values are calculated in advance, the platform passes only the values needed, when they are needed, to avoid any look-ahead bias. This allows the system to be performant in the face of thousands of calculations each day.
  • We've started working on a library of factors for you to get started with. These include simple moving average, VWAP, and RSI. This library will continue to grow, and since everything is in Zipline, you should feel free to contribute. The more factors you share, the faster the library will grow.
  • Custom factors can also be calculated using pricing, volume, and fundamentals data, and soon will include data purchased from the Quantopian Store.
  • Factors and filters are passed via pipeline_output and are given to you as a dataframe. The index is all securities that pass any screens you set; the columns are the factors and filters added to the pipeline (see the sketch after this list).
  • There is currently no limit to the number of factors and filters you can pass; however, there is a limit on the processing time available to you in handle_data (50 seconds) and before_trading_start (5 minutes). We have increased the amount of time allotted to before_trading_start as part of this new API. Your pipeline will begin to generate data the first time pipeline_output is called. We recommend doing this in before_trading_start, and then passing the results as needed via context.
  • An algorithm's real time updates are still limited to 500 securities. You should use the pipeline API in conjunction with update_universe to narrow down to the securities that you need minutely data on each day. We anticipate increasing this limit in the future as well.
  • The pipeline API is a work in progress. We wanted to get it to you as soon as possible, but there are known limitations. All of these are in progress and will be completed as soon as possible.
    • The pipeline API is not currently available in research, Quantopian paper trading, or IB paper or live trading.
    • Open price does not currently work when creating factors (high, low, close, and volume do).
    • All price data available to the pipeline API is split and dividend adjusted. This has shed some light on dividend data issues, including some dividends in foreign currencies.
  • We have a number of examples to help explain the details, and more will be coming. To get you started, attached is a price and volume ranking example that I wrote based on this forum post. That user request has served as a guidepost: the example calculates a couple of custom factors, filters the universe, and then longs the top 5% and shorts the bottom 5%.
  • I will be hosting a webinar on the API on Thursday, October 8th to walk through the details of the API and answer questions. Sign up to join and learn more.
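
For reference, a minimal sketch of the basic pattern (the same calls used in the example algorithms later in this thread): build and attach the pipeline in initialize, then pull its output each morning in before_trading_start.

from quantopian.algorithm import attach_pipeline, pipeline_output
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import SimpleMovingAverage

def initialize(context):
    pipe = Pipeline()
    attach_pipeline(pipe, 'my_pipeline')
    # One factor and one filter, registered as output columns.
    sma_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10)
    pipe.add(sma_10, 'sma_10')
    pipe.set_screen(sma_10 > 5)

def before_trading_start(context, data):
    # A dataframe indexed by the securities that passed the screen,
    # with one column per factor or filter added to the pipeline.
    context.output = pipeline_output('my_pipeline')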

This is the API I first talked about back in May. Over the past months, I’ve tested it with a full range of quants — from novices in the community to some of the contest winners all the way to professional quants. To all of you who’ve helped out along the way, your feedback was invaluable. Thank you.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

96 responses

This example comes from this request in the forums. At the time it was requested there wasn't an easy way to accomplish this on Quantopian. Today there is.

To quote the request:

I am stuck trying to build a stock ranking system with two signals:
1. Trading Volume/Shares Outstanding.
2. Price of current day / Price of 60 days ago.
Then rank Russell 2000 stocks every month, long the top 5%, short the bottom 5%.

Looks great, thanks!

Karen,

Congratulations on the release!

On the help page, it appears that there is no way to do computations between securities (compare A to B in some fashion). Or am I misinterpreting (I just took a quick glance)?

Grant

You definitely can; when writing a custom factor, your compute function gets a full numpy array of assets by window-days. If you want to compare the output of various factors between various assets, you could do that with pandas in before_trading_start.

But say I wanted to compute a factor for A, B, and C, comparing each to D? I think you are saying that I would get a numpy array with columns corresponding to A, B, C, & D. Then, my factor could be A compared to D, B compared to D, and C compared to D. I don't understand how this could be computed over 8,000+ securities in 5 minutes or less in before_trading_start, unless the comparison is trivial. Or would it be done overnight?

If you do it all with vectorized calculations, it should be no problem at all. If you write a loop, then you will have a problem.

is_less_than_spy = factor[factor < factor[spy_idx]] # assuming pandas  

Should take milliseconds, at worst. Calculating a linear regression of every stock vs SPY should also be doable, but I am not sure if there will be time for that, since it cannot be vectorized across all the stocks AFAIK.

EDIT2: the column/row notation above might not be right, guessing.

I missed that the time-out for before_trading_start is now 5 minutes (I recall it was 50 seconds). This should help.

The feature is great!

I have been able to cut down a few pages of code into something that fits on a single page (well, almost).

The only negative is that I have been hitting memory limits after adding too many factors.

One small question: how big is the Morningstar universe? It seems like there are many more than 8000 stocks.

Hi LMAGA,
I'm glad you like the API. We are very excited about it.

If you are willing to share your algo, we'd love to see it to try and help with the memory limits. Examples are the best way for us to work on the system. Just drop us a line at [email protected].

You are right, there are more than 8000 stocks. There are more than 20K known securities in the entire Quantopian universe, but about 8000 are active on any given day.

Great API, this will solve a problem I had in my algo. I'll try this right away.

Hi Karen,

I just noticed:

Factors and filters are calculated using daily pricing, volume and fundamentals data.

Does the API only work when running a backtest on daily bars? Or does it support running on minutely bars, but only uses daily bars for the pipeline thingy?

One concern I would have about using daily prices is that they are derived by capturing the last trade of the day, from a high-frequency stream of trades. Using one trade to represent the price for the entire day probably isn't the best practice. Do you have plans to extend the API to include minute bars?

Would you be able to provide an example of a custom factor that compares a set of stocks to a reference? For example, say I wanted to grab all NASDAQ stocks, and then compare each to QQQ. For example, I could code the current price of each stock with a z-score (using a 30 day window), and determine if it is higher or lower than the corresponding z-score for QQQ, and by how much.

Morning Grant!
The pipeline API works just fine with minute backtests; however, it does use daily data in its calculations in those backtests. The way we think of it, you are using the least granular data to narrow down the securities; then you can get minutely data to make your decisions. We do not have plans to extend to minutely data at this time. It wouldn't be practical for performance reasons.

I'll work on a correlation and/or z-score example. I don't have one on hand, but I know I have to do one. Hopefully today or tomorrow.

Thanks Karen,

Sorta makes sense to use daily prices, although as Simon points out, if you have enough memory and computations can be vectorized, then maybe you could handle the additional two orders of magnitude of minutely data.

As complement to this new API, you might try to sort out how users can get data from the research platform to the backtester/trading API (e.g. https://www.quantopian.com/posts/linking-research-output-to-fetch-csv-input).

Grant

Wow!!!!!
Finally a technical screener...

Thanks Quantopian

@Grant: low and high (plus volume) for each day are available too, not only the close price.

@Grant @Simon

You can do comparisons by writing CustomFactor code that looks like this:


import numpy as np
from numpy import nanmean

from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing


class MeanDifferenceWithSPY(CustomFactor):
    """
    Factor computing the mean difference between each asset and SPY.

    This isn't a particularly useful thing to do, but it's illustrative of how
    to do comparisons against a specific column of data.
    """
    inputs = [USEquityPricing.close]

    def compute(self, today, assets, out, close):
        # 8554 is SPY
        spy_column = close[:, assets.searchsorted(8554)]

        # Subtract the price of SPY columnwise from the 2D close array.
        # The fancy indexing on `spy_column` tells numpy to treat the
        # array as a (window_length x 1) column instead of a flat 1-D array.
        subtracted = close - spy_column[:, np.newaxis]

        # Most numpy functions can write directly into an output array,
        # which saves making an extra copy. Averaging over the window
        # (axis 0) produces one value per asset, matching the shape of `out`.
        nanmean(subtracted, axis=0, out=out)

Adding more robust support for this sort of thing is on the roadmap. There are a couple of ways the engine can do this more efficiently for you if it knows what you want to do. Most notably, it can avoid doing the searchsorted call every day, though that's a pretty inexpensive call for the orders of magnitude we're dealing with:

In [1]: arr = arange(8000) # At any given time we tend to have between 8000 and 10000 assets in existence.

In [2]: %timeit arr.searchsorted(3000)  
1000000 loops, best of 3: 415 ns per loop  

Thanks Scott,

Pretty cryptic, unfortunately. I have to assume you know what you are doing, but it is mostly gibberish to me at this point. Is there a reason it doesn't look a bit more like Pandas/numpy or something I've seen before?

Grant

Grant,

The code is all dealing with numpy arrays, and each manipulation is explained in the comments. Seems pretty clear to me.

Thanks,
fawce


Sorry, got a little cranky. Some questions:

Presumably, this line needs to be written like this:

class MyFancyFactor(CustomFactor):  

It can be called anything, but needs to have 'CustomFactor' as an argument.

This line must do something, but I don't see 'inputs' used anywhere:

inputs = [USEquityPricing.close]  

And presumably, USEquityPricing.close is giving me closing prices for all US equities, but what is the data structure? Why is it in square brackets?

And then this:

compute(self, today, assets, out, close)  

What do all of the arguments mean?

I don't see a window length? Why not? Does this mean it is looking back to around 2002 across all securities?

This presumably is a numpy array:

close[:, assets.searchsorted(8554)]  

But what does assets.searchsorted(8554) do?

With a little work, I can probably sort out what close - spy_column[:, np.newaxis] does, but I don't understand the structure of close. It seems you are just subtracting two numpy arrays? Doesn't close also contain the SPY data, so you just get zeros?

nanmean seems simple enough, but why isn't it np.nanmean (while above we have np.newaxis)?

If you are using numpy arrays without the security column labels used in a DataFrame, how do you know which column is which? I thought it was a good thing to use Pandas DataFrames, such as is returned by history()? Generally, skimming over the documentation, there appears to be a pretty steep learning curve, very different from the Pandas-based history() API. Kinda daunting, but I'll learn by example, I suppose.

See, I don't actually understand (not that Scott's response wasn't clear...I just have some things to learn).

FWIW, I plan to use pandas inside these custom factors, until such time as I cannot. I believe all that one would need to do is pandas.DataFrame(close, columns=assets) and that would be enough to start working with them. I think the main limitation is that the output factors must be the same shape as the input data, so whatever you are calculating, it pretty much needs to be calculated for exactly every asset. If you want to make summary statistics, I think maybe the better place to do that is with pandas, using raw factor data inside before_trading_start. But I haven't used this version of the API yet.
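
A small sketch of that pattern (the 30-day window and the z-score computation are illustrative assumptions):

import pandas as pd
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing

class PandasZScore(CustomFactor):
    inputs = [USEquityPricing.close]
    window_length = 30

    def compute(self, today, assets, out, close):
        # Wrap the raw numpy input in a DataFrame keyed by sid.
        df = pd.DataFrame(close, columns=assets)
        # Z-score of the latest close within the trailing window.
        out[:] = ((df.iloc[-1] - df.mean()) / df.std()).values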

Scott - is today still just the last row, or is it the vector of dates of length window length?

Karen,

Is only one pipeline allowed per algorithm?
Or can I attach more than one, e.g. to exploit different strategies in different market conditions?

Is the name mandatory?
Or can I access pipeline_output also by using a global variable (in the context)?

@Nicola

Is only one pipeline allowed per algorithm?
Or can I attach more than one, e.g. to exploit different strategies in different market conditions?

Currently only one pipeline is supported per algorithm. Names are part of the required API because we expect that we'll want to support multiple pipelines in the future.

@Simon

Scott - is today still just the last row, or is it the vector of dates of length window length?

It's still just the last row. Switching it to be an array of datetime64[ns] is on the roadmap, but fixing this correctly requires a somewhat significant change to a core Pipeline API data structure, which makes the (engineering cost / user benefit) ratio not great.

I do want to make this change sooner rather than later though, since it will be a breaking change in the API.

Karen,

"The pipeline API is not currently available in research, Quantopian or IB paper trading or live trading."

Any idea of the timetable for making this available? What is the rationale for making an API available only in Zipline? I'm sure there is one, but it eludes me.

@Grant

I think many of your questions are answered in the API Reference for CustomFactor, but there should probably be a better explanation of how CustomFactors work in the prose documentation as well.

A few responses to specific questions:

Presumably, this line needs to be written like this:
class MyFancyFactor(CustomFactor):

It can be called anything, but needs to have 'CustomFactor' as an argument.

This is just the Python syntax for defining a subclass. When you create your own factor, you do so by creating a subclass of CustomFactor that implements a compute function meeting the expected interface. That interface is described in the API Reference as follows:

Users implementing their own Factors should subclass CustomFactor and implement a method named compute with the following signature:

def compute(self, today, assets, out, *inputs):  

On each simulation date, compute will be called with the current date, an array of sids, an output array, and an input array for each expression passed as inputs to the CustomFactor constructor.

The specific types of the values passed to compute are as follows:

today : np.datetime64[ns]  
    Row label for the last row of all arrays passed as `inputs`.  
assets : np.array[int64, ndim=1]  
    Column labels for `out` and `inputs`.  
out : np.array[float64, ndim=1]  
    Output array of the same shape as `assets`.  `compute` should write  
    its desired return values into `out`.  
*inputs : tuple of np.array  
    Raw data arrays corresponding to the values of `self.inputs`.  

The essential idea is that a CustomFactor is constructed with two pieces of state: inputs, which is a list of Column objects (e.g. USEquityPricing.close), and window_length, which is an integer >= 1. Columns are essentially just fancy sentinel values that tell the underlying engine what data to provide to the factor, and window_length tells the engine how many rows of input to provide.

This line must do something, but I don't see 'inputs' used anywhere:

inputs = [USEquityPricing.close]  

And presumably, USEquityPricing.close is giving me closing prices for all US equities, but what is the data structure? Why is it in square brackets?

For factors like SimpleMovingAverage that are naturally parametrizable, it makes sense to require the user to provide inputs and window_length every time they construct a Factor. Thus SimpleMovingAverage provides no defaults, and you have to explicitly write something like SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10).

For factors like VWAP it makes sense to provide a default for inputs, but not for window_length. When you construct a VWAP instance, you generally do so like this: vwap_10 = VWAP(window_length=10). There are also factors like RSI or Latest that provide defaults for window_length but not inputs.

All of this means that there should be a nice clean way to say "this particular factor should provide a default inputs list or a default window_length". The easiest way to do this with a CustomFactor is to declare it as a class-level attribute, which is what we're doing in the statement quoted above. The line inputs = [USEquityPricing.close] is defining inputs as a list of length 1 (the square brackets are just the syntax for list literals, a la [1, 2, 3]) containing only USEquityPricing.close.
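
Concretely, the three styles of defaults look like this when constructing factors (a sketch; import paths assumed to match the other builtins in this thread):

from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import RSI, SimpleMovingAverage, VWAP

# No defaults: both inputs and window_length must be supplied.
sma_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10)

# Default inputs; window_length supplied at construction.
vwap_10 = VWAP(window_length=10)

# Default window_length; inputs supplied at construction.
rsi = RSI(inputs=[USEquityPricing.close])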

Again, the important thing to understand here is that the column objects passed as inputs are just fancy sentinel values: they tell the engine what data to pass to your compute function when it's called each day. For each such sentinel value you declare in your inputs, your compute function gets passed a numpy array containing the requested data.

You can find the source of most of the builtin Factors in Zipline if you want to see some more examples.

close[:, assets.searchsorted(8554)]  

But what does assets.searchsorted(8554) do?

Say I have a CustomFactor defined as follows:

class SomeFactor(CustomFactor):  
    inputs = [USEquityPricing.close, USEquityPricing.high, USEquityPricing.low]  
    window_length = 5

    def compute(self, today, assets, out, closes, highs, lows):  
        pass  # Do some stuff.

If I construct and add an instance of this factor to a pipeline by doing something like pipe.add(SomeFactor(), 'name'), then every simulation day my compute function will be called with a bunch of data:

  • today will be the day for which compute is being called.
  • closes, highs, and lows will all be 5 x N numpy arrays, where N is the number of assets in our database on the day in question. There are 5 rows in all the arrays because we provided a default of 5 for window_length.
  • assets will be an integer array of length N, representing column labels for closes, highs, and lows.
  • out will be an empty array of length N. The job of compute is to write output values into out. We write values directly into an output array rather than returning them to avoid making unnecessary copies. This is an uncommon idiom in Python, but very common in languages like C and Fortran that care about high performance numerical code. Many numpy functions take an out parameter for similar reasons (I use this with nanmean in my example above).
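
A tiny demonstration of the out-parameter idiom from the last bullet:

import numpy as np

data = np.arange(10.0).reshape(2, 5)
out = np.empty(5)
np.nanmean(data, axis=0, out=out)  # writes the column means into `out` in place
print out  # [ 2.5  3.5  4.5  5.5  6.5]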

The original question that prompted my example was asking how to do a computation comparing every asset to a particular known asset. Since we have a column for every asset in our database, what we want to do is get the index of the column corresponding to the asset we care about. For a sorted array (which assets always is), the fastest way to do this in numpy is searchsorted:

In [7]: import numpy as np

In [8]: x = np.array([1, 3, 5, 6]); x  
Out[8]: array([1, 3, 5, 6])

In [9]: x.searchsorted(1)  
Out[9]: 0

In [10]: x.searchsorted(5)  
Out[10]: 2  

When we do spy_column = close[:, assets.searchsorted(8554)], we're saying "pull out the column of closes corresponding to the index of asset 8554 (which is SPY)". The result of this operation will be a 1-D array containing the last window_length closes for SPY. This doesn't mutate close in any way; it's still a 2D array containing a window of closes for all the assets we know about. When we then do:

subtracted = close - spy_column[:, np.newaxis]  

we're saying "create a new 2D array containing the results of every column in close minus the closes of spy. The fancy indexing on spy_column is telling numpy to treat spy_column as a 1 x N column array. If we didn't do this, we'd get an error from Numpy telling us that our arrays are the wrong shape:

In [13]: data = arange(15).reshape(5, 3); data  
Out[13]:  
array([[ 0,  1,  2],  
       [ 3,  4,  5],  
       [ 6,  7,  8],  
       [ 9, 10, 11],  
       [12, 13, 14]])

In [14]: first_column = data[:, 0]; first_column  
Out[14]: array([ 0,  3,  6,  9, 12])

In [15]: data - first_column  
---------------------------------------------------------------------------  
ValueError                                Traceback (most recent call last)  
<ipython-input-15-574b6935bc49> in <module>()  
----> 1 data - first_column

ValueError: operands could not be broadcast together with shapes (5,3) (5,)

In [16]: data - first_column[:, np.newaxis]  
Out[16]:  
array([[0, 1, 2],  
       [0, 1, 2],  
       [0, 1, 2],  
       [0, 1, 2],  
       [0, 1, 2]])  

nanmean seems simple enough, but why isn't it np.nanmean (while above we have np.newaxis)?

This is just how I happened to write the example. If you had only done import numpy as np, then you'd have to do np.nanmean.

I'm going to adapt some of this response and add it to the main prose docs for the Pipeline API.
Hope this helps,
-Scott

I also highly recommend reading the Numpy docs on indexing as a starting point for getting familiar with numpy.

Over time I've found myself using pandas less and less and using numpy more and more, especially when I'm doing purely numerical work.
Pandas is really convenient for IO (e.g. reading data from CSVs or SQL databases), and it's good at "data science"-ey tasks like resampling and group-by operations, but it incurs a lot of overhead trying to figure out what you mean and it's got some strange behavior in a lot of corner cases. (See, for example, this bug fixed by one of our engineers this week.)

@Serge

"The pipeline API is not currently available in research, Quantopian or IB paper trading or live trading."
Any idea of the timetable for making this available? What is the rationale for making an API available only in Zipline?

This sentence was a bit poorly worded. The Pipeline API is available in the Quantopian backtester right now. It's not available in Quantopian paper trading yet. We want to get feedback on the API from folks using it in the backtester so we can make any needed tweaks before allowing it in live trading. That said, we plan on making it available in both research and paper trading Soon (TM).

Ah, ok. That makes more sense.

Thanks

Thanks Scott,

Your response moved me up the learning curve. I'm o.k. with numpy, since it is basically Python's version of MATLAB.

The idea is to output an array of length N, with labels given by assets (i.e. a scalar for each asset is output). However, it seems that a hack could be devised to re-purpose the labels and then one could output any ol' data, so long as the total number of elements is not more than N. For example, if N >= 18, I could output 3 vectors, each of length 5 using the first 15 elements. Then, for next 3 elements, I could record labels for my 3 vectors (e.g. the sid IDs for the vectors). Would something like this work?

@ Karen,

Why did you decide to limit this to daily bars? If I understand correctly, computations would not need to be done over the entire 8000+ security database. Say I wanted to look at 100 stocks over a window of 3 days, using minute bars. So, I'd have 3 x 390 x 100 = 117,000 data points, compared to 30 x 8000 = 240,000 data points for a 30-day window over all stocks. Is the idea that history() could be used in before_trading_start() to do computations over minute bars?

Also, within before_trading_start(), is there a way to catch a time-out error? Or will the algo just crash?

I don't understand "In backtesting, they are calculated in bulk, once per year." Is this done before the backtest even starts? Or will the backtest pause every year as it runs, to do the computation? And is the computation limited to 5 minutes? Or could it take longer? In backtesting, is there a way to turn off the pre-computation, to verify that the algo will run live?

Nice addition. At some point it would be nice to have it working asynchronously, like sockets do, so we are not limited by this 50-second time-out. It could keep a queue and check whether there is data; if not, just move on. The factor could "miss" some bars while computing.

This time-out is kinda dangerous.

Hi,

I want to screen stocks for pair trading based on 15 minute interval data. Does pipeline only support daily prices or can it handle minute data as well?

Best regards,
Pravin

Hi Pravin - pipeline is daily only.

Hi Fawce,

Why is it daily only? And do you have plans to expand it to minute bars? As I commented above, it would seem to be a matter of setting a limit on the number of data points, rather than daily versus minutely.

Grant

Hi, great work guys !

Is it possible to combine normalized factors (by normalized I mean take the value and normalize across all securities in my universe to between 0 and 1)?

For example, say I have a weighting system that says weight = (0.3*volatilityFactor) + (0.7*fundamentalFactor). Or does the rank function already normalize?

Thanks

heh, so I cloned it to see how it works, and rank is literally the rank in the universe. So essentially it is the normalized value. Cool.
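
A sketch of that weighted combination, assuming factor arithmetic composes as described in the docs (volatility_factor and fundamental_factor are hypothetical names):

# Rank-normalize each factor, then combine with fixed weights.
combined = 0.3 * volatility_factor.rank() + 0.7 * fundamental_factor.rank()
pipe.add(combined, 'combined_rank')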

How could one get different time horizons for the fundamentals data?

As far as I know, the fundamentals data always pertains to the most recent quarter.

For example, how can I write a factor that computes the Total Revenue over the last 12 months?

My first thought was the following approach (assuming a quarter is composed of 63 trading days), but I'm not sure it's very reliable.

class RevenueTTM(CustomFactor):
    inputs = [morningstar.income_statement.total_revenue]

    # 252 trading days in a year
    window_length = 252

    def compute(self, today, assets, out, revenue):
        # Assuming a quarter is 63 trading days, sample one day per quarter.
        out[:] = revenue[62] + revenue[125] + revenue[188] + revenue[251]

The main risk is summing the same quarter twice.
There must be a better way! Any suggestions?

Hi Costantino,

You're definitely on the right track!

Take a look at the algo I posted here: https://www.quantopian.com/posts/fundamental-history-based-algo#5616e27ea8e03d7580000155.

This algo compares the EPS from the last quarter to the current one. You could apply very similar logic in your algo to get the 4 most recent unique values. This should take care of the edge cases you are worried about!


Thanks Jamie!

It's quite a hack, but certainly a more robust solution to the problem... I hope the API will soon handle such cases more easily :-)

I'll post my algorithm for the TTM, when finished.

@Grant

The main reason that Pipelines don't support minutely data is that we depend on pre-fetching data for reasonable performance during backtesting, and the memory overhead of doing that pre-fetching is considerably worse when working with minutely data.

When you run a Pipeline that uses a 90-day trailing window, we don't just query the last 90 days every day, since that would be glacially slow because you'd spend almost all of your backtest time doing disk/network IO. Instead, what we do is load all the data we think we'll need to run your pipeline for the next ~252 days (the number of trading days in a year). We use that cache to pre-compute your entire pipeline for a year and then feed the results to your backtest on the dates when they would become available. This makes rank() computations in particular much faster, because we can pre-compute ranks for an entire year in Cython on one big block of contiguous memory.

In the simplest possible case of window_length=1, we load data in ~yearly blocks, and an optimistic lower bound on the memory overhead of a Pipeline is about 16 MB per requested column:

In [1]: from operator import mul

In [2]: reduce(mul, [  
252, # days  
8000, # assets per day  
8, # bytes per data point  
])
Out[2]: 16128000

In [3]: from humanize import naturalsize

In [4]: naturalsize(_2)  
Out[4]: '16.1 MB'  

This means that we incur at least 16 MB of memory overhead per column when operating on daily data. In practice the actual number is probably closer to 20 or 30, but 16 is a reasonable lower bound. These values are well within the tolerance of acceptable overhead on top of an algorithm writer's business logic.

Suppose now that Pipelines supported minutely data. If we wanted to get the previous day's minutes in our pipelines each day, we'd have a window_length of 390 (the number of market minutes each day). Notice, however, that there's no overlap in the trailing windows we see each day: every day requires a totally independent set of 390 bars. This means that our lower bound calculation becomes:

In [7]: reduce(mul, [  
   ...: 252, # days of pre-cached data  
   ...: 390, # rows of data per day  
   ...: 8000, # assets per day  
   ...: 8, # bytes per data point  
   ...: ])  
Out[7]: 6289920000

In [8]: naturalsize(_7)  
Out[8]: '6.3 GB'  

6.3 GB per column is well above what we can support, and we hit that just querying a day's worth of minutely data. (Also note that this means we're doing 6.3 GB of network or disk IO per backtest-year, independently of any precomputation strategies. That will be slow even if we're not paging 3/4 of our RAM to disk because we're almost out of memory.)

The obvious response here is to do less precomputation when working with minutely data. It's possible that that could be made to work, but at that point you're just querying for a trailing minutely window every day, in which case you might as well just use history().

Speaking more philosophically, there will always be tension between wanting to look at data for lots of assets and wanting to look at data with high resolution. In the presence of CPU and memory constraints, you can either work with data for lots of assets at low resolution, or you can look at data for fewer assets at higher resolution. Up until now on Quantopian, we've essentially forced our users to choose "look at data for (comparatively) few assets at high resolution". The Pipeline API, in my mind, is essentially about providing an option to access the other end of the spectrum. Note that you don't have to choose just one of these options for your entire algorithm: you can (and should!) use the Pipeline API to do a wide screen that identifies assets that you want to investigate more closely, and then use history to look at those particular assets more closely. There's a bunch of exciting work happening now on the Zipline core to make this pattern more efficient.
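
A sketch of that screen-then-zoom pattern with the APIs discussed in this thread:

def before_trading_start(context, data):
    # Wide screen: daily-resolution pipeline output for the whole market.
    results = pipeline_output('my_pipeline')
    context.selected = results.index
    update_universe(context.selected)  # narrow minutely data to these assets

def handle_data(context, data):
    # Zoom in: minute-resolution history for just the screened assets.
    prices = history(30, '1m', 'price')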

Hi, I'm just getting started with Quantopian, and am trying to get a very minimal Pipeline algorithm to work, but it doesn't appear to ever order anything. I'm sure I'm missing something basic; can someone help me out?

Thanks, this is a great site, and I love the Pipeline structure.

# Trying to get the minimum Pipeline algorithm to run.  
# This example uses pipeline to set its universe daily.

# Import the libraries we will use here  
from quantopian.algorithm import attach_pipeline, pipeline_output  
from quantopian.pipeline import Pipeline  
from quantopian.pipeline.data.builtin import USEquityPricing  
from quantopian.pipeline.factors import SimpleMovingAverage  
import math

# The initialize function is the place to create your pipeline and set trading  
# conditions such as commission and slippage.  
def initialize(context):  
    # Set execution cost assumptions. For live trading with Interactive Brokers  
    # we will assume a $1.00 minimum per trade fee, with a per share cost of $0.0075.  
    set_commission(commission.PerShare(cost=0.0075, min_trade_cost=1.00))  
    # Set market impact assumptions. We limit the simulation to trade up to 2.5%  
    # of the traded volume for any one minute, and  our price impact constant is 0.1.  
    set_slippage(slippage.VolumeShareSlippage(volume_limit=0.025, price_impact=0.10))  
    # Rebalance every Monday (or the first trading day if it's a holiday).  
    # At 11AM ET, which is 1 hour and 30 minutes after market open.  
    schedule_function(rebalance,  
                      date_rules.week_start(days_offset=0),  
                      time_rules.market_open(hours = 1, minutes = 30))  
    # Create, register and name a pipeline.  
    pipe = Pipeline()  
    attach_pipeline(pipe, 'min_pipeline_example')  
    # Construct Factors.  
    sma_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10)  
    sma_30 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30)

    # Construct a Filter.  
    prices_under_5 = (sma_10 < 5)

    # Register outputs.  
    pipe.add(sma_10, 'sma_10')  
    pipe.add(sma_30, 'sma_30')

    # Remove rows for which the Filter returns False.  
    pipe.set_screen(prices_under_5)

# Called every day before market open. This is where we access the securities  
# that made it through the pipeline.  
def before_trading_start(context, data):  
    # Pipeline_output returns the constructed dataframe.  
    context.output = pipeline_output('min_pipeline_example')  
    context.long_list = context.output.sort(['sma_30'], ascending=False).iloc[:100]  
    # Update our universe to contain the securities that we would like to long.  
    update_universe(context.long_list.index)  
# This rebalancing is called according to our schedule_function settings.  
def rebalance(context,data):  
    # Set the allocations to even weights in each portfolio.  
    weight = 1 / len(context.long_list)  
    # For each security in our universe, order positions according  
    # to our context.long_list, and sell all previously  
    # held positions not in the list.  
    for stock in context.long_list.index:  
        if stock in data:  
            order_target_percent(stock, weight)  
    for stock in context.portfolio.positions.iterkeys():  
        if stock not in context.long_list.index:  
            order_target(stock, 0)  
    # Log the orders each week.  
    log.info("This week's orders: "+", ".join([long_.symbol for long_ in context.long_list.index]))  
# The handle_data function is run every bar.  
def handle_data(context,data):  
    # Record and plot the leverage of our portfolio over time.  
    record(leverage = context.account.leverage)  
    print "Long List"  
    log.info("\n" + str(context.long_list.sort(['sma_30'], ascending=True).head(10)))  

Never mind - looks like coercing the weight variable used in order_target_percent to a float type does the trick. Thanks anyway!

@Elizabeth

Looks like the issue is here:

weight = 1 / len(context.long_list)  

Integer division in Python 2 (which is what Quantopian runs on) floors toward negative infinity by default:

In [3]: 5 / 2  
Out[3]: 2  

This means that 1 / N == 0 for any integer N > 1, so you're always ordering with a weight of 0.

In Python 3, this behavior was changed to produce float values instead:

Python 3.4.3 (default, Jul 28 2015, 18:20:59)  
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.  
>>> 5 / 2  
2.5  

You can get this behavior in Python 2 by putting from __future__ import division at the top of a file:

Python 2.7.10 (default, Jun 30 2015, 15:30:23)  
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.  
>>> from __future__ import division  
>>> 5 / 2  
2.5  

If you don't want to disable this behavior globally for whatever reason, you can also divide by a float instead of an int, which will behave as you expect:

Python 2.7.10 (default, Jun 30 2015, 15:30:23)  
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.  
>>> 5.0 / 2  
2.5  
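
Applied to the algo above, the minimal fix is:

weight = 1.0 / len(context.long_list)  # float numerator forces true division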

Ha...you fixed the issue before I finished typing a reply.

Thanks anyway :) Good to know about it running on Python 2.

@ Scott,

Thanks for the explanation. So are you planning to make history() available in before_trading_start()? Also, am I mistaken, or does before_trading_start() not run on the first day of a backtest?

Grant

Is it explained somewhere how the factor computations deal with security start and end dates? I understand that "Assets will be masked out the day after their end date." as Fawce explained on https://www.quantopian.com/posts/dollar-volume-pipeline. But what about stocks that have their start dates within the window? How are they managed as the backtest runs? It sounds like they make it through to update_universe, but is it then up to the author of the CustomFactor to sort out how to deal with securities that don't fill the entire window?

Hi, I've been working with Quantopian for a long time now, and slowly improving my coding.

This pipeline is fantastic.
A basic question: say we want to use the screen to select the top 2000 stocks (an initial universe), then create a factor (6-month share price growth, for example), take the best 20% of these, and then rank this subset by another factor, say price-to-book.

What's the most efficient way to code these?

I can see from the examples how to achieve the first two steps easily, but I'm not sure how to use the output of this with the third.
Paul

Hi PaulB,
Here is an example that I think shows what you are trying to do. There are two key things to making this happen. The first is the percentile_between function, which allows you to easily find the top 20% of any factor. The second is the mask option, to mask out the stocks you don't care about.

I ranked the results by the pb_rank and then invested in the top 200 (reinvesting monthly) but you could do a lot with the information provided in the results.
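
A condensed sketch of that pattern (the factor definition and names are assumptions; percentile_between and mask are the two pieces highlighted above):

from quantopian.pipeline import CustomFactor, Pipeline
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import AverageDollarVolume, Latest

class Growth6M(CustomFactor):
    # Share price today relative to ~6 months (126 trading days) ago.
    inputs = [USEquityPricing.close]
    window_length = 126

    def compute(self, today, assets, out, close):
        out[:] = close[-1] / close[0]

def make_pipeline():
    # Initial universe: the 2000 most liquid stocks by 30-day dollar volume.
    universe = AverageDollarVolume(window_length=30).rank(ascending=False) <= 2000
    # Best 20% of 6-month growth within that universe.
    top_growth = Growth6M(mask=universe).percentile_between(80, 100, mask=universe)
    # Rank the survivors by price-to-book.
    pb = Latest(inputs=[morningstar.valuation_ratios.pb_ratio])
    pipe = Pipeline()
    pipe.add(pb.rank(mask=top_growth), 'pb_rank')
    pipe.set_screen(top_growth)
    return pipe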

Let me know if you have any questions.

Thanks very much Karen. Very helpful!!

First, thank you for this Pipeline API, which opens the door for stock screening.

What else can we import from quantopian.pipeline.data.builtin?
I want to access the SPY closes, but couldn't using USEquityPricing[index].close, where index is the SPY sid.

I want to calculate the beta coefficient for each stock over the last X days, but I have to access the SPY closes first,
so I can compute the beta as the covariance of the returns of the stock and SPY divided by the variance of SPY.

To access the close price:

  • Add this to the pipeline in initialize (and make sure not to exclude SPY from the pipeline results when screening):

price = Latest(inputs=[USEquityPricing.close])  
pipe.add(price, 'close_price')  

  • In before_trading_start, do the following:

output = pipeline_output('pipeline')  
context.pipeline_price = output['close_price']  

  • In handle_data or before_trading_start, you can access the price like this:

context.pipeline_price[index]  

Please note that the close prices returned from Pipeline will be "as-traded" for that day (see this).

Luca,

As mentioned above, I want to access the SPY closes within the initialize function to calculate beta for stocks, not in handle_data.

Thank you

Adham,
Take a look at Scott Sanderson's first comment in this thread above ^^; he shows how to do this.

KR

I did. Although I couldn't understand the syntax he used, I tried it, and I got this message:
SecurityViolation: 0002 Security Violations(s): Accessing np.newaxis

@Adham

It looks like numpy.newaxis wasn't whitelisted in our security sandbox. I've updated the security definitions to allow newaxis, but it takes a few days for that change to propagate to the main site.

In the meantime, numpy.newaxis is actually just an alias of None, so you can use None anywhere that you'd use newaxis.

In [11]: from numpy import arange, newaxis

In [12]: x = arange(5)

In [13]: x  
Out[13]: array([0, 1, 2, 3, 4])

In [14]: x[:, newaxis]  
Out[14]:  
array([[0],  
       [1],  
       [2],  
       [3],  
       [4]])

In [15]: x[:, None]  
Out[15]:  
array([[0],  
       [1],  
       [2],  
       [3],  
       [4]])  

(See the numpy docs for details.)

Adham, if you have beta calc code, I'd love to see it. This was mine so far. For anyone interested, here's finviz beta (on the Tickers tab you could copy those tickers to quickly see, in current code, what effect a mask for low-beta stocks might have).

Sure, I will post it when I finish coding it
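
In the meantime, a hedged sketch of such a beta factor, computing cov(asset, SPY) / var(SPY) over a trailing window (the 60-day lookback is an assumption, and None stands in for np.newaxis per Scott's workaround above):

import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing

class Beta(CustomFactor):
    inputs = [USEquityPricing.close]
    window_length = 60  # assumed lookback

    def compute(self, today, assets, out, close):
        returns = close[1:] / close[:-1] - 1          # daily returns, (window - 1) x N
        spy = returns[:, assets.searchsorted(8554)]   # 8554 is SPY
        spy_dm = spy - spy.mean()                     # demeaned SPY returns
        asset_dm = returns - returns.mean(axis=0)     # demeaned asset returns
        # cov(asset, SPY) / var(SPY), columnwise.
        out[:] = (asset_dm * spy_dm[:, None]).mean(axis=0) / spy_dm.var()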

The code below returns the column of closes for the passed sid (SPY). How can I get the sid the pipeline is currently calculating, so I can get that sid's column of closes?

spy_column = close[:, assets.searchsorted(8554)]  
current_stock_column = close[:,assets.?] # what is the method ?  

As I recall, there is a general Quantopian rule that a backtest has to be repeatable (run 1 must return the same results as run 2, for the same capital and start/end dates). Would this rule out random sampling of securities using pipeline? Or is there a work-around by using a fixed random number generator seed? Just thought I'd ask in case there is a "gotcha" before I try it.

One thing I would like to see: a good way to use the pipeline on data taken at intervals greater than a day, e.g. every month for several years. I'm trying to build an algorithm using pipeline plus the long-term performance of certain assets, and I keep getting time-out errors.

Yes, I have the same timeout problems when I try to extract quarterly fundamental data.
More details in this post:
https://www.quantopian.com/posts/fundamental-history-based-algo#561fb3b1bf797603590006ac

+1 for Grant's question: contest rules regarding random sampling stocks.

Do I need to call set_universe before using pipeline to construct my own universe of securities?

One thing I'd really like to be able to do: calculate pipeline-style factors on manually chosen securities, for example, using the dividend-adjusted price history of bond ETFs to estimate the risk-free rate of return. In general, having dividend-adjusted prices seems like a big deal, and our access to that data should be as flexible as possible.

+100 to this. Dividend-adjusted prices are really important, but it's tough to use them only in Pipeline. I would really like to fit models on dividend-adjusted one-minute data.

Do we have an ETA for when it will be possible to use Pipeline in the contest? I've spent much of my free time in the past week working on a Pipeline-based algo that I'm quite excited about, and I'm itching to enter it in the contest.

Hi Christopher,
Glad to hear you have something you are excited about. We have people actively working on getting the pipeline into paper trading. I expect it's on the order of a few weeks away. Around that point you will be able to submit contest entries using it.

We are also working on dividend adjusted pricing throughout the backtester. That one is a little further out, but we are also actively working on it. (Research will come after that.)

I'd love more information about wanting to use the pipeline on manually chosen securities. I've heard this a few times, and was surprised. Is the key reason that you want to do ETFs vs companies? Or is there another reason why the pipeline filtering process doesn't work for you?

Thanks,
KR

Regarding timeout issues that a number of you have mentioned, we released some performance improvements this morning that should help. Please let me know if you continue to see issues.

Hi Karen,
so you confirm that for the next contest deadline on November 2, algorithms using Pipeline will not be accepted?

Okay, so I figured out that when creating custom factors, because "assets" is an array of IDs, it's easy to compare those IDs against a list of hand-picked IDs to create a factor that is equal to 1 if the ID is in the list, and 0 otherwise. Currently, my algo uses this to build a list of securities that's mostly selected by dollar-volume, but will always include SPY, SHY, and TLT (and will have certain other custom factors computed for those three ETFs, as well as the stocks selected by dollar-volume).
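
A sketch of that hand-picked-ID trick (the sids for SHY and TLT are assumptions; SPY's 8554 appears earlier in this thread):

import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing

class InHandPickedList(CustomFactor):
    # A data input is required, but its values are ignored.
    inputs = [USEquityPricing.close]
    window_length = 1
    hand_picked = np.array([8554, 23911, 23921])  # SPY, SHY (assumed), TLT (assumed)

    def compute(self, today, assets, out, close):
        # 1.0 if the asset's id is in the hand-picked list, 0.0 otherwise.
        out[:] = np.in1d(assets, self.hand_picked)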

Nicola,
Pipeline algos will be accepted in the contest as soon as they work in paper trading. We are working on that as fast as possible. (We want it too!)

Unfortunately I still get a Timeout Error when I try to extract single-quarter fundamental data.
Here I'm attaching the notebook for the algorithm and in the next post the backtest...

and here is the algo + backtest (please set RevenueTTM3 as the CustomFactor at line 122).

My goal is to extract the single quarterly entries from a fundamental data set with window_length = 252.
This is necessary, for example, if you need annual fundamental data or quarterly changes over the last 12 months.

Thanks for posting this algo, Costantino. We'll look into the timeout issue.

Hi Karen,

to reproduce the Timeout, do not forget to set RevenueTTM3 as the CustomFactor at line 122.

My goal is to get Trailing Twelve Months (TTM) fundamental data to create a custom factor for computing the Piotroski and Beneish scores.

If there is a more convenient way to get TTM data in a Pipeline factor, let me know... I also participated in this discussion
https://www.quantopian.com/posts/fundamental-history-based-algo#561fb3b1bf797603590006ac
and it seems there is at the moment no simpler way to do this.

Thanks!
Costantino

How does one filter out Chinese stocks using the pipeline?

I am looking to get something like:

  • Top 20% or top 20 from the pipeline
  • Bottom 20% or bottom 20
  • Between 40% and 75%, or between 100 and 150 in rank
  • 10% +/- or 20 items around a value x
  • etc.

In short: the top or bottom n% or m items, items between n% and m% or between two ranks, and a percentage or number of items around a given pivotal value.

Is it possible to add an example on how this can be done.

Also, how does one successively apply filters? E.g., apply filter 1, and on the resulting universe apply filter 2, followed by filter 3.

With the new APIs, is it possible to add better autocompletion and the ability to discover the whitelisted APIs from within the IDE?

The pipeline looks perfect to me.

My question is whether the pipeline API can be used in IB paper trading or even real-money trading. If not, what is the time frame in which we can expect it?

Hi Adam,

Currently, pipeline is supported in backtesting and Quantopian (zipline) paper trading. We are currently working on bringing it to IB paper and real money trading. There's no specific timeline but it's at the top of our list!

Karen, I initially thought pipeline overcomplex, but the more I use it, the more I appreciate it!

Anyway, I note that several of the built-in factors use EMA. I assume this is to reduce the computational burden in backtests: an EMA can be calculated from one day's history and one multiplication, whereas an SMA requires the entire window_length of history and multiplications. Is there a special implementation trick you've used for this, or do I just use a window_length of 1?

EDIT: specifically, I am trying to implement ATR using the conventional Wilder's EMA. At any point in time, you need all the historic high/low/close prices to calculate the next ATR, but the shortcut is to use the previous day's ATR and today's high/low/close. Is it possible to get yesterday's ATR in a CustomFactor?

Hi Karen,

I think that my query may be similar or at least related to Dan's above. I am trying to build a custom factor using the Williams Percent R method from talib. Here is my code:

from talib import WILLR   

# Create custom factor to calculate Williams %R  
class Williams(CustomFactor):  
    # Pre-declare inputs and window_length  
    inputs = [USEquityPricing.close, USEquityPricing.high, USEquityPricing.low]  
    window_length = 10  
    # Compute value

    def compute(self, today, assets, out, close, highs, lows):  
        #calculate Williams percent R for the stock  
        out[:] = WILLR(highs,lows,close, timeperiod=10)  

However, when I try to run this I get the following error: AssertionError: high has wrong dimensions
There was a runtime error on line 101.

Any idea where I'm going wrong?

I think WILLR probably wants a float, rather than a vector. You have to do some looping or a map operation. If you search the community for ATR you will see something similar.

Or it may be it wants a time series for one ticker, but you are passing a time series for multiple tickers.

Thanks Dan - I think it's the second thing you mention - am working on a solution now, will post if I nail it...

talib's function signatures are a bit awkward to use in Pipeline right now. There are three major issues with them that I'm aware of:

  1. TALib's functions assume that they're receiving all the data at once, and they perform rolling computations on their inputs.
  2. TALib's functions only work on one asset's worth of data at a time.
  3. Many TALib functions don't robustly handle NaNs. Since Pipeline often provides columns with leading NaNs (e.g. for a company that's only existed for 5 days when you've requested a 10-day window), you have to manually ignore columns containing NaNs.

As a concrete example, the WILLR function mentioned here has essentially the following signature/semantics:

# The actual docstring for WILLR is much less informative than this :(.  
def WILLR(high, low, close, timeperiod=14):  
    """  
    Parameters  
    ----------  
    high : ndarray[float]  
        1D array of historical highs for a single asset.  
    low : ndarray[float]  
        1D array of historical lows for a single asset.  
        Must be the same length as high.  
    close : ndarray[float]  
        1D array of historical closes for a single asset.  
        Must be the same length as high.  
    timeperiod : int, optional  
        Lookback window over which to compute the transformation.  
        Default value is 14 periods.

    Returns  
    -------  
    out : ndarray[float]  
        A 1D array of the same length as the input arrays.

        The first `timeperiod` entries will be NaN.  The remaining entries  
        contain the result of computing William's %R on a trailing window of  
        `timeperiod` entries.  
    """  

If we want to use this in Pipeline, we have to jump through a few hoops:

  1. We have to call it roughly 8000 times, once for each asset column.
  2. We have to call it on a window of length timeperiod and then extract the last entry. (I haven't looked much into the underlying TA-Lib implementations to see whether this is a significant performance hit.)
  3. We have to manually screen out columns containing NaNs if the TALib function we want to use doesn't support them effectively.

I've attached a notebook with a discussion of the issues and a full example showing how I'd do this in Pipeline.

Generally the calculations for indicators are pretty simple, and can probably be implemented in Python using NumPy using vector and matrix operations. This would probably be 100x faster than looping through thousands of TALib calls.

Generally the calculations for indicators are pretty simple, and can probably be implemented in Python using NumPy using vector and matrix operations. This would probably be 100x faster than looping through thousands of TALib calls.

This is definitely true, though the flipside of this is that most Pipelines using anything but USEquityPricing data are currently bottlenecked on IO. In general, it's a bad performance practice to be iterating over lots of numpy arrays in pure python, but there's some value in being able to quickly pull in a TA-Lib function to see if it's even worth rewriting for performance.

@Scott - I was hitting up against the exact problems you mentioned, so thank you for providing some clarity as to why
@Dan - I went with your suggestion and started from scratch using vector and matrix operations and came up with this:

import numpy as np

class Williams(CustomFactor):
    # Pre-declare inputs and window_length
    inputs = [USEquityPricing.close, USEquityPricing.high, USEquityPricing.low]
    window_length = 10

    def compute(self, today, assets, out, close, high, low):
        # Calculate Williams %R for each stock.
        highs = np.amax(high, axis=0)  # highest high over the window
        lows = np.amin(low, axis=0)    # lowest low over the window
        out[:] = (highs - close[-1]) / (highs - lows) * -100

I haven't actually checked the numbers it's producing against a second data source, but just eyeballing them they at least look like Williams %R numbers.
Finally @Dan - any thoughts on the speed of this approach? It should be pretty quick, unless np.amax and np.amin are slowing things down?

Looks pretty pythonic to me :)

Is it possible to generate OHLCV daily bars from the minutely data, and then feed them to pipeline? For example, say I wanted to smooth the minutely price data and then assign OHLC daily values based on the smoothing results, then feed them to pipeline?

Is it possible to select Dow Jones Industrial Stock with this? For instance if I wanted to pick 5 stocks out of the 30 stocks based on some factors.

Is it still the case that only one pipeline is allowed per algorithm?

Is functionality to support multiple Pipelines in a single algorithm still in the roadmap?

Multiple pipelines possibility in the future being discussed here: https://www.quantopian.com/posts/multiple-pipelines

When learning, it can be useful to know where the edges are. This is about the smallest viable working pipeline algo for scheduled ordering, I think. The screen and the import of Q500US are not required; however, without them it would take a long time to process all 8000+ stocks.

from quantopian.algorithm import attach_pipeline, pipeline_output  
from quantopian.pipeline import Pipeline  
from quantopian.pipeline.filters import Q500US

def initialize(context):  
    schedule_function(trade, date_rules.every_day(), time_rules.market_open())  
    attach_pipeline(Pipeline(screen=Q500US()), 'z')

def before_trading_start(context, data):  
    context.stocks = pipeline_output('z')

def trade(context, data):  
    for s in context.stocks.index:  
        order(s, 1)