Introducing the Pipeline API

Today we want to introduce you to the Pipeline. This new API opens up a world of new algorithm opportunities. You can now dynamically select a portfolio from the entire universe of 8000+ securities, enabling larger portfolios and easy evaluation of both long and short positions.

The Pipeline enables you to evaluate 8000+ securities on a daily basis using price, volume, and fundamental data; soon it will also include all data purchased from the Quantopian Store. You can calculate factors (scalar values) and filters (boolean values) to rank and select the stocks you want to track in real time throughout the day. All of the data calculated in your pipeline can be used during the day's trading.

The full documentation is here. Some highlights that you should know:

  • Factors and filters are calculated using daily pricing, volume, and fundamentals data. In backtesting, they are calculated in bulk, once per year. Although the values are calculated in advance, the platform passes only the values needed, when they are needed, to avoid any look-ahead bias. This allows the system to be performant in the face of thousands of calculations each day.
  • We've started working on a library of factors for you to get started with. These include simple moving average, VWAP, and RSI. This library will continue to grow, and since everything is in Zipline, you should feel free to contribute. The more factors you share, the faster the library will grow.
  • Custom factors can also be calculated using pricing, volume, and fundamentals data, and soon will include data purchased from the Quantopian Store.
  • Factors and filters are passed via pipeline_output and are given to you as a dataframe. The index is all securities that pass any screens you set; the columns are the factors and filters added to the pipeline (see the sketch after this list).
  • There is currently no limit to the number of factors and filters you can pass; however, there is a limit on the processing time available to you in handle_data (50 seconds) and before_trading_start (5 minutes). We have increased the amount of time allotted to before_trading_start as part of this new API. Your pipeline will begin to generate data the first time pipeline_output is called. We recommend doing this in before_trading_start, and then passing the results as needed via context.
  • An algorithm's real time updates are still limited to 500 securities. You should use the pipeline API in conjunction with update_universe to narrow down to the securities that you need minutely data on each day. We anticipate increasing this limit in the future as well.
  • The pipeline API is a work in progress. We wanted to get it to you as soon as possible, but there are known limitations. All of these are in progress and will be completed as soon as possible.
    • The pipeline API is not currently available in research, Quantopian paper trading, or IB paper or live trading.
    • Open price does not currently work when creating factors (high, low, close, and volume do).
    • All price data available to the pipeline API is split and dividend adjusted. This has shed some light on dividend data issues, including some dividends in foreign currencies.
  • We have a number of examples to help explain the details, and more will be coming. To get you started, attached is a price and volume ranking example that I wrote based on this forum post. That user request has served as a guidepost: the example calculates a couple of custom factors, filters the universe, and then longs the top 5% and shorts the bottom 5%.
  • I will be hosting a webinar on the API on Thursday, October 8th to walk through the details of the API and answer questions. Sign up to join and learn more.
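
For reference, a minimal sketch of the basic pattern (the same calls used in the example algorithms later in this thread): build and attach the pipeline in initialize, then pull its output each morning in before_trading_start.

from quantopian.algorithm import attach_pipeline, pipeline_output
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import SimpleMovingAverage

def initialize(context):
    pipe = Pipeline()
    attach_pipeline(pipe, 'my_pipeline')
    # One factor and one filter, registered as output columns.
    sma_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10)
    pipe.add(sma_10, 'sma_10')
    pipe.set_screen(sma_10 > 5)

def before_trading_start(context, data):
    # A dataframe indexed by the securities that passed the screen,
    # with one column per factor or filter added to the pipeline.
    context.output = pipeline_output('my_pipeline')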

This is the API I first talked about back in May. Over the past months, I’ve tested it with a full range of quants — from novices in the community to some of the contest winners all the way to professional quants. To all of you who’ve helped out along the way, your feedback was invaluable. Thank you.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

96 responses

This example comes from this request in the forums. At the time it was requested there wasn't an easy way to accomplish this on Quantopian. Today there is.

To quote the request:

I am stuck trying to build a stock ranking system with two signals:
1. Trading Volume/Shares Outstanding.
2. Price of current day / Price of 60 days ago.
Then rank Russell 2000 stocks every month, long the top 5%, short the bottom 5%.

Looks great, thanks!

Karen,

Congratulations on the release!

On the help page, it appears that there is no way to do computations between securities (compare A to B in some fashion). Or am I misinterpreting (I just took a quick glance)?

Grant

You definitely can; when writing a custom factor, your compute function gets a full numpy array of assets by window-days. If you want to compare the output of various factors between various assets, you could do that with pandas in before_trading_start.

But say I wanted to compute a factor for A, B, and C, comparing each to D? I think you are saying that I would get a numpy array with columns corresponding to A, B, C, & D. Then, my factor could be A compared to D, B compared to D, and C compared to D. I don't understand how this could be computed over 8,000+ securities in 5 minutes or less in before_trading_start, unless the comparison is trivial. Or would it be done overnight?

If you do it all with vectorized calculations, it should be no problem at all. If you write a loop, then you will have a problem.

is_less_than_spy = factor[factor < factor[spy_idx]] # assuming pandas  

Should take milliseconds, at worst. Calculating a linear regression of every stock vs SPY should also be doable, but I am not sure if there will be time for that, since it cannot be vectorized across all the stocks AFAIK.

EDIT2: the column/row notation above might not be right, guessing.

I missed that the time-out for before_trading_start is now 5 minutes (I recall it was 50 seconds). This should help.

The feature is great!

I have been able to cut down a few pages of code into something that fits on a single page (well, almost).

The only negative is that I have been hitting memory limits after adding too many factors.

One small question: how big is the Morningstar universe? It seems like there are many more than 8000 stocks.

Hi LMAGA,
I'm glad you like the API. We are very excited about it.

If you are willing to share your algo, we'd love to see it to try and help with the memory limits. Examples are the best way for us to work on the system. Just drop us a line at [email protected].

You are right, there are more than 8000 stocks. There are more than 20K known securities in the entire Quantopian universe, but about 8000 are active on any given day.

Great API, this will solve a problem I had in my algo. I'll try this right away.

Hi Karen,

I just noticed:

Factors and filters are calculated using daily pricing, volume and fundamentals data.

Does the API only work when running a backtest on daily bars? Or does it support running on minutely bars, but only uses daily bars for the pipeline thingy?

One concern I would have about using daily prices is that they are derived by capturing the last trade of the day, from a high-frequency stream of trades. Using one trade to represent the price for the entire day probably isn't the best practice. Do you have plans to extend the API to include minute bars?

Would you be able to provide an example of a custom factor that compares a set of stocks to a reference? For example, say I wanted to grab all NASDAQ stocks, and then compare each to QQQ. For example, I could code the current price of each stock with a z-score (using a 30 day window), and determine if it is higher or lower than the corresponding z-score for QQQ, and by how much.

Morning Grant!
The pipeline API works just fine with minute backtests; however, it does use daily data in its calculations in those backtests. The way we think of it, you are using the least granular data to narrow down the securities; then you can get minutely data to make your decisions. We do not have plans to extend to minutely data at this time. It wouldn't be practical for performance reasons.

I'll work on a correlation and/or z-score example. I don't have one on hand, but I know I have to do one. Hopefully today or tomorrow.

Thanks Karen,

Sorta makes sense to use daily prices, although as Simon points out, if you have enough memory and computations can be vectorized, then maybe you could handle the additional two orders of magnitude of minutely data.

As complement to this new API, you might try to sort out how users can get data from the research platform to the backtester/trading API (e.g. https://www.quantopian.com/posts/linking-research-output-to-fetch-csv-input).

Grant

Wow!!!!!
Finally a technical screener...

Thanks Quantopian

@Grant: low and high (plus volume) for each day are available too, not only the close price.

@Grant @Simon

You can do comparisons by writing CustomFactor code that looks like this:


import numpy as np
from numpy import nanmean

from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing


class MeanDifferenceWithSPY(CustomFactor):
    """
    Factor computing the mean difference between each asset and SPY.

    This isn't a particularly useful thing to do, but it's illustrative of how
    to do comparisons against a specific column of data.
    """
    inputs = [USEquityPricing.close]

    def compute(self, today, assets, out, close):
        # 8554 is SPY
        spy_column = close[:, assets.searchsorted(8554)]

        # Subtract the price of SPY columnwise from the 2D close array.
        # The fancy indexing on `spy_column` tells numpy to treat the
        # array as a (window_length x 1) column instead of a flat 1-D array.
        subtracted = close - spy_column[:, np.newaxis]

        # Most numpy functions can write directly into an output array,
        # which saves making an extra copy. Averaging over the window
        # (axis 0) produces one value per asset, matching the shape of `out`.
        nanmean(subtracted, axis=0, out=out)

Adding more robust support for this sort of thing is on the roadmap. There are a couple of ways the engine can do this more efficiently for you if it knows what you want to do. Most notably, it can avoid doing the searchsorted call every day, though that's a pretty inexpensive call for the orders of magnitude we're dealing with:

In [1]: arr = arange(8000) # At any given time we tend to have between 8000 and 10000 assets in existence.

In [2]: %timeit arr.searchsorted(3000)  
1000000 loops, best of 3: 415 ns per loop  

Thanks Scott,

Pretty cryptic, unfortunately. I have to assume you know what you are doing, but it is mostly gibberish to me at this point. Is there a reason it doesn't look a bit more like Pandas/numpy or something I've seen before?

Grant

Grant,

The code is all dealing with numpy arrays, and each manipulation is explained in the comments. Seems pretty clear to me.

Thanks,
fawce


Sorry, got a little cranky. Some questions:

Presumably, this line needs to be written like this:

class MyFancyFactor(CustomFactor):  

It can be called anything, but needs to have 'CustomFactor' as an argument.

This line must do something, but I don't see 'inputs' used anywhere:

inputs = [USEquityPricing.close]  

And presumably, USEquityPricing.close is giving me closing prices for all US equities, but what is the data structure? Why is it in square brackets?

And then this:

compute(self, today, assets, out, close)  

What do all of the arguments mean?

I don't see a window length? Why not? Does this mean it is looking back to around 2002 across all securities?

This presumably is a numpy array:

close[:, assets.searchsorted(8554)]  

But what does assets.searchsorted(8554) do?

With a little work, I can probably sort out what close - spy_column[:, np.newaxis] does, but I don't understand the structure of close. It seems you are just subtracting two numpy arrays? Doesn't close also contain the SPY data, so you just get zeros?

nanmean seems simple enough, but why isn't it np.nanmean (while above we have np.newaxis)?

If you are using numpy arrays without the security column labels used in a DataFrame, how do you know which column is which? I thought it was a good thing to use Pandas DataFrames, such as is returned by history()? Generally, skimming over the documentation, there appears to be a pretty steep learning curve, very different from the Pandas-based history() API. Kinda daunting, but I'll learn by example, I suppose.

See, I don't actually understand (not that Scott's response wasn't clear...I just have some things to learn).

FWIW, I plan to use pandas inside these custom factors, until such time as I cannot. I believe all that one would need to do is pandas.DataFrame(close, columns=assets) and that would be enough to start working with them. I think the main limitation is that the output factors must be the same shape as the input data, so whatever you are calculating, it pretty much needs to be calculated for exactly every asset. If you want to make summary statistics, I think maybe the better place to do that is with pandas, using raw factor data inside before_trading_start. But I haven't used this version of the API yet.
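
A small sketch of that pattern (the 30-day window and the z-score computation are illustrative assumptions):

import pandas as pd
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing

class PandasZScore(CustomFactor):
    inputs = [USEquityPricing.close]
    window_length = 30

    def compute(self, today, assets, out, close):
        # Wrap the raw numpy input in a DataFrame keyed by sid.
        df = pd.DataFrame(close, columns=assets)
        # Z-score of the latest close within the trailing window.
        out[:] = ((df.iloc[-1] - df.mean()) / df.std()).values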

Scott - is today still just the last row, or is it the vector of dates of length window length?

Karen,

Is only one pipeline allowed per algorithm?
Or can I attach more than one, e.g. to exploit different strategies in different market conditions?

Is the name mandatory?
Or can I access pipeline_output also by using a global variable (in the context)?

@Nicola

Is only one pipeline allowed per algorithm?
Or can I attach more than one, e.g. to exploit different strategies in different market conditions?

Currently only one pipeline is supported per algorithm. Names are part of the required API because we expect that we'll want to support multiple pipelines in the future.

@Simon

Scott - is today still just the last row, or is it the vector of dates of length window length?

It's still just the last row. Switching it to be an array of datetime64[ns] is on the roadmap, but fixing this correctly requires a somewhat significant change to a core Pipeline API data structure, which makes the (engineering cost / user benefit) ratio not great.

I do want to make this change sooner rather than later though, since it will be a breaking change in the API.

Karen,

"The pipeline API is not currently available in research, Quantopian or IB paper trading or live trading."

Any idea of the timetable for making this available? What is the rationale for making an API available only in Zipline? I'm sure there is one, but it eludes me.

@Grant

I think many of your questions are answered in the API Reference for CustomFactor, but there should probably be a better explanation of how CustomFactors work in the prose documentation as well.

A few responses to specific questions:

Presumably, this line needs to be written like this:
class MyFancyFactor(CustomFactor):

It can be called anything, but needs to have 'CustomFactor' as an argument.

This is just the Python syntax for defining a subclass. When you create your own factor, you do so by creating a subclass of CustomFactor that implements a compute function meeting the expected interface. That interface is described in the API Reference as follows:

Users implementing their own Factors should subclass CustomFactor and implement a method named compute with the following signature:

def compute(self, today, assets, out, *inputs):  

On each simulation date, compute will be called with the current date, an array of sids, an output array, and an input array for each expression passed as inputs to the CustomFactor constructor.

The specific types of the values passed to compute are as follows:

today : np.datetime64[ns]  
    Row label for the last row of all arrays passed as `inputs`.  
assets : np.array[int64, ndim=1]  
    Column labels for `out` and `inputs`.  
out : np.array[float64, ndim=1]  
    Output array of the same shape as `assets`.  `compute` should write  
    its desired return values into `out`.  
*inputs : tuple of np.array  
    Raw data arrays corresponding to the values of `self.inputs`.  

The essential idea is that a CustomFactor is constructed with two pieces of state: inputs, which is a list of Column objects (e.g. USEquityPricing.close), and window_length, which is an integer >= 1. Columns are essentially just fancy sentinel values that tell the underlying engine what data to provide to the factor, and window_length tells the engine how many rows of input to provide.

This line must do something, but I don't see 'inputs' used anywhere:

inputs = [USEquityPricing.close]  

And presumably, USEquityPricing.close is giving me closing prices for all US equities, but what is the data structure? Why is it in square brackets?

For factors like SimpleMovingAverage that are naturally parametrizable, it makes sense to require the user to provide inputs and window_length every time they construct a Factor. Thus SimpleMovingAverage provides no defaults, and you have to explicitly write something like SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10).

For factors like VWAP it makes sense to provide a default for inputs, but not for window_length. When you construct a VWAP instance, you generally do so like this: vwap_10 = VWAP(window_length=10). There are also factors like RSI or Latest that provide defaults for window_length but not inputs.

All of this means that there should be a nice clean way to say "this particular factor should provide a default inputs list or a default window_length". The easiest way to do this with a CustomFactor is to declare it as a class-level attribute, which is what we're doing in the statement quoted above. The line inputs = [USEquityPricing.close] is defining inputs as a list of length 1 (the square brackets are just the syntax for list literals, a la [1, 2, 3]) containing only USEquityPricing.close.
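
Concretely, the three styles of defaults look like this when constructing factors (a sketch; import paths assumed to match the other builtins in this thread):

from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import RSI, SimpleMovingAverage, VWAP

# No defaults: both inputs and window_length must be supplied.
sma_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10)

# Default inputs; window_length supplied at construction.
vwap_10 = VWAP(window_length=10)

# Default window_length; inputs supplied at construction.
rsi = RSI(inputs=[USEquityPricing.close])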

Again, the important thing to understand here is that the column objects passed as inputs are just fancy sentinel values: they tell the engine what data to pass to your compute function when it's called each day. For each such sentinel value you declare in your inputs, your compute function gets passed a numpy array containing the requested data.

You can find the source of most of the builtin Factors in Zipline if you want to see some more examples.

close[:, assets.searchsorted(8554)]  

But what does assets.searchsorted(8554) do?

Say I have a CustomFactor defined as follows:

class SomeFactor(CustomFactor):  
    inputs = [USEquityPricing.close, USEquityPricing.high, USEquityPricing.low]  
    window_length = 5

    def compute(self, today, assets, out, closes, highs, lows):  
        pass  # Do some stuff.

If I construct and add an instance of this factor to a pipeline by doing something like pipe.add(SomeFactor(), 'name'), then every simulation day my compute function will be called with a bunch of data:

  • today will be the day for which compute is being called.
  • closes, highs, and lows will all be 5 x N numpy arrays, where N is the number of assets in our database on the day in question. There are 5 rows in all the arrays because we provided a default of 5 for window_length.
  • assets will be an integer array of length N, representing column labels for closes, highs, and lows.
  • out will be an empty array of length N. The job of compute is to write output values into out. We write values directly into an output array rather than returning them to avoid making unnecessary copies. This is an uncommon idiom in Python, but very common in languages like C and Fortran that care about high performance numerical code. Many numpy functions take an out parameter for similar reasons (I use this with nanmean in my example above).
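
A tiny demonstration of the out-parameter idiom from the last bullet:

import numpy as np

data = np.arange(10.0).reshape(2, 5)
out = np.empty(5)
np.nanmean(data, axis=0, out=out)  # writes the column means into `out` in place
print out  # [ 2.5  3.5  4.5  5.5  6.5]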

The original question that prompted my example was asking how to do a computation comparing every asset to a particular known asset. Since we have a column for every asset in our database, what we want to do is get the index of the column corresponding to the asset we care about. For a sorted array (which assets always is), the fastest way to do this in numpy is searchsorted:

In [7]: import numpy as np

In [8]: x = np.array([1, 3, 5, 6]); x  
Out[8]: array([1, 3, 5, 6])

In [9]: x.searchsorted(1)  
Out[9]: 0

In [10]: x.searchsorted(5)  
Out[10]: 2  

When we do spy_column = close[:, assets.searchsorted(8554)], we're saying "pull out the column of closes corresponding to the index of asset 8554 (which is SPY)". The result of this operation will be a 1-D array containing the last window_length closes for SPY. This doesn't mutate close in any way; it's still a 2D array containing a window of closes for all the assets we know about. When we then do:

subtracted = close - spy_column[:, np.newaxis]  

we're saying "create a new 2D array containing the results of every column in close minus the closes of spy. The fancy indexing on spy_column is telling numpy to treat spy_column as a 1 x N column array. If we didn't do this, we'd get an error from Numpy telling us that our arrays are the wrong shape:

In [13]: data = arange(15).reshape(5, 3); data  
Out[13]:  
array([[ 0,  1,  2],  
       [ 3,  4,  5],  
       [ 6,  7,  8],  
       [ 9, 10, 11],  
       [12, 13, 14]])

In [14]: first_column = data[:, 0]; first_column  
Out[14]: array([ 0,  3,  6,  9, 12])

In [15]: data - first_column  
---------------------------------------------------------------------------  
ValueError                                Traceback (most recent call last)  
<ipython-input-15-574b6935bc49> in <module>()  
----> 1 data - first_column

ValueError: operands could not be broadcast together with shapes (5,3) (5,)

In [16]: data - first_column[:, np.newaxis]  
Out[16]:  
array([[0, 1, 2],  
       [0, 1, 2],  
       [0, 1, 2],  
       [0, 1, 2],  
       [0, 1, 2]])  

nanmean seems simple enough, but why isn't it np.nanmean (while above we have np.newaxis)?

This is just how I happened to write the example. If you had only done import numpy as np, then you'd have to do np.nanmean.

I'm going to adapt some of this response and add it to the main prose docs for the Pipeline API.
Hope this helps,
-Scott

I also highly recommend reading the Numpy docs on indexing as a starting point for getting familiar with numpy.

Over time I've found myself using pandas less and less and using numpy more and more, especially when I'm doing purely numerical work.
Pandas is really convenient for IO (e.g. reading data from CSVs or SQL databases), and it's good at "data science"-ey tasks like resampling and group-by operations, but it incurs a lot of overhead trying to figure out what you mean and it's got some strange behavior in a lot of corner cases. (See, for example, this bug fixed by one of our engineers this week.)

@Serge

"The pipeline API is not currently available in research, Quantopian or IB paper trading or live trading."
Any idea of the timetable for making this available? What is the rationale for making an API available only in Zipline?

This sentence was a bit poorly worded. The Pipeline API is available in the Quantopian backtester right now. It's not available in Quantopian paper trading yet. We want to get feedback on the API from folks using it in the backtester so we can make any needed tweaks before allowing it in live trading. That said, we plan on making it available in both research and paper trading Soon (TM).

Ah, ok. That makes more sense.

Thanks

Thanks Scott,

Your response moved me up the learning curve. I'm o.k. with numpy, since it is basically Python's version of MATLAB.

The idea is to output an array of length N, with labels given by assets (i.e. a scalar for each asset is output). However, it seems that a hack could be devised to re-purpose the labels and then one could output any ol' data, so long as the total number of elements is not more than N. For example, if N >= 18, I could output 3 vectors, each of length 5 using the first 15 elements. Then, for next 3 elements, I could record labels for my 3 vectors (e.g. the sid IDs for the vectors). Would something like this work?

@ Karen,

Why did you decide to limit this to daily bars? If I understand correctly, computations would not need to be done over the entire 8000+ security database. Say I wanted to look at 100 stocks over a window of 3 days, using minute bars. So, I'd have 3 x 390 x 100 = 117,000 data points, compared to 30 x 8000 = 240,000 data points for a 30-day window over all stocks. Is the idea that history() could be used in before_trading_start() to do computations over minute bars?

Also, within before_trading_start(), is there a way to catch a time-out error? Or will the algo just crash?

I don't understand "In backtesting, they are calculated in bulk, once per year." Is this done before the backtest even starts? Or will the backtest pause every year as it runs, to do the computation? And is the computation limited to 5 minutes? Or could it take longer? In backtesting, is there a way to turn off the pre-computation, to verify that the algo will run live?

Nice addition. At some point it would be nice to have it working asynchronously, like sockets do, so we are not limited by this 50-second time-out. It could keep a queue and check whether there is data; if not, just move on. The factor could "miss" some bars while computing.

This time-out is kinda dangerous.

Hi,

I want to screen stocks for pair trading based on 15 minute interval data. Does pipeline only support daily prices or can it handle minute data as well?

Best regards,
Pravin

Hi Pravin - pipeline is daily only.

Hi Fawce,

Why is it daily only? And do you have plans to expand it to minute bars? As I commented above, it would seem to be a matter of setting a limit on the number of data points, rather than daily versus minutely.

Grant

Hi, great work guys !

Is it possible to combine normalized factors (by normalized I mean take the value and normalize across all securities in my universe to between 0 and 1)?

For example, say I have a weighting system that says weight = (0.3*volatilityFactor) + (0.7*fundamentalFactor). Or does the rank function already normalize?

Thanks

heh, so I cloned it to see how it works, and rank is literally the rank in the universe. So essentially it is the normalized value. Cool.
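
A sketch of that weighted combination, assuming factor arithmetic composes as described in the docs (volatility_factor and fundamental_factor are hypothetical names):

# Rank-normalize each factor, then combine with fixed weights.
combined = 0.3 * volatility_factor.rank() + 0.7 * fundamental_factor.rank()
pipe.add(combined, 'combined_rank')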

How could one get different time horizons for the fundamentals data?

As far as I know, the fundamentals data always pertains to the most recent quarter.

For example, how can I write a factor that computes the Total Revenue over the last 12 months?

My first thought was the following approach (assuming a quarter is composed of 63 trading days), but I'm not sure it's very reliable.

class RevenueTTM(CustomFactor):
    inputs = [morningstar.income_statement.total_revenue]

    # 252 trading days in a year
    window_length = 252

    def compute(self, today, assets, out, revenue):
        # Assuming a quarter is 63 trading days, sample one day per quarter.
        out[:] = revenue[62] + revenue[125] + revenue[188] + revenue[251]

The main risk is summing the same quarter twice.
There must be a better way! Any suggestions?

Hi Costantino,

You're definitely on the right track!

Take a look at the algo I posted here: https://www.quantopian.com/posts/fundamental-history-based-algo#5616e27ea8e03d7580000155.

This algo compares the EPS from the last quarter to the current one. You could apply very similar logic in your algo to get the 4 most recent unique values. This should take care of the edge cases you are worried about!


Thanks Jamie!

It's quite a hack, but certainly a more robust solution to the problem... I hope the API will soon handle such cases more easily :-)

I'll post my algorithm for the TTM, when finished.

@Grant

The main reason that Pipelines don't support minutely data is that we depend on pre-fetching data for reasonable performance during backtesting, and the memory overhead of doing that pre-fetching is considerably worse when working with minutely data.

When you run a Pipeline that uses a 90-day trailing window, we don't just query the last 90 days every day, since that would be glacially slow because you'd spend almost all of your backtest time doing disk/network IO. Instead, what we do is load all the data we think we'll need to run your pipeline for the next ~252 days (the number of trading days in a year). We use that cache to pre-compute your entire pipeline for a year and then feed the results to your backtest on the dates when they would become available. This makes rank() computations in particular much faster, because we can pre-compute ranks for an entire year in Cython on one big block of contiguous memory.

In the simplest possible case of window_length=1, we load data in ~yearly blocks, and an optimistic lower bound on the memory overhead of a Pipeline is about 16 MB per requested column:

In [1]: from operator import mul

In [2]: reduce(mul, [  
252, # days  
8000, # assets per day  
8, # bytes per data point  
])
Out[2]: 16128000

In [3]: from humanize import naturalsize

In [4]: naturalsize(_2)  
Out[4]: '16.1 MB'  

This means that we incur at least 16 MB of memory overhead per column when operating on daily data. In practice the actual number is probably closer to 20 or 30, but 16 is a reasonable lower bound. These values are well within the tolerance of acceptable overhead on top of an algorithm writer's business logic.

Suppose now that Pipelines supported minutely data. If we wanted to get the previous day's minutes in our pipelines each day, we'd have a window_length of 390 (the number of market minutes each day). Notice, however, that there's no overlap in the trailing windows we see each day: every day requires a totally independent set of 390 bars. This means that our lower bound calculation becomes:

In [7]: reduce(mul, [  
   ...: 252, # days of pre-cached data  
   ...: 390, # rows of data per day  
   ...: 8000, # assets per day  
   ...: 8, # bytes per data point  
   ...: ])  
Out[7]: 6289920000

In [8]: naturalsize(_7)  
Out[8]: '6.3 GB'  

6.3 GB per column is well above what we can support, and we hit that just querying a day's worth of minutely data. (Also note that this means we're doing 6.3 GB of network or disk IO per backtest-year, independently of any precomputation strategies. That will be slow even if we're not paging 3/4 of our RAM to disk because we're almost out of memory.)

The obvious response here is to do less precomputation when working with minutely data. It's possible that that could be made to work, but at that point you're just querying for a trailing minutely window every day, in which case you might as well just use history().

Speaking more philosophically, there will always be tension between wanting to look at data for lots of assets and wanting to look at data with high resolution. In the presence of CPU and memory constraints, you can either work with data for lots of assets at low resolution, or you can look at data for fewer assets at higher resolution. Up until now on Quantopian, we've essentially forced our users to choose "look at data for (comparatively) few assets at high resolution". The Pipeline API, in my mind, is essentially about providing an option to access the other end of the spectrum. Note that you don't have to choose just one of these options for your entire algorithm: you can (and should!) use the Pipeline API to do a wide screen that identifies assets that you want to investigate more closely, and then use history to look at those particular assets more closely. There's a bunch of exciting work happening now on the Zipline core to make this pattern more efficient.
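
A sketch of that screen-then-zoom pattern with the APIs discussed in this thread:

def before_trading_start(context, data):
    # Wide screen: daily-resolution pipeline output for the whole market.
    results = pipeline_output('my_pipeline')
    context.selected = results.index
    update_universe(context.selected)  # narrow minutely data to these assets

def handle_data(context, data):
    # Zoom in: minute-resolution history for just the screened assets.
    prices = history(30, '1m', 'price')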

Hi, I'm just getting started with Quantopian, and am trying to get a very minimal Pipeline algorithm to work, but it doesn't appear to ever order anything. I'm sure I'm missing something basic; can someone help me out?

Thanks, this is a great site, and I love the Pipeline structure.

# Trying to get the minimum Pipeline algorithm to run.  
# This example uses pipeline to set its universe daily.

# Import the libraries we will use here  
from quantopian.algorithm import attach_pipeline, pipeline_output  
from quantopian.pipeline import Pipeline  
from quantopian.pipeline.data.builtin import USEquityPricing  
from quantopian.pipeline.factors import SimpleMovingAverage  
import math

# The initialize function is the place to create your pipeline and set trading  
# conditions such as commission and slippage.  
def initialize(context):  
    # Set execution cost assumptions. For live trading with Interactive Brokers  
    # we will assume a $1.00 minimum per trade fee, with a per share cost of $0.0075.  
    set_commission(commission.PerShare(cost=0.0075, min_trade_cost=1.00))  
    # Set market impact assumptions. We limit the simulation to trade up to 2.5%  
    # of the traded volume for any one minute, and  our price impact constant is 0.1.  
    set_slippage(slippage.VolumeShareSlippage(volume_limit=0.025, price_impact=0.10))  
    # Rebalance every Monday (or the first trading day if it's a holiday).  
    # At 11AM ET, which is 1 hour and 30 minutes after market open.  
    schedule_function(rebalance,  
                      date_rules.week_start(days_offset=0),  
                      time_rules.market_open(hours = 1, minutes = 30))  
    # Create, register and name a pipeline.  
    pipe = Pipeline()  
    attach_pipeline(pipe, 'min_pipeline_example')  
    # Construct Factors.  
    sma_10 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=10)  
    sma_30 = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=30)

    # Construct a Filter.  
    prices_under_5 = (sma_10 < 5)

    # Register outputs.  
    pipe.add(sma_10, 'sma_10')  
    pipe.add(sma_30, 'sma_30')

    # Remove rows for which the Filter returns False.  
    pipe.set_screen(prices_under_5)

# Called every day before market open. This is where we access the securities  
# that made it through the pipeline.  
def before_trading_start(context, data):  
    # Pipeline_output returns the constructed dataframe.  
    context.output = pipeline_output('min_pipeline_example')  
    context.long_list = context.output.sort(['sma_30'], ascending=False).iloc[:100]  
    # Update our universe to contain the securities that we would like to long.  
    update_universe(context.long_list.index)  
# This rebalancing is called according to our schedule_function settings.  
def rebalance(context,data):  
    # Set the allocations to even weights in each portfolio.  
    weight = 1 / len(context.long_list)  
    # For each security in our universe, order positions according  
    # to our context.long_list, and sell all previously  
    # held positions not in the list.  
    for stock in context.long_list.index:  
        if stock in data:  
            order_target_percent(stock, weight)  
    for stock in context.portfolio.positions.iterkeys():  
        if stock not in context.long_list.index:  
            order_target(stock, 0)  
    # Log the orders each week.  
    log.info("This week's orders: "+", ".join([long_.symbol for long_ in context.long_list.index]))  
# The handle_data function is run every bar.  
def handle_data(context,data):  
    # Record and plot the leverage of our portfolio over time.  
    record(leverage = context.account.leverage)  
    print "Long List"  
    log.info("\n" + str(context.long_list.sort(['sma_30'], ascending=True).head(10)))  

Never mind - looks like coercing the weight variable used in order_target_percent to a float type does the trick. Thanks anyway!

@Elizabeth

Looks like the issue is here:

weight = 1 / len(context.long_list)  

Integer division in Python 2 (which is what Quantopian runs on) floors toward negative infinity by default:

In [3]: 5 / 2  
Out[3]: 2  

This means that 1 / N == 0 for any integer N > 1, so you're always ordering with a weight of 0.

In Python 3, this behavior was changed to produce float values instead:

Python 3.4.3 (default, Jul 28 2015, 18:20:59)  
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.  
>>> 5 / 2  
2.5  

You can get this behavior in Python 2 by putting from __future__ import division at the top of a file:

Python 2.7.10 (default, Jun 30 2015, 15:30:23)  
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.  
>>> from __future__ import division  
>>> 5 / 2  
2.5  

If you don't want to disable this behavior globally for whatever reason, you can also divide by a float instead of an int, which will behave as you expect:

Python 2.7.10 (default, Jun 30 2015, 15:30:23)  
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.  
>>> 5.0 / 2  
2.5  
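
Applied to the algo above, the minimal fix is:

weight = 1.0 / len(context.long_list)  # float numerator forces true division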

Ha...you fixed the issue before I finished typing a reply.

Thanks anyway :) Good to know about it running on Python 2.

@ Scott,

Thanks for the explanation. So are you planning to make history() available in before_trading_start()? Also, am I mistaken, or does before_trading_start() not run on the first day of a backtest?

Grant

Is it explained somewhere how the factor computations deal with security start and end dates? I understand that "Assets will be masked out the day after their end date." as Fawce explained on https://www.quantopian.com/posts/dollar-volume-pipeline. But what about stocks that have their start dates within the window? How are they managed as the backtest runs? It sounds like they make it through to update_universe, but is it then up to the author of the CustomFactor to sort out how to deal with securities that don't fill the entire window?

Hi, I've been working with Quantopian for a long time now, and slowly improving my coding.

This pipeline is fantastic.
A basic question: say we want to use the screen to select the top 2000 stocks (an initial universe), then create a factor (6-month share price growth, for example), take the best 20% of these, and then rank this subset by another factor, say price-to-book.

What's the most efficient way to code these?

I can see from the examples how to achieve the first two steps easily, but I'm not sure how to use the output of this with the third.
Paul

Hi PaulB,
Here is an example that I think shows what you are trying to do. There are two key things to making this happen. The first is the percentile_between function, which allows you to easily find the top 20% of any factor. The second is the mask option, to mask out the stocks you don't care about.

I ranked the results by the pb_rank and then invested in the top 200 (reinvesting monthly) but you could do a lot with the information provided in the results.
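
A condensed sketch of that pattern (the factor definition and names are assumptions; percentile_between and mask are the two pieces highlighted above):

from quantopian.pipeline import CustomFactor, Pipeline
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.factors import AverageDollarVolume, Latest

class Growth6M(CustomFactor):
    # Share price today relative to ~6 months (126 trading days) ago.
    inputs = [USEquityPricing.close]
    window_length = 126

    def compute(self, today, assets, out, close):
        out[:] = close[-1] / close[0]

def make_pipeline():
    # Initial universe: the 2000 most liquid stocks by 30-day dollar volume.
    universe = AverageDollarVolume(window_length=30).rank(ascending=False) <= 2000
    # Best 20% of 6-month growth within that universe.
    top_growth = Growth6M(mask=universe).percentile_between(80, 100, mask=universe)
    # Rank the survivors by price-to-book.
    pb = Latest(inputs=[morningstar.valuation_ratios.pb_ratio])
    pipe = Pipeline()
    pipe.add(pb.rank(mask=top_growth), 'pb_rank')
    pipe.set_screen(top_growth)
    return pipe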

Let me know if you have any questions.

Thanks very much Karen. Very helpful!!

First, thank you for this Pipeline API, which opens the door for stock screening.

What else can we import from quantopian.pipeline.data.builtin?
I want to access the SPY closes, but couldn't using USEquityPricing[index].close, where index is the SPY sid.

I want to calculate the beta coefficient for each stock over the last X days, but I have to access the SPY closes first,
so I can compute the beta as the covariance of the returns of the stock and SPY divided by the variance of SPY.

To access the close price:

  • Add this to the pipeline in initialize (and make sure not to exclude SPY from the pipeline results when screening):

price = Latest(inputs=[USEquityPricing.close])  
pipe.add(price, 'close_price')  

  • In before_trading_start, do the following:

output = pipeline_output('pipeline')  
context.pipeline_price = output['close_price']  

  • In handle_data or before_trading_start, you can access the price like this:

context.pipeline_price[index]  

Please note that the close prices returned from Pipeline will be "as-traded" for that day (see this).

Luca,

As mentioned above, I want to access the SPY closes within the initialize function to calculate beta for stocks, not in handle_data.

Thank you

Adham,
Take a look at Scott Sanderson's first comment in this thread above ^^; he shows how to do this.

KR

I did. Although I couldn't understand the syntax he used, I tried it, and I got this message:
SecurityViolation: 0002 Security Violations(s): Accessing np.newaxis

@Adham

It looks like numpy.newaxis wasn't whitelisted in our security sandbox. I've updated the security definitions to allow newaxis, but it takes a few days for that change to propagate to the main site.

In the meantime, numpy.newaxis is actually just an alias of None, so you can use None anywhere that you'd use newaxis.

In [11]: from numpy import arange, newaxis

In [12]: x = arange(5)

In [13]: x  
Out[13]: array([0, 1, 2, 3, 4])

In [14]: x[:, newaxis]  
Out[14]:  
array([[0],  
       [1],  
       [2],  
       [3],  
       [4]])

In [15]: x[:, None]  
Out[15]:  
array([[0],  
       [1],  
       [2],  
       [3],  
       [4]])  

(See the numpy docs for details.)

Adham, if you have beta calc code, I'd love to see it. This was mine so far. For anyone interested, here's finviz beta (on the Tickers tab you could copy those tickers to quickly see, in current code, what effect a mask for low-beta stocks might have).

Sure, I will post it when I finish coding it
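
In the meantime, a hedged sketch of such a beta factor, computing cov(asset, SPY) / var(SPY) over a trailing window (the 60-day lookback is an assumption, and None stands in for np.newaxis per Scott's workaround above):

import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing

class Beta(CustomFactor):
    inputs = [USEquityPricing.close]
    window_length = 60  # assumed lookback

    def compute(self, today, assets, out, close):
        returns = close[1:] / close[:-1] - 1          # daily returns, (window - 1) x N
        spy = returns[:, assets.searchsorted(8554)]   # 8554 is SPY
        spy_dm = spy - spy.mean()                     # demeaned SPY returns
        asset_dm = returns - returns.mean(axis=0)     # demeaned asset returns
        # cov(asset, SPY) / var(SPY), columnwise.
        out[:] = (asset_dm * spy_dm[:, None]).mean(axis=0) / spy_dm.var()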

The code below returns the column of closes for the passed sid (SPY). How can I get the sid the pipeline is currently calculating, so I can get that sid's column of closes?

spy_column = close[:, assets.searchsorted(8554)]  
current_stock_column = close[:,assets.?] # what is the method ?  

As I recall, there is a general Quantopian rule that a backtest has to be repeatable (run 1 must return the same results as run 2, for the same capital and start/end dates). Would this rule out random sampling of securities using pipeline? Or is there a work-around by using a fixed random number generator seed? Just thought I'd ask in case there is a "gotcha" before I try it.

One thing I would like to see: a good way to use the pipeline on data taken at intervals greater than a day, e.g. every month for several years. I'm trying to build an algorithm using pipeline plus the long-term performance of certain assets, and I keep getting time-out errors.

Yes, I have the same timeout problems when I try to extract quarterly fundamental data.
More details in this post:
https://www.quantopian.com/posts/fundamental-history-based-algo#561fb3b1bf797603590006ac

+1 for Grant's question: contest rules regarding random sampling stocks.

Do I need to call set_universe before using pipeline to construct my own universe of securities?

One thing I'd really like to be able to do: calculate pipeline-style factors on manually chosen securities, for example, using the dividend-adjusted price history of bond ETFs to estimate the risk-free rate of return. In general, having dividend-adjusted prices seems like a big deal, and our access to that data should be as flexible as possible.

+100 to this. Dividend-adjusted prices are really important, but it's tough to use them only in Pipeline. I would really like to fit models on dividend-adjusted one-minute data.

Do we have an ETA for when it will be possible to use Pipeline in the contest? I've spent much of my free time in the past week working on a Pipeline-based algo that I'm quite excited about, and I'm itching to enter it in the contest.

Hi Christopher,
Glad to hear you have something you are excited about. We have people actively working on getting the pipeline into paper trading. I expect it's on the order of a few weeks away. Around that point you will be able to submit contest entries using it.

We are also working on dividend adjusted pricing throughout the backtester. That one is a little further out, but we are also actively working on it. (Research will come after that.)

I'd love more information about wanting to use the pipeline on manually chosen securities. I've heard this a few times, and was surprised. Is the key reason that you want to do ETFs vs companies? Or is there another reason why the pipeline filtering process doesn't work for you?

Thanks,
KR

Regarding timeout issues that a number of you have mentioned, we released some performance improvements this morning that should help. Please let me know if you continue to see issues.

Hi Karen,
so you confirm that for the next contest deadline on November 2, algorithms using Pipeline will not be accepted?

Okay, so I figured out that when creating custom factors, because "assets" is an array of IDs, it's easy to compare those IDs against a list of hand-picked IDs to create a factor that is equal to 1 if the ID is in the list, and 0 otherwise. Currently, my algo uses this to build a list of securities that's mostly selected by dollar-volume, but will always include SPY, SHY, and TLT (and will have certain other custom factors computed for those three ETFs, as well as the stocks selected by dollar-volume).
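
A sketch of that hand-picked-ID trick (the sids for SHY and TLT are assumptions; SPY's 8554 appears earlier in this thread):

import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data.builtin import USEquityPricing

class InHandPickedList(CustomFactor):
    # A data input is required, but its values are ignored.
    inputs = [USEquityPricing.close]
    window_length = 1
    hand_picked = np.array([8554, 23911, 23921])  # SPY, SHY (assumed), TLT (assumed)

    def compute(self, today, assets, out, close):
        # 1.0 if the asset's id is in the hand-picked list, 0.0 otherwise.
        out[:] = np.in1d(assets, self.hand_picked)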

Nicola,
Pipeline algos will be accepted in the contest as soon as they work in paper trading. We are working on that as fast as possible. (We want it too!)

Unfortunately I still get a Timeout Error when I try to extract single-quarter fundamental data.
Here I'm attaching the notebook for the algorithm and in the next post the backtest...

and here is the algo + backtest (please set RevenueTTM3 as the CustomFactor at line 122).

My goal is to extract the single quarterly entries from a fundamental data set with window_length = 252.
This is necessary, for example, if you need annual fundamental data or quarterly changes over the last 12 months.

Thanks for posting this algo, Costantino. We'll look into the timeout issue.

Hi Karen,

to reproduce the Timeout, do not forget to set RevenueTTM3 as the CustomFactor at line 122.

My goal is to get Trailing Twelve Months (TTM) fundamental data to create a custom factor for computing the Piotroski and Beneish scores.

If there is a more convenient way to get TTM data in a Pipeline factor, let me know... I also participated in this discussion
https://www.quantopian.com/posts/fundamental-history-based-algo#561fb3b1bf797603590006ac
and it seems there is at the moment no simpler way to do this.

Thanks!
Costantino

How does one filter out Chinese stocks using the pipeline?

I am looking to get something like:

  • Top 20% or top 20 from the pipeline
  • Bottom 20% or bottom 20
  • Between 40% and 75%, or between 100 and 150 in rank
  • 10% +/- or 20 items around a value x
  • etc.

In short: the top or bottom n% or m items, items between n% and m% or between two ranks, and a percentage or number of items around a given pivotal value.

Is it possible to add an example on how this can be done.

Also, how does one successively apply filters? E.g., apply filter 1, and on the resulting universe apply filter 2, followed by filter 3.

With the new APIs, is it possible to add better autocompletion and the ability to discover the whitelisted APIs from within the IDE?

The pipeline looks perfect to me.

My question is whether the pipeline API can be used in IB paper trading or even real-money trading. If not, what is the time frame in which we can expect it?

Hi Adam,

Currently, pipeline is supported in backtesting and Quantopian (zipline) paper trading. We are currently working on bringing it to IB paper and real money trading. There's no specific timeline but it's at the top of our list!

Karen, I initially thought pipeline overcomplex, but the more I use it, the more I appreciate it!

Anyway, I note that several of the built-in factors use EMA. I assume this is to reduce the computational burden in backtests: an EMA can be calculated from one day's history and one multiplication, whereas an SMA requires the entire window_length of history and multiplications. Is there a special implementation trick you've used for this, or do I just use a window_length of 1?

EDIT: specifically, I am trying to implement ATR using the conventional Wilder's EMA. At any point in time, you need all the historic high/low/close prices to calculate the next ATR, but the shortcut is to use the previous day's ATR and today's high/low/close. Is it possible to get yesterday's ATR in a CustomFactor?

Hi Karen,

I think that my query may be similar or at least related to Dan's above. I am trying to build a custom factor using the Williams Percent R method from talib. Here is my code:

from talib import WILLR   

# Create custom factor to calculate Williams %R  
class Williams(CustomFactor):  
    # Pre-declare inputs and window_length  
    inputs = [USEquityPricing.close, USEquityPricing.high, USEquityPricing.low]  
    window_length = 10  
    # Compute value

    def compute(self, today, assets, out, close, highs, lows):  
        #calculate Williams percent R for the stock  
        out[:] = WILLR(highs,lows,close, timeperiod=10)  

However, when I try to run this I get the following error: AssertionError: high has wrong dimensions
There was a runtime error on line 101.

Any idea where I'm going wrong?

I think WILLR probably wants a float, rather than a vector. You have to do some looping or a map operation. If you search the community for ATR you will see something similar.

Or it may be it wants a time series for one ticker, but you are passing a time series for multiple tickers.

Thanks Dan - I think it's the second thing you mention - am working on a solution now, will post if I nail it...

talib's function signatures are a bit awkward to use in Pipeline right now. There are three major issues with them that I'm aware of:

  1. TALib's functions assume that they're receiving all the data at once, and they perform rolling computations on their inputs.
  2. TALib's functions only work on one asset's worth of data at a time.
  3. Many TALib functions don't robustly handle NaNs. Since Pipeline often provides columns with leading NaNs (e.g. for a company that's only existed for 5 days when you've requested a 10-day window), you have to manually ignore columns containing NaNs.

As a concrete example, the WILLR function mentioned here has essentially the following signature/semantics:

# The actual docstring for WILLR is much less informative than this :(.  
def WILLR(high, low, close, timeperiod=14):  
    """  
    Parameters  
    ----------  
    high : ndarray[float]  
        1D array of historical highs for a single asset.  
    low : ndarray[float]  
        1D array of historical lows for a single asset.  
        Must be the same length as high.  
    close : ndarray[float]  
        1D array of historical closes for a single asset.  
        Must be the same length as high.  
    timeperiod : int, optional  
        Lookback window over which to compute the transformation.  
        Default value is 14 periods.

    Returns  
    -------  
    out : ndarray[float]  
        A 1D array of the same length as the input arrays.

        The first `timeperiod` entries will be NaN.  The remaining entries  
        contain the result of computing William's %R on a trailing window of  
        `timeperiod` entries.  
    """  

If we want to use this in Pipeline, we have to jump through a few hoops:

  1. We have to call it roughly 8000 times, once for each asset column.
  2. We have to call it on a window of length timeperiod and then extract the last entry. (I haven't looked much into the underlying TA-Lib implementations to see whether this is a significant performance hit.)
  3. We have to manually screen out columns containing NaNs if the TALib function we want to use doesn't support them effectively.

I've attached a notebook with a discussion of the issues and a full example showing how I'd do this in Pipeline.

Generally the calculations for indicators are pretty simple, and can probably be implemented in Python using NumPy using vector and matrix operations. This would probably be 100x faster than looping through thousands of TALib calls.

Generally the calculations for indicators are pretty simple, and can probably be implemented in Python using NumPy using vector and matrix operations. This would probably be 100x faster than looping through thousands of TALib calls.

This is definitely true, though the flipside of this is that most Pipelines using anything but USEquityPricing data are currently bottlenecked on IO. In general, it's a bad performance practice to be iterating over lots of numpy arrays in pure python, but there's some value in being able to quickly pull in a TA-Lib function to see if it's even worth rewriting for performance.

@Scott - I was hitting up against the exact problems you mentioned, so thank you for providing some clarity as to why
@Dan - I went with your suggestion and started from scratch using vector and matrix operations and came up with this:

import numpy as np

class Williams(CustomFactor):
    # Pre-declare inputs and window_length
    inputs = [USEquityPricing.close, USEquityPricing.high, USEquityPricing.low]
    window_length = 10

    def compute(self, today, assets, out, close, high, low):
        # Calculate Williams %R for each stock.
        highs = np.amax(high, axis=0)  # highest high over the window
        lows = np.amin(low, axis=0)    # lowest low over the window
        out[:] = (highs - close[-1]) / (highs - lows) * -100

I haven't actually checked the numbers it's producing against a second data source, but just eyeballing them they at least look like Williams %R numbers.
Finally @Dan - any thoughts on the speed of this approach? It should be pretty quick, unless np.amax and np.amin are slowing things down?

Looks pretty pythonic to me :)

Is it possible to generate OHLCV daily bars from the minutely data, and then feed them to pipeline? For example, say I wanted to smooth the minutely price data and then assign OHLC daily values based on the smoothing results, then feed them to pipeline?

Is it possible to select Dow Jones Industrial Stock with this? For instance if I wanted to pick 5 stocks out of the 30 stocks based on some factors.

Is it still the case that only one pipeline is allowed per algorithm?

Is functionality to support multiple Pipelines in a single algorithm still in the roadmap?

Multiple pipelines possibility in the future being discussed here: https://www.quantopian.com/posts/multiple-pipelines

When learning, it can be useful to know where the edges are. This is about the smallest viable working pipeline algo for scheduled ordering, I think. The screen and the import of Q500US are not required; however, without them it would take a long time to process all 8000+ stocks.

from quantopian.algorithm import attach_pipeline, pipeline_output  
from quantopian.pipeline import Pipeline  
from quantopian.pipeline.filters import Q500US

def initialize(context):  
    schedule_function(trade, date_rules.every_day(), time_rules.market_open())  
    attach_pipeline(Pipeline(screen=Q500US()), 'z')

def before_trading_start(context, data):  
    context.stocks = pipeline_output('z')

def trade(context, data):  
    for s in context.stocks.index:  
        order(s, 1)