universe selection

Hi,

A member sent us advice via the private feedback link, but said we could share an anonymized version with the community. I'm going to post our correspondence here as individual messages.

enjoy,
fawce

==================original comment================

The setup is pretty good as it stands.

There is one feature missing that is a showstopper for serious quantitative work: universe selection. Of course one can define an algorithm that does well on momentumish stocks, run it on AAPL and GLD, and it will make a ton of money. The problem is knowing that AAPL and GLD were momentumish before they became momentumish. This is often the most insidious form of lookahead bias. For serious backtests, the universe is usually selected on some quantitative basis. For example, every day you can update your universe to be everything on the NYSE, Nasdaq, and AMEX with a market cap above the median, or some such. An alternative is to use all members of an index along with the changes in the composition of the index; this option may be a bit expensive, because index composition data can be costly due to demand from index-arb desks. One may also want to filter on whether a security is a stock or an ETF, an ADR or a domestic stock, etc. For a momentumish strategy, you can of course further filter the universe every day based on some way of detecting whether a stock is momentumish or not.

There are a few other issues. What do you do if a security doesn't trade for 5 minutes, yet an order is triggered at the end of every minute? Do you leave the orders unfilled, or fill them somewhere? These sorts of issues can also lead to misleadingly good backtests.

Fantastic start.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

14 responses

Hi,

The issue of screening has come up several times in the two weeks since our launch, and we've been debating how to attack the problem. Your message is such a clear statement of the need for and purpose of universe selection that it is really helping us think through a solution. A simple start would be to introduce a select_universe method, available inside initialize, which would take arguments defining the criteria for the universe. At the moment we only have market/pricing information, so we'd have to stick with filters on volume and price. I was hesitant to do something with such limited data, but now we all realize that we need to acknowledge the need for universe selection and establish at least the direction we'll take in solving it.

Regarding orders, we only fill orders when there is a trade. If an order is placed and can't be filled, we hold it in an open_orders table until the close. On each trade we consult the open_orders table to see if there is an order for the security in the trade. If there is, we use a very simple model to calculate a fill for the order. We allow partial fills, in which case the filled portion is printed as a transaction and the unfilled portion is placed back on the open_orders table.
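The bookkeeping described above can be sketched roughly like this; the class and method names are hypothetical and not the actual zipline internals:

```python
class OpenOrders:
    """Holds unfilled orders per security until a trade arrives."""

    def __init__(self):
        self.by_sid = {}  # sid -> list of remaining signed share amounts

    def place(self, sid, amount):
        """Park an order (positive = buy, negative = sell) on the table."""
        self.by_sid.setdefault(sid, []).append(amount)

    def fill_from_trade(self, sid, trade_volume):
        """On each trade, consult the table and fill what we can.

        Returns the signed amounts printed as transactions; any unfilled
        remainder stays on the table (a partial fill).
        """
        transactions = []
        remaining_volume = trade_volume
        still_open = []
        for amount in self.by_sid.get(sid, []):
            if remaining_volume <= 0:
                still_open.append(amount)
                continue
            fill = min(abs(amount), remaining_volume)
            remaining_volume -= fill
            transactions.append(fill if amount > 0 else -fill)
            leftover = abs(amount) - fill
            if leftover:
                still_open.append(leftover if amount > 0 else -leftover)
        self.by_sid[sid] = still_open
        return transactions
```

The real simulator's fill model is of course more involved; this only illustrates the "consult the table on each trade, print the filled portion, keep the rest open" flow.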

I really can't thank you enough for such a cogent explanation of the screening issue. I love the line "This is often the most insidious form of lookahead bias". That will definitely stick with us.

Thank you again for the feedback and encouragement, we will keep at it!

thanks,
fawce

If you only have price/volume information, may I suggest the following. For each stock, over the past t days, calculate the n'th quantile of volume. Call the n'th quantile of volume for stock s its v_value. Then, in cross-section on the trading day, calculate the m'th quantile of the v_values of all stocks in the cross-section, and filter out all stocks whose v_value is below that m'th quantile value. Allow people to specify the lookback number of days and the quantiles desired, and allow this to be done with dollarvolume, i.e. sum(volume*price) for a day, and with log(dollarvolume). For filtering, share volume is non-ideal given differing prices.
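A sketch of that two-stage quantile screen, using numpy; the function name and parameter names here are purely illustrative:

```python
import numpy as np

def select_universe(volume_history, n=0.5, m=0.2):
    """Two-stage quantile screen (all names hypothetical).

    volume_history: dict mapping stock -> array of the past t days'
    dollar volumes.  Each stock's v_value is the n'th quantile of its
    own history; we then drop every stock whose v_value falls below
    the m'th quantile of all v_values in the cross-section.
    """
    v_values = {s: np.quantile(vols, n) for s, vols in volume_history.items()}
    cutoff = np.quantile(list(v_values.values()), m)
    return {s for s, v in v_values.items() if v >= cutoff}
```

Using a per-stock quantile (rather than a mean) for the v_value is what makes the screen robust to single outlier days, as the next paragraph explains.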

Average, minimum, or maximum isn't great either. There have been days when S&P 500 stocks didn't trade at all due to news-related halts, and days when tiny caps traded with huge volume. An average is too easily skewed by a single high-volume number. You will also need to consider how to prevent hysteresis at the edges, because a lot of stocks on the bubble will come into the universe and leave every day. Possible solutions to the hysteresis problem are to require a stock that leaves to stay out for at least 20 days, or to require a stock that enters to stay in for at least 20 days, or some such.
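The "stay out for at least 20 days" rule might be sketched like this (class name and API invented for illustration):

```python
class StickyUniverse:
    """Guard against edge churn: a stock that drops out of the raw
    screen must stay out for `cooldown` days before re-entering."""

    def __init__(self, cooldown=20):
        self.cooldown = cooldown
        self.members = set()
        self.last_exit = {}  # stock -> day index when it last left
        self.day = 0

    def update(self, screened):
        """screened: the set of stocks passing today's raw quantile screen."""
        self.day += 1
        # drop current members that failed today's screen
        for stock in list(self.members):
            if stock not in screened:
                self.members.discard(stock)
                self.last_exit[stock] = self.day
        # admit screened stocks, unless they left too recently
        for stock in screened:
            left = self.last_exit.get(stock)
            if left is None or self.day - left >= self.cooldown:
                self.members.add(stock)
        return set(self.members)
```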

A lot of serious quant work with daily or periodic (say 30-minute or weekly) volumes uses the transform log(max(1+epsilon), dollarvolume) for volume, because log volumes work best, and fast ways of fitting the multivariate distribution of volumes (volumes across stocks are correlated) to distributions that make sense blow up when log(volume) < 0. Epsilon is generally chosen to be large enough to ensure numerical stability. You may want to consider allowing that transform, since a lot of quants have a better feel for where they want to set filter limits in those units, because that's the number they look at all the time.

Thanks and best regards. Glad I could be of help.

Thank you again! The recommended approach makes a ton of sense, and it preserves a reasonable amount of control for the member doing the screening. I'm especially interested in the problem of stocks coming and going too frequently.

For the epsilon transform, I think there might be a missing parenthesis in your original and I want to make sure I have it right:

log(max( 1+epsilon, dollarvolume))

The idea would be to transform all stocks' dollarvolumes per day this way, and then perform the screening based on the quantile method you outlined, but applied to the output of the transform?
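Assuming that parenthesization is the intended one, the transform might look like this (function name and the particular epsilon are illustrative):

```python
import math

def log_dollar_volume(price, volume, epsilon=1e-6):
    """log(max(1 + epsilon, dollarvolume)), as read above.

    Flooring at 1 + epsilon keeps the output nonnegative and
    numerically stable when a stock barely trades (dollarvolume
    near zero), which is exactly the log(volume) < 0 blow-up the
    original comment warns about."""
    return math.log(max(1.0 + epsilon, price * volume))
```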

Are there any books or good online sources to get more familiar with what you guys are discussing?

fawce - do you allow cancellation of the unfilled portion of orders after the first minute?

@Vishal, unfortunately no we do not (yet). Cancellation combined with a way to check on currently open orders would be an awesome extension to zipline.
I think the enhancements needed to orders are:
- cancellation
- open order checking
- limit orders
- different time-in-force values: good-til-canceled, fill-or-kill, etc.
- any others you'd like to see?
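To make the first two items on that list concrete, here is a rough sketch of what placing, checking, and cancelling open orders could look like; this is a hypothetical API, not zipline's actual one:

```python
import itertools

class OrderBook:
    """Hypothetical sketch: open-order checking plus cancellation."""

    _ids = itertools.count(1)  # monotonically increasing order ids

    def __init__(self):
        self.open = {}  # order_id -> (sid, amount)

    def place(self, sid, amount):
        order_id = next(self._ids)
        self.open[order_id] = (sid, amount)
        return order_id

    def open_orders(self, sid=None):
        """All open orders, optionally filtered to one security."""
        return {oid: o for oid, o in self.open.items()
                if sid is None or o[0] == sid}

    def cancel(self, order_id):
        """Remove an open order; returns it, or None if already filled."""
        return self.open.pop(order_id, None)
```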

Based on the feedback so far, we are focused on providing daily bars as the next major enhancement. Daily bars will make broad screening for universe selection feasible - running hundreds or thousands of evaluations over minute data would be very computationally intense. Our minute history is approximately 7.5 billion records for about 17k securities, so daily pricing should be on the order of only 20 million records. That should be manageable for screening.

@Peter, I was so happy to get the advice above because it is from a real practitioner. I think statistical texts will help with the concepts, but I don't know if there are many places where you can find this kind of practical advice.

What about a way to trade just the stocks that were part of an index at a certain time? For instance, say you want to define your trading universe as the Nasdaq 100 and backtest with the makeup of the index changing dynamically over the life of the backtest.

@Karl, it is an excellent approach, and I think it's what the original post meant by "An alternative is to use all members of an index along with changes in the composition of the index." Suppose we provide just such a filter, and we stream the Nasdaq 100 through your handle_data method. We have a few questions:

  • Does the API need to change at all?
  • Would you want any additional methods on the data parameter to handle_data?
  • Would you want a handle_filter_changed or handle_index_changed method that we call as members are added/removed?

I am thinking that this could be as simple as defining a function that takes a date as an argument and returns an array of the members of the Nasdaq 100 index as of that date.

Once you had that array, you could iterate through it and perform whatever further functions you wanted on its members.

A drawback of this approach is that it would be expensive in terms of the number of calls being made.
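That date-to-members function might be sketched like this; the membership table below is invented purely for illustration and is not actual Nasdaq-100 composition history (which, as noted above, typically comes from a paid vendor):

```python
from datetime import date

# Hypothetical composition history: index -> list of (effective_date, members),
# sorted by effective_date.  Tickers and dates are illustrative only.
INDEX_HISTORY = {
    'NASDAQ-100': [
        (date(2012, 1, 1), ['AAAA', 'BBBB', 'CCCC']),
        (date(2012, 7, 1), ['AAAA', 'BBBB', 'DDDD']),  # CCCC out, DDDD in
    ],
}

def index_members(index_name, as_of):
    """Return the members of the index as of the given date."""
    members = []
    for effective, composition in INDEX_HISTORY[index_name]:
        if effective <= as_of:
            members = composition
    return list(members)
```

Each backtest day you would call index_members with the simulation date, which is where the "expensive in terms of calls" concern comes in unless the lookups are cached.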

At first blush, I think of universe selection as essentially the same problem as database selection, i.e., selecting some subset of the data based on stated criteria. As such, my instinct, if I were building the api, would be to use the same sort of interface that is used in modern ORMs. It is a well-defined, well-tested, and repeatedly refined interface that seems to work passably well over a variety of conditions and systems. Somehow, in the initialization function, you state the criteria, and the time series data for the stocks that meet the criteria are pumped through the handle_data method. E.g., as a user, I'd like to say

Universe.filter(index='NASDAQ', cap__gt=whatever, cap__lt=whatever)

or something like that.

@Joe: yeah, there are some websites with really pretty interfaces that let you select stocks in that manner. When we were planning universe selection we definitely contemplated whether we should try to recreate this, and turning it into an API as you propose is also definitely appealing.
However, I see two problems with this approach:

1) Computational burden: running this universe selection over the whole DB (i.e. ~8k unique sids) for every event/day of every backtest would require a lot of processing.
2) Lack of flexibility: there is an infinite number of criteria one could choose to select a universe. We might end up adding more and more filters, when really we just want to give our users the option of doing that themselves.

Having said that, I think that with our current approach you can still be very flexible and operate on a large universe. After you have selected a universe of, e.g., a couple hundred sids (to make it more manageable) by their trading volume, you can add a batch_transform that receives a pandas DataFrame of all those sids and their price history over, e.g., the last 100 days. There, you can let your creativity run wild and code any selection criteria you'd like.
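As one illustration of the kind of criterion you could code in such a batch transform (the function name and the ranking rule are made up for the example):

```python
import pandas as pd

def my_selection(prices, top_n=5):
    """Illustrative batch-transform criterion: rank the universe by
    trailing return over the supplied price history and keep the
    top_n sids.

    prices: DataFrame indexed by day, one column per sid, as the
    batch transform would hand it to you.
    """
    trailing_return = prices.iloc[-1] / prices.iloc[0] - 1.0
    return list(trailing_return.nlargest(top_n).index)
```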

As a more general point, it is a constant question for us how to balance ease-of-use against flexibility. The way I see it, there are quite a few websites that do some basic things without requiring you to code (but what if you want to do something the designers have not thought about?), but none that give you the flexibility and freedom Quantopian offers.


Well, I agree there is a tradeoff, and as a long-time coder I of course prefer flexibility. And I wasn't actually assuming all the filters would be provided, merely that there would be a mechanism that could be extended to let you provide your own on top of the provided ones.

I'm still a bit confused by what you mean by "After you have selected a universe of, e.g., a couple hundred sids (to make it more manageable) by their trading volume". Is trading volume the only way, or was that just an example of one of the ways? If, for example, I want to run a backtest that operates on the top 10 NASDAQ stocks by market weight, would that be doable?

Hi @Joe,

Thanks for all your input! This is a tricky area of api design given the tension between ease and flexibility, so I really appreciate the fresh ideas. I do like the concept of a declarative style interface, like you would find in an ORM. Let me clarify our current thinking a bit, and come back to that idea.

In our next release, we'll be introducing two new features: universe selection and batch transforms. Together they are designed to let your algorithm choose the securities to include in the portfolio at runtime, rather than forcing you to decide at coding time. The idea is to split the problem in two, optimizing one part for simplicity (universe selection) and the other for flexibility (batch transforms).

Universe Selection

Our historical database has pricing for approximately 15,000 instruments, ranging from esoteric ETFs to large-cap household names to ADRs to penny stocks. Not all of these instruments are relevant for every algorithm; it would waste your time as a user to stream every instrument through every algo. We want to provide a way to very easily winnow the securities fed to your algorithm down to a list that is actually applicable to it.

The selection process is, as you've suggested, pretty data-oriented. You want to be able to choose securities based on properties like market cap, liquidity, or fundamentals. At the moment we have just trade history, so our first implementation will only provide a "dollar volume" screen as described above: you will be able to specify the percentile range for v_value rankings. However, we will try to define the API in such a way as to allow for future expansion.

The current plan is to provide a magic function, set_universe, which takes an instance of a Universe subclass (Universe being a new class). Each subclass will accept constructor parameters that define the universe. Since the universe is a set of criteria, we dynamically update the membership as the simulation runs through time; the current plan is to refresh the universe every quarter during the simulation. In code, choosing a universe looks like this:

def initialize(context):  
    # choose the top two percentile of DollarValue rankings  
    set_universe(DollarValueUniverse(98.0, 99.0))  

I'd love your feedback on the DollarValue universe, the api, and the general direction!

Batch Transforms

The universe gives you the ability to define a large number of securities by criteria, without requiring you to explicitly reference each sid. But it is extremely unlikely that you will want to invest in every security in your universe; you will want to explore the whole universe and find pairs, related groups, or special individual stocks to bet on. To that end, we are adding a new facility that lets you work with a cross-sectional history. Instead of getting one day's prices or one minute's prices, you can get the last N days of prices/volumes/etc. All the data in the trailing window, for all the securities in your universe, will be packaged into a pandas data panel, and you can then work with prices or volumes as separate dataframes. The code will look like this:

from datetime import timedelta  # needed for the delta argument below

def initialize(context):  
    # construct a new batch transform, tuck it into the context:  
    # call the function my_transform every two days, keeping 7 days of history in the data panel  
    context.a_batch_transform = BatchTransform(func=my_transform, refresh_period=2, delta=timedelta(days=7))

def handle_data(data, context):  
    # update the transform with the current data events  
    transform_results = context.a_batch_transform.handle_data(data)  
    # place orders, rebalance portfolio, etc.  
    order(transform_results.buys[0], 10)

def my_transform(data_panel):  
    prices_dataframe = data_panel['price']  
    # do work on prices...  
    # return the data your algorithm will need to make buy/sell choices  
    return transform_results