Announcement: Research API Additions

Research API Improvements

Today we're announcing two quality-of-life improvements to the Quantopian Research API: simpler APIs for loading price/volume data, and new options for tuning memory usage of complex pipelines. We're also announcing a change to quantopian.research.experimental.history to bring it closer in line with the data.history method in the backtester.

Simpler APIs for Prices, Volumes, and Returns

We've added five new functions to the quantopian.research module that make it easier to perform common tasks involving price and volume data.

The new functions have the following names and signatures:

  • prices(assets, start, end, frequency='daily', price_field='price', start_offset=0)
  • log_prices(assets, start, end, frequency='daily', price_field='price', start_offset=0)
  • returns(assets, start, end, periods=1, frequency='daily', price_field='price')
  • log_returns(assets, start, end, periods=1, frequency='daily', price_field='price')
  • volumes(assets, start, end, frequency='daily', start_offset=0)

In general, each of these functions fetches data for one or more assets over a specified period, returning a Series if one asset was provided, and returning a DataFrame if multiple assets were provided. You can find complete documentation for all of these functions (along with the rest of the Research API) in the updated Research API Documentation.
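
For example (run in a research notebook; the dates are arbitrary):

# A single asset yields a pandas Series:
aapl = prices('AAPL', '2016-01-02', '2017-01-02')

# A list of assets yields a DataFrame with one column per asset:
tech = prices(['AAPL', 'MSFT'], '2016-01-02', '2017-01-02')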

The new API functions offer two benefits over the existing get_pricing function: better ergonomics, and start offsetting.

Better Ergonomics

One of the goals for the Research API is that it should be fast and simple to fetch whatever data you want to examine. The more time and energy you spend figuring out how to load the data you need, the less you have for doing your actual analysis.

Historically, the get_pricing function has been the primary interface for ad-hoc price and volume queries. Having a single-function interface to this data makes the API easier to remember (it's just the one function!), but it means that get_pricing needs a large number of parameters to support all the different ways that users need to fetch data.

We can't expect users to remember the name and order of all 8 arguments to get_pricing (I can't remember them, and I designed the function!), so get_pricing provides default values for almost all of its arguments. Having defaults for everything means that users only have to pass the parameters that they actually care about.

The problem with providing defaults for everything, however, is that it actually makes it harder to perform common tasks when doing so requires passing non-default values. I wrote the following description in the git commit message for this change in Quantopian's internal repository:

9 times out of 10 when I'm fetching pricing data I want close prices for one or more assets over some time period. The current API of get_pricing forces me to pass a bunch of additional parameters, and it has so many parameters that I can never remember the order, so I have to pass them all by keyword.

For example, to fetch recent daily close prices for AAPL with get_pricing, I would write something like:

data = get_pricing('AAPL', start_date='2016-01-02', end_date='2017-01-02', fields='price')  

The above code is functional and clear, but it's a mouthful to type, and I had to pull up get_pricing's docstring to remember the names of the parameters and the correct string for fields.

With the new API, I can get the same output with the following:

data = prices('AAPL', '2016-01-02', '2017-01-02')  

Much nicer!

Start Offsetting

One challenge that arises fairly regularly on Quantopian is what I call the "leading offset problem". This problem is exemplified as follows: I want to calculate rolling 5-day returns for AAPL for every trading day in January 2017. In order to compute what I want, I need to load daily close prices for the month of January and for the last five trading days of December 2016.

In general, whenever we want to compute a rolling N-period reduction from t[A] to t[B], we need to fetch data from t[A - N] to t[B].

Correctly calculating the start date for a get_pricing call when we need a leading buffer of data is tricky even in simple cases (the right way to do this is usually to use the trading calendar arithmetic functions exported by the zipline.utils.calendars module), but it gets downright hairy when you want to support different asset classes and different data frequencies.
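
For reference, here's a minimal sketch of that calendar arithmetic, assuming a research notebook where zipline is importable (the date and calendar are arbitrary):

import pandas as pd
from zipline.utils.calendars import get_calendar

nyse = get_calendar('NYSE')
start = pd.Timestamp('2017-01-03', tz='UTC')

# sessions_window with a negative count walks backward through the calendar,
# skipping weekends and holidays; the first element of the result is the
# session five trading days before `start`.
buffered_start = nyse.sessions_window(start, -5)[0]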

As of this update, all of the Research API functions that fetch raw price/volume data now take an optional start_offset parameter that can be used to add a leading buffer of N periods to a query. For example, to load prices for January 2017 with an extra 5 days at the start, we can write:

prices('AAPL', '2017', '2017-02', start_offset=5)  

start_offset is used internally by the new returns and log_returns functions. We could directly fetch our 5-day rolling returns with a call like:

returns('AAPL', '2017', '2017-02', periods=5)  
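
Under the hood, that call is roughly equivalent to fetching offset prices and differencing them (a sketch, not the exact implementation):

px = prices('AAPL', '2017', '2017-02', start_offset=5)
manual = px.pct_change(5).iloc[5:]  # drop the leading buffer rows
# `manual` should match returns('AAPL', '2017', '2017-02', periods=5)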

New chunksize Parameter for run_pipeline

Users running complicated pipelines sometimes find that they hit memory limits when running over very long periods of time. In another post I've written at length about where and why Pipeline uses memory. At the end of that post, I gave the following advice:

The best way for you to reduce your high-water memory usage is to chunk up your run_pipeline calls into smaller increments and then concatenate (e.g. with pandas.concat) them together. For example, if you want to run a Pipeline with lots of graph terms over a 5 year period, you might break that up into five 1-year run_pipeline calls, or even ten 6-month calls. This reduces memory usage in two important ways:

  • Running over a shorter window straightforwardly translates to allocating fewer rows in the input/output buffers of each Pipeline term.
  • More subtly, running over a shorter window reduces the number of assets that are active during that window, which reduces the number of columns in the input/output buffers allocated to your Pipeline terms.
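
For reference, the manual chunking described there might look like this (a sketch; make_pipeline() stands in for whatever pipeline you're actually running):

import pandas as pd

chunks = [
    run_pipeline(make_pipeline(), start, end)
    for start, end in [('2013-01-01', '2013-12-31'),
                       ('2014-01-01', '2014-12-31'),
                       ('2015-01-01', '2015-12-31'),
                       ('2016-01-01', '2016-12-31'),
                       ('2017-01-01', '2017-12-31')]
]
result = pd.concat(chunks)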

The Pipeline API in research can now transparently perform this chunking for you. If a chunksize parameter is passed to run_pipeline, the range from start_date to end_date will automatically be broken up into blocks of chunksize days, each of which is run separately. The default behavior is still to run a single chunk.

In general, pipelines run with large chunksizes will consume more memory but run faster, and pipelines run with small chunksizes will consume less memory but run more slowly. A good rule of thumb is that you probably want a chunksize of at least 126 (half a trading year for equities), but smaller chunksizes might be desirable for extremely complex pipelines.
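
In practice that looks like the following (make_pipeline() again stands in for your own pipeline):

# The five-year range below is computed internally in ~126-session blocks
# instead of as one large allocation:
result = run_pipeline(make_pipeline(), '2013-01-01', '2018-01-01', chunksize=126)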

Updates to quantopian.research.experimental.history

The split between the Algorithm API and the Research API has been a historical source of confusion for many Quantopian users. Users often need to solve similar problems in research and in algorithms, but the two environments are different enough that a solution to a problem in an algorithm can end up looking very different from a solution to a similar problem in research. We've gotten better over time at building APIs that transfer cleanly between algorithms and research (the Pipeline and Optimize APIs, for example, were explicitly designed with this goal in mind), and we've also started revising older APIs to bring the two environments into closer alignment where possible.

When we released support for futures in the Research API, we added a new experimental history() function. history() is essentially a wrapper around the older get_pricing function, renamed to reflect the fact that it serves the same essential purpose as the data.history() method in the Algorithm API. One important difference between the Research API history and the Algorithm API history, however, was that the research version provided defaults for many parameters that the algorithm version did not. This made it easier to call the research history interactively, but it made the correspondence between the functions less obvious and made it harder to port code using history between the environments.

With the addition of the new convenience methods for fetching prices, volumes, and returns, we feel that making the two history functions as consistent as possible is more important than making the research version easier to type, so we've updated the research version to align more closely with the algorithm version.

The signatures of the two history functions are now:

# Algorithm API version
def history(assets, fields, bar_count, frequency):

# Research API version
def history(assets, fields, start, end, frequency, start_offset=0):

With this change, it's clearer that the only difference between the two history functions is in how they specify date ranges: the Algorithm API history fetches data for a trailing window that ends at the current algorithm time, while the Research API history fetches data for a period between a fixed start and end, optionally with an additional offset from the start.
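
A side-by-side sketch of the correspondence (window length and dates are arbitrary):

# Algorithm API: a trailing 10-bar window ending at the current simulation time.
window = data.history(assets, 'price', 10, '1d')

# Research API: the same kind of data, but for an explicit date range.
window = history(assets, 'price', '2017-01-03', '2017-01-17', 'daily')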


6 responses

Good news! I'm surprised that I'm the first to comment on this post!

Are Fundamentals.ev_to_ebitda1_year_growth and winsorize also new features?

By the way, is there any progress on the feature requests that I discussed with Jamie McCorriston in January?
1) Quarterly rebalance: would it be possible to add a date_rules.quarter_start() rule, or a filter on month_start, to schedule a function only in specific months? (For example, my algo rebalances quarterly in January, April, July, and October.)
2) Percentage rank: currently zipline.pipeline.factors.Rank assigns only the integer position number. Would it be possible to add a percentage rank, like pandas.DataFrame.rank with pct=True?
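
For clarity, the pandas behavior I mean:

import pandas as pd

s = pd.Series([10.0, 30.0, 20.0])
s.rank()          # 1.0, 3.0, 2.0 -- integer positions, like Rank
s.rank(pct=True)  # 0.333..., 1.0, 0.667 -- percentage rank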

Hi @Scott,

I'm curious if there's any plan to make these functions available in the IDE as well? Thanks.

  • prices(assets, start, end, frequency='daily', price_field='price', start_offset=0)
  • log_prices(assets, start, end, frequency='daily', price_field='price', start_offset=0)
  • returns(assets, start, end, periods=1, frequency='daily', price_field='price')
  • log_returns(assets, start, end, periods=1, frequency='daily', price_field='price')
  • volumes(assets, start, end, frequency='daily', start_offset=0)

Hi Joakim,

We don't have plans to add those functions to the IDE in the near term. However, if you're looking for daily values, there are built-in Pipeline factors that can get you what you need. I've attached a notebook that has an example for each of the 5 functions you listed above (the Pipeline can be run in a notebook or in the IDE).
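
For example, a pipeline along these lines covers the daily cases (a rough sketch, not the exact contents of the notebook):

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import USEquityPricing
from quantopian.pipeline.factors import Returns

pipe = Pipeline(columns={
    'price': USEquityPricing.close.latest,    # analogous to prices()
    'volume': USEquityPricing.volume.latest,  # analogous to volumes()
    'returns_1d': Returns(window_length=2),   # analogous to returns(periods=1)
})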

I hope this helps.


Ok, thanks Jamie!

Hi Joakim,

It probably doesn't make sense to port these helper functions verbatim from research to the backtester, because the research versions use start_date and end_date to specify dates, whereas all the price-retrieval functions in the backtester take a bar_count parameter to specify a number of observations from the current simulation time. The main reason for this difference is that we don't want it to be possible to fetch prices from the future in backtesting.

Here's what I'd probably use if I wanted to port these utilities to the backtester:

NOTE: I sketched these out relatively quickly this morning. I tested a few of them manually in the backtester, but I haven't verified them extensively. Buyer beware and all that... :)

from __future__ import print_function  
import numpy as np

def prices(data,  
           assets,  
           bar_count,  
           frequency='1d',  # replace with '1m' for minutely data.  
           price_field='price'):  
    return data.history(assets, price_field, bar_count, frequency)


def volumes(data, assets, bar_count, frequency='1d'):  
    return data.history(assets, 'volume', bar_count, frequency)


def log_prices(data,  
               assets,  
               bar_count,  
               frequency='1d',  
               price_field='price'):  
    df = prices(data, assets, bar_count, frequency=frequency, price_field=price_field)  
    return np.log(df)


def returns(data,  
            assets,  
            bar_count,  
            periods,  
            frequency='1d',  
            price_field='price'):

    # Fetch `periods` extra bars so that each of the `bar_count` returns
    # has a full lookback window, then drop the leading buffer.
    df = prices(data,
                assets,
                bar_count + periods,
                frequency=frequency,
                price_field=price_field)

    return df.pct_change(periods).iloc[periods:]


def log_returns(data,  
                assets,  
                bar_count,  
                periods,  
                frequency='1d',  
                price_field='price'):

    # Log returns are differences of log prices over `periods` bars.
    df = log_prices(data,
                    assets,
                    bar_count + periods,
                    frequency=frequency,
                    price_field=price_field)

    return df.diff(periods).iloc[periods:]
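
For example, from inside a scheduled function you might call them like this:

def rebalance(context, data):
    # 5-day rolling returns over the last 20 daily bars for AAPL.
    aapl_returns = returns(data, symbol('AAPL'), 20, periods=5)
    print(aapl_returns.tail())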

Hi @Scott,

Awesome! Thanks so much for taking the time to put this together. It would have taken me ages! I understand the disclaimer and the reason why they don't port directly to the backtester.

Thanks again!