Notebook

In this notebook, we take a look at the performance of the various operations used in the notebook from this community post. The goal is to help you understand which operations are fast and which are slow, so you can optimize the way you interact with the notebook.

In [1]:
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.alpha_vertex import precog_top_100
from quantopian.pipeline.data import EquityPricing, factset
from quantopian.pipeline.factors import Returns, SimpleBeta, SimpleMovingAverage
from quantopian.pipeline.filters import QTradableStocksUS
from quantopian.research import run_pipeline

import time
import pandas as pd
import numpy as np

START = pd.Timestamp("2010-01-05")
END = pd.Timestamp("2017-01-01")
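
Throughout the notebook, operations are timed with the same pattern: record time.time() before the call and print the elapsed time afterwards. If you prefer, that pattern can be wrapped in a small context manager. The helper below is only a sketch (the timed name is ours, not part of the Quantopian API); the cells that follow keep the explicit pattern so their printed outputs match.

from contextlib import contextmanager

@contextmanager
def timed(label):
    # Record wall-clock time around the wrapped block and print it,
    # mirroring the explicit start/stop pattern used in the cells below.
    start = time.time()
    yield
    print "%s took %.2f seconds." % (label, time.time() - start)

# Example usage:
# with timed("Running a pipeline with precog data"):
#     run_pipeline(pipe, start_date=START, end_date=END)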

"Premium" Dataset

The speed at which premium data is loaded can vary widely. A couple of factors affect load times.

  • Load times depend on how many notebooks/algorithms are accessing premium datasets at the same time. Generally speaking, contest algorithms run between 3:30 AM ET and 8:30 AM ET. Many contest algorithms use premium datasets, so loading premium data during this window is slower than normal.
  • Loading data for the first time is slow. However, if you re-run a computation you ran recently (i.e. load the data in the same way), the load time will be much faster.
In [2]:
universe = QTradableStocksUS()
In [3]:
pipe = Pipeline(columns={'alpha': precog_top_100.predicted_five_day_log_return.latest}, screen=universe)

starttime = time.time()
results_precog = run_pipeline(pipe, start_date=START, end_date=END).dropna()
print "Running a pipeline with precog data took %.2f seconds." % (time.time() - starttime)
Running a pipeline with precog data took 84.69 seconds.

If we run the exact same computation, the load time improves significantly. Restarting the notebook negates this effect.

In [4]:
starttime = time.time()
results_precog = run_pipeline(pipe, start_date=START, end_date=END).dropna()
print "Running a pipeline with precog data took %.2f seconds." % (time.time() - starttime)
Running a pipeline with precog data took 23.95 seconds.

Core Dataset

Core datasets have a different backend implementation that supports much faster load times. The Fundamentals dataset is orders of magnitude larger than the precog_top_100 dataset, but it loads much more quickly. We are looking to add new datasets from FactSet in a similar way to how we added the core datasets.

In [5]:
pipe = Pipeline(columns={'alpha': factset.Fundamentals.mkt_val.latest}, screen=universe)

starttime = time.time()
results_mcap = run_pipeline(pipe, start_date=START, end_date=END).dropna()
print "Running a pipeline with fundamental data took %.2f seconds." % (time.time() - starttime)
Running a pipeline with fundamental data took 14.62 seconds.
In [6]:
pipe = Pipeline(columns={'alpha': EquityPricing.close.latest}, screen=universe)

starttime = time.time()
results_price = run_pipeline(pipe, start_date=START, end_date=END).dropna()
print "Running a pipeline with pricing data took %.2f seconds." % (time.time() - starttime)
Running a pipeline with pricing data took 17.11 seconds.

Factor Loadings

Getting factor loadings is usually quick.

In [7]:
from quantopian.research.experimental import get_factor_returns, get_factor_loadings
In [8]:
assets = results_precog.index.levels[1]
In [9]:
starttime = time.time()
# Load risk factor loadings and returns
factor_loadings = get_factor_loadings(assets, START, END + pd.Timedelta(days=30))
factor_returns = get_factor_returns(START, END + pd.Timedelta(days=30))
print "Getting factor loadings and returns took %.2f seconds." % (time.time() - starttime)
Getting factor loadings and returns took 11.24 seconds.

Preparing Data for Alphalens

Getting pricing data from get_pricing is usually quick.

In [10]:
starttime = time.time()
pricing = get_pricing(assets, START, END + pd.Timedelta(days=30), fields="close_price")
print "Getting pricing data for alphalens took %.2f seconds." % (time.time() - starttime)
Getting pricing data for alphalens took 1.95 seconds.
In [11]:
import alphalens as al

It seems the get_clean_factor_and_forward_returns function in alphalens.utils is the culprit in this notebook. It takes about 3 minutes to run. Later in the notebook, it gets called 5 times to generate a single plot.

In [12]:
starttime = time.time()
factor_data_total = al.utils.get_clean_factor_and_forward_returns(
    results_precog['alpha'], 
    pricing,
    periods=range(1, 15))
print "get_clean_factor_and_forward_returns took %.2f seconds." % (time.time() - starttime)
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
get_clean_factor_and_forward_returns took 177.56 seconds.
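
If you want to probe whether the number of forward-return horizons is what makes the call expensive, one quick (and itself expensive) experiment is to time the same call with progressively longer periods lists. This is only a sketch reusing results_precog and pricing from the cells above; it has not been run here.

for periods in [(1,), (1, 5), (1, 5, 10), tuple(range(1, 15))]:
    starttime = time.time()
    # Same call as in the cell above; only the forward-return horizons change.
    al.utils.get_clean_factor_and_forward_returns(
        results_precog['alpha'],
        pricing,
        periods=periods)
    print "periods=%s took %.2f seconds." % (str(periods), time.time() - starttime)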
In [13]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import empyrical as ep
import alphalens as al
import pyfolio as pf

from quantopian.research.experimental import get_factor_returns, get_factor_loadings

def compute_specific_returns(total_returns, factor_returns, factor_loadings):
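    # Common returns for each stock are its risk-factor loadings multiplied by
    # the factor returns, summed across factors; specific returns are the
    # residual left after subtracting them from total returns.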
    
    factor_returns.index = factor_returns.index.set_names(['dt'])
    factor_loadings.index = factor_loadings.index.set_names(['dt', 'ticker'])
    common_returns = factor_loadings.mul(factor_returns).sum(axis='columns').unstack()
    specific_returns = total_returns - common_returns
    return specific_returns

def factor_portfolio_returns(factor, pricing, equal_weight=True, delay=0):
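    # Build a long/short portfolio from the factor. With equal_weight, only the
    # sign of the factor is kept (two bins), so every name enters with the same
    # magnitude; otherwise names are bucketed into quintiles. Positions are
    # normalized to unit gross exposure, forward-filled to pricing dates, and
    # optionally shifted by `delay` days.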
    if equal_weight:
        factor = np.sign(factor)
        bins = (-1, 0, 1)
        quantiles = None
        zero_aware = False
    else:
        bins = None
        quantiles = 5
        zero_aware = True
        
    pos = factor.unstack().fillna(0)
    pos = (pos / pos.abs().sum()).reindex(pricing.index).ffill().shift(delay)
    # Fully invested, shorts show up as cash
    pos['cash'] = pos[pos < 0].sum(axis='columns')
    
    factor_and_returns = al.utils.get_clean_factor_and_forward_returns(
        pos.stack().loc[lambda x: x != 0], 
        pricing, periods=(1,), quantiles=quantiles, bins=bins, 
        zero_aware=zero_aware)
    
    return al.performance.factor_returns(factor_and_returns)['1D'], pos

def plot_ic_over_time(factor_data, label='', ax=None):
    mic = al.performance.mean_information_coefficient(factor_data)
    mic.index = mic.index.map(lambda x: int(x[:-1])) 
    ax = mic.plot(label=label, ax=ax)
    ax.set(xlabel='Days', ylabel='Mean IC')
    ax.legend()
    ax.axhline(0, ls='--', color='k')
    
def plot_cum_returns_delay(factor, pricing, delay=range(5), ax=None):
    if ax is None:
        fig, ax = plt.subplots()
    for d in delay:
        portfolio_returns, _ = factor_portfolio_returns(factor, pricing, delay=d)
        ep.cum_returns(portfolio_returns).plot(ax=ax, label=d)
    ax.legend()
    ax.set(ylabel='Cumulative returns', title='Cumulative returns if factor is delayed')
    
def plot_exposures(risk_exposures, ax=None):
    rep = risk_exposures.stack().reset_index()
    rep.columns = ['dt', 'factor', 'exposure']
    sns.boxplot(x='exposure', y='factor', data=rep, orient='h', ax=ax, order=risk_exposures.columns[::-1])
    
def plot_overview_tear_sheet(factor, pricing, factor_returns, factor_loadings, periods=range(1, 15)):
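    # Four-panel overview: mean IC by horizon (total vs. specific returns),
    # cumulative returns of the factor portfolio under different delays,
    # risk-factor exposures, and performance attribution.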
    stock_rets = pricing.pct_change()
    stock_rets_specific = compute_specific_returns(stock_rets, factor_returns, factor_loadings)
    cr_specific = ep.cum_returns(stock_rets_specific, starting_value=1)
    
    factor_data_total = al.utils.get_clean_factor_and_forward_returns(
        factor, 
        pricing,
        periods=periods)
    
    factor_data_specific = al.utils.get_clean_factor_and_forward_returns(
        factor, 
        cr_specific,
        periods=periods)
    
    portfolio_returns, portfolio_pos = factor_portfolio_returns(factor, pricing)

    factor_loadings.index = factor_loadings.index.set_names(['dt', 'ticker'])
    portfolio_pos.index = portfolio_pos.index.set_names(['dt'])
    risk_exposures_portfolio, perf_attribution = pf.perf_attrib.perf_attrib(
        portfolio_returns, 
        portfolio_pos, 
        factor_returns, 
        factor_loadings, 
        pos_in_dollars=False)

    fig = plt.figure(figsize=(16, 16))
    gs = plt.GridSpec(4, 4)
    ax1 = plt.subplot(gs[0:2, 0:2])

    plot_ic_over_time(factor_data_total, label='Total returns', ax=ax1)
    plot_ic_over_time(factor_data_specific, label='Specific returns', ax=ax1)

    ax2 = plt.subplot(gs[0:2, 2:4])
    plot_cum_returns_delay(factor, pricing, ax=ax2)

    ax3 = plt.subplot(gs[2:4, 0:2])
    plot_exposures(risk_exposures_portfolio.reindex(columns=perf_attribution.columns), 
                   ax=ax3)

    ax4 = plt.subplot(gs[2:4, 2])
    ep.cum_returns_final(perf_attribution).plot.barh(ax=ax4)
    ax4.set(xlabel='Cumulative returns')

    ax5 = plt.subplot(gs[2:4, 3], sharey=ax4)
    perf_attribution.apply(ep.annual_volatility).plot.barh(ax=ax5, color='r')
    ax5.set(xlabel='Ann. volatility')

    gs.tight_layout(fig)

The second plot (cumulative returns under different delays) calls get_clean_factor_and_forward_returns 5 times, once per delay. All in all, the function gets called 8 times: twice for the IC panels (total and specific returns), once for the base factor portfolio, and 5 times for the delay plot, which matches the number of warning messages below. This is the primary reason why generating these plots takes so long.

In [14]:
starttime = time.time()
plot_overview_tear_sheet(results_precog['alpha'], pricing, factor_returns, factor_loadings)
print "get_clean_factor_and_forward_returns took %.2f seconds." % (time.time() - starttime)
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 1.0% entries from factor data: 1.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 1.0% entries from factor data: 1.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 1.0% entries from factor data: 1.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 1.0% entries from factor data: 1.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 1.0% entries from factor data: 1.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Dropped 1.0% entries from factor data: 1.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
Generating the overview tear sheet took 621.70 seconds.
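
To confirm the call count directly rather than by counting warning messages, you could temporarily wrap the function with a counter before regenerating the tear sheet. This is only a sketch; re-running the tear sheet just to count calls would cost another ~10 minutes.

call_count = {'n': 0}
_original = al.utils.get_clean_factor_and_forward_returns

def _counted(*args, **kwargs):
    # Count every call routed through alphalens.utils, then delegate.
    call_count['n'] += 1
    return _original(*args, **kwargs)

al.utils.get_clean_factor_and_forward_returns = _counted
try:
    plot_overview_tear_sheet(results_precog['alpha'], pricing, factor_returns, factor_loadings)
finally:
    # Always restore the original function.
    al.utils.get_clean_factor_and_forward_returns = _original

print "get_clean_factor_and_forward_returns was called %d times." % call_count['n']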

Conclusion

There are two operations that can take significant time:

  1. Running a computation that depends on a "premium" dataset for the first time. Running the same computation in the same kernel (without restarting the notebook) will be much faster. The load time also varies with how many notebooks/algorithms are loading premium data at the same time; the busiest window is 3:30 AM - 8:30 AM ET, when contest backtests run each day.
  2. alphalens.utils.get_clean_factor_and_forward_returns is quite slow. We will have to revisit this function and see if we can speed it up. Unfortunately, I'm not aware of a workaround right now.