In this notebook, we take a look at the performance of the various functions used in the notebook from this community post. The goal is to help you understand which operations are fast and which are slow, so you can optimize the way you interact with the notebook.
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.alpha_vertex import precog_top_100
from quantopian.pipeline.data import EquityPricing, factset
from quantopian.pipeline.factors import Returns, SimpleBeta, SimpleMovingAverage
from quantopian.pipeline.filters import QTradableStocksUS
from quantopian.research import run_pipeline
import time
import pandas as pd
import numpy as np
START = pd.Timestamp("2010-01-05")
END = pd.Timestamp("2017-01-01")
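Each timing below repeats the same time.time() bookkeeping. Purely as a convenience sketch (the timed helper is my own addition, not part of the original notebook), the pattern could be wrapped in a context manager:
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Hypothetical helper: print how long the wrapped block took.
    start = time.time()
    yield
    print("%s took %.2f seconds." % (label, time.time() - start))

# Example usage:
# with timed("Running a pipeline with precog data"):
#     run_pipeline(pipe, start_date=START, end_date=END)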
The speed at which premium data is loaded can vary widely. Two factors in particular affect load times: the backend implementation behind the dataset, and whether the same data has already been loaded in the current notebook session.
universe = QTradableStocksUS()
pipe = Pipeline(columns={'alpha': precog_top_100.predicted_five_day_log_return.latest}, screen=universe)
starttime = time.time()
results_precog = run_pipeline(pipe, start_date=START, end_date=END).dropna()
print "Running a pipeline with precog data took %.2f seconds." % (time.time() - starttime)
If we run the exact same computation again, the load time improves significantly because the results are cached in memory. Restarting the notebook clears the cache and negates this effect.
starttime = time.time()
results_precog = run_pipeline(pipe, start_date=START, end_date=END).dropna()
print "Running a pipeline with precog data took %.2f seconds." % (time.time() - starttime)
Core datasets have a different backend implementation that supports much faster load times. The Fundamentals dataset is orders of magnitude larger than the precog_top_100 dataset, but it loads much more quickly. We are looking to add new FactSet datasets using the same backend as the core datasets.
pipe = Pipeline(columns={'alpha': factset.Fundamentals.mkt_val.latest}, screen=universe)
starttime = time.time()
results_mcap = run_pipeline(pipe, start_date=START, end_date=END).dropna()
print "Running a pipeline with fundamental data took %.2f seconds." % (time.time() - starttime)
pipe = Pipeline(columns={'alpha': EquityPricing.close.latest}, screen=universe)
starttime = time.time()
results_price = run_pipeline(pipe, start_date=START, end_date=END).dropna()
print "Running a pipeline with pricing data took %.2f seconds." % (time.time() - starttime)
Getting factor loadings is usually quick.
from quantopian.research.experimental import get_factor_returns, get_factor_loadings
assets = results_precog.index.levels[1]
starttime = time.time()
# Load risk factor loadings and returns
factor_loadings = get_factor_loadings(assets, START, END + pd.Timedelta(days=30))
factor_returns = get_factor_returns(START, END + pd.Timedelta(days=30))
print "Getting factor loadings took %.2f seconds." % (time.time() - starttime)
Getting pricing data from get_pricing is usually quick.
starttime = time.time()
pricing = get_pricing(assets, START, END + pd.Timedelta(days=30), fields="close_price")
print "Getting pricing data for alphalens took %.2f seconds." % (time.time() - starttime)
import alphalens as al
It seems the get_clean_factor_and_forward_returns function in alphalens.utils is the culprit in this notebook: it takes about 3 minutes to run. Later in the notebook, it gets called 5 times to generate a single plot.
starttime = time.time()
factor_data_total = al.utils.get_clean_factor_and_forward_returns(
    results_precog['alpha'],
    pricing,
    periods=range(1, 15))
print "get_clean_factor_and_forward_returns took %.2f seconds." % (time.time() - starttime)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import empyrical as ep
import alphalens as al
import pyfolio as pf
from quantopian.research.experimental import get_factor_returns, get_factor_loadings
def compute_specific_returns(total_returns, factor_returns, factor_loadings):
    # Align index names so the multiplication below broadcasts correctly.
    factor_returns.index = factor_returns.index.set_names(['dt'])
    factor_loadings.index = factor_loadings.index.set_names(['dt', 'ticker'])
    # Common returns are each stock's factor loadings times the factor
    # returns, summed across factors; what remains is specific return.
    common_returns = factor_loadings.mul(factor_returns).sum(axis='columns').unstack()
    specific_returns = total_returns - common_returns
    return specific_returns
def factor_portfolio_returns(factor, pricing, equal_weight=True, delay=0):
    if equal_weight:
        # Equal-weight long/short: only the sign of the factor matters.
        factor = np.sign(factor)
        bins = (-1, 0, 1)
        quantiles = None
        zero_aware = False
    else:
        bins = None
        quantiles = 5
        zero_aware = True
    # Convert the factor to positions, scale them, and optionally delay
    # the signal by `delay` days.
    pos = factor.unstack().fillna(0)
    pos = (pos / pos.abs().sum()).reindex(pricing.index).ffill().shift(delay)
    # Fully invested, shorts show up as cash
    pos['cash'] = pos[pos < 0].sum(axis='columns')
    factor_and_returns = al.utils.get_clean_factor_and_forward_returns(
        pos.stack().loc[lambda x: x != 0],
        pricing, periods=(1,), quantiles=quantiles, bins=bins,
        zero_aware=zero_aware)
    return al.performance.factor_returns(factor_and_returns)['1D'], pos
def plot_ic_over_time(factor_data, label='', ax=None):
    mic = al.performance.mean_information_coefficient(factor_data)
    # Convert period labels like '1D' to integer day counts for the x-axis.
    mic.index = mic.index.map(lambda x: int(x[:-1]))
    ax = mic.plot(label=label, ax=ax)
    ax.set(xlabel='Days', ylabel='Mean IC')
    ax.legend()
    ax.axhline(0, ls='--', color='k')
def plot_cum_returns_delay(factor, pricing, delay=range(5), ax=None):
    if ax is None:
        fig, ax = plt.subplots()
    for d in delay:
        portfolio_returns, _ = factor_portfolio_returns(factor, pricing, delay=d)
        ep.cum_returns(portfolio_returns).plot(ax=ax, label=d)
    ax.legend()
    ax.set(ylabel='Cumulative returns', title='Cumulative returns if factor is delayed')
def plot_exposures(risk_exposures, ax=None):
    rep = risk_exposures.stack().reset_index()
    rep.columns = ['dt', 'factor', 'exposure']
    sns.boxplot(x='exposure', y='factor', data=rep, orient='h', ax=ax,
                order=risk_exposures.columns[::-1])
def plot_overview_tear_sheet(factor, pricing, factor_returns, factor_loadings,
                             periods=range(1, 15)):
    stock_rets = pricing.pct_change()
    stock_rets_specific = compute_specific_returns(stock_rets, factor_returns, factor_loadings)
    cr_specific = ep.cum_returns(stock_rets_specific, starting_value=1)
    # Forward returns computed on total prices and on specific returns.
    factor_data_total = al.utils.get_clean_factor_and_forward_returns(
        factor,
        pricing,
        periods=periods)
    factor_data_specific = al.utils.get_clean_factor_and_forward_returns(
        factor,
        cr_specific,
        periods=periods)
    portfolio_returns, portfolio_pos = factor_portfolio_returns(factor, pricing)
    factor_loadings.index = factor_loadings.index.set_names(['dt', 'ticker'])
    portfolio_pos.index = portfolio_pos.index.set_names(['dt'])
    risk_exposures_portfolio, perf_attribution = pf.perf_attrib.perf_attrib(
        portfolio_returns,
        portfolio_pos,
        factor_returns,
        factor_loadings,
        pos_in_dollars=False)
    # Lay out the four panels of the tear sheet.
    fig = plt.figure(figsize=(16, 16))
    gs = plt.GridSpec(4, 4)
    ax1 = plt.subplot(gs[0:2, 0:2])
    plot_ic_over_time(factor_data_total, label='Total returns', ax=ax1)
    plot_ic_over_time(factor_data_specific, label='Specific returns', ax=ax1)
    ax2 = plt.subplot(gs[0:2, 2:4])
    plot_cum_returns_delay(factor, pricing, ax=ax2)
    ax3 = plt.subplot(gs[2:4, 0:2])
    plot_exposures(risk_exposures_portfolio.reindex(columns=perf_attribution.columns),
                   ax=ax3)
    ax4 = plt.subplot(gs[2:4, 2])
    ep.cum_returns_final(perf_attribution).plot.barh(ax=ax4)
    ax4.set(xlabel='Cumulative returns')
    ax5 = plt.subplot(gs[2:4, 3], sharey=ax4)
    perf_attribution.apply(ep.annual_volatility).plot.barh(ax=ax5, color='r')
    ax5.set(xlabel='Ann. volatility')
    gs.tight_layout(fig)
The second plot alone calls get_clean_factor_and_forward_returns 5 times (once per delay value). All in all, the function appears to get called 8 times (based on the number of warning messages). This is the primary reason generating these plots takes so long.
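To verify the call count, one option is to temporarily wrap the function with a counter. This is a hypothetical instrumentation sketch, not something alphalens provides:
import functools

_original = al.utils.get_clean_factor_and_forward_returns
call_count = [0]

@functools.wraps(_original)
def _counted(*args, **kwargs):
    # Count each invocation, then delegate to the real function.
    call_count[0] += 1
    return _original(*args, **kwargs)

al.utils.get_clean_factor_and_forward_returns = _counted
# After running the cell below, inspect call_count[0] and restore with:
# al.utils.get_clean_factor_and_forward_returns = _original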
starttime = time.time()
plot_overview_tear_sheet(results_precog['alpha'], pricing, factor_returns, factor_loadings)
print "get_clean_factor_and_forward_returns took %.2f seconds." % (time.time() - starttime)
There are two operations that can take significant time:
1. Loading premium data (like precog_top_100) for the first time in a session. Running the exact same pipeline again is much faster because the results are cached, but restarting the notebook clears the cache.
2. alphalens.utils.get_clean_factor_and_forward_returns is quite slow. We will have to look at this function and see if we can speed it up. Unfortunately, I'm not aware of a workaround right now.