import numpy as np
import matplotlib.pyplot as plt
from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.filters import QTradableStocksUS
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.factors import SimpleMovingAverage, AnnualizedVolatility, Returns, DailyReturns, CustomFactor
from statsmodels import regression
import statsmodels.api as sm
import pandas as pd
class Var(CustomFactor):
    # Variance of daily price changes over the trailing year.
    inputs = [USEquityPricing.close]
    window_length = 252

    def compute(self, today, asset_ids, out, prices):
        # 251 daily differences; both sums are divided by 252,
        # matching the original loop arithmetic.
        diffs = np.diff(prices, axis=0)
        mean = diffs.sum(axis=0) / 252
        out[:] = ((diffs - mean) ** 2).sum(axis=0) / 252
class MeanRet(CustomFactor):
    # Mean daily price change over the trailing year.
    inputs = [USEquityPricing.close]
    window_length = 252

    def compute(self, today, asset_ids, out, prices):
        # 251 daily differences divided by 252, as in the original loop.
        out[:] = np.diff(prices, axis=0).sum(axis=0) / 252
class CumReturn(CustomFactor):
    # Total return over the full window.
    inputs = [USEquityPricing.close]
    window_length = 252

    def compute(self, today, asset_ids, out, prices):
        out[:] = (prices[-1] - prices[0]) / prices[0]
class RadjRet(CustomFactor):
    # Risk-adjusted return: mean daily price change divided by its
    # standard deviation (a Sharpe-like ratio without a risk-free rate).
    inputs = [USEquityPricing.close]
    window_length = 252

    def compute(self, today, asset_ids, out, prices):
        diffs = np.diff(prices, axis=0)
        mean = diffs.sum(axis=0) / 252
        var = ((diffs - mean) ** 2).sum(axis=0) / 252
        out[:] = mean / var ** 0.5
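As a sanity check outside the pipeline, the loop arithmetic used in these compute methods matches a vectorized NumPy formulation. This snippet is purely illustrative, using random data in place of real prices:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.uniform(10.0, 20.0, size=(252, 3))  # 252 days x 3 assets

# Loop form, as written in the factor classes.
mean_loop = 0
for i in range(1, 252):
    mean_loop = mean_loop + (prices[-i] - prices[-i - 1]) / 252
var_loop = 0
for i in range(1, 252):
    var_loop = var_loop + ((prices[-i] - prices[-i - 1]) - mean_loop) ** 2
var_loop = var_loop / 252

# Vectorized form: 251 daily diffs, same divisors as the loops.
diffs = np.diff(prices, axis=0)
mean_vec = diffs.sum(axis=0) / 252
var_vec = ((diffs - mean_vec) ** 2).sum(axis=0) / 252

assert np.allclose(mean_loop, mean_vec)
assert np.allclose(var_loop, var_vec)
```

Note that both forms divide 251 differences by 252, so this is not the textbook sample variance; the vectorized form simply preserves the original arithmetic.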
def make_pipeline():
    # Set the universe.
    univ = QTradableStocksUS()
    # Factors.
    rdel = DailyReturns(mask=univ)
    rcum = CumReturn(mask=univ)
    mean = MeanRet(mask=univ)
    var = Var(mask=univ)
    radj_ret = RadjRet(mask=univ)
    # Filters: bottom and top deciles of cumulative return, restricted to
    # names whose risk-adjusted return exceeds their cumulative return.
    fil1 = rcum.percentile_between(0.0, 10.0)
    fil2 = rcum.percentile_between(90.0, 100.0)
    fil3 = radj_ret > rcum
    fil = (fil1 | fil2) & fil3
    return Pipeline(
        columns={'Daily Return': rdel,
                 'Mean Return': mean,
                 'Cumulative Ret': rcum,
                 'Variance': var,
                 'Risk Adj Ret': radj_ret},
        screen=fil)
results = run_pipeline(make_pipeline(), start_date='2017-01-03', end_date='2017-01-05')
results.info()
The level_0 index holds the dates. The level_1 index holds the security objects. The columns are whatever factors were defined in the pipeline definition.
In an algo (i.e., the IDE environment), a pipeline returns a single-index dataframe. The index is the securities; there is no date index. During an algo simulation, the pipeline only returns data for the current simulation day, so the date isn't included in the index and is implied to be the simulation date.
In a notebook, a pipeline returns data for multiple dates. Each date's slice is the dataframe that would be returned for that date in an algo. One can think of a notebook pipeline dataframe as multiple algo dataframes 'stacked' up, one for each date.
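This stacked structure is easiest to see with a small stand-in built in plain pandas. The tickers and values below are made up, and plain strings stand in for the Equity objects a real pipeline returns:

```python
import pandas as pd

# Toy (date, security) MultiIndex frame mimicking a notebook pipeline result.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2017-01-03', '2017-01-04']), ['ARNC', 'AKRX']])
demo = pd.DataFrame({'Daily Return': [0.010, -0.020, 0.005, 0.015],
                     'Variance':     [1e-4, 4e-4, 1e-4, 4e-4]},
                    index=idx)

# Each date's cross-section is exactly what an algo would see on that day.
day_one = demo.loc[pd.Timestamp('2017-01-03')]
print(day_one.shape)  # (2, 2): one row per security, same columns
```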
First, generally do not reset the index. Leave it 'as is'.
If one wants to look at a specific date, use the pandas xs method. This will return a single-index slice of the dataframe for just one date. This is convenient because the result will be identical to the dataframe returned by a pipeline in the IDE, so one can manipulate it the same way one would in an algo.
results.xs('2017-01-03').head()
If one wants to analyze, or look at, a single security over many dates, use the same xs method but specify the security and level=1 (remember the level_1 index contains the securities). Again, the result will be a single-index dataframe.
results.xs(symbols('ARNC'), level=1).head()
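Both slices can be sketched on a toy frame (plain-string tickers standing in for the Equity objects, values made up):

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2017-01-03', '2017-01-04']), ['ARNC', 'AKRX']])
demo = pd.DataFrame({'Daily Return': [0.010, -0.020, 0.005, 0.015]},
                    index=idx)

# One date (level 0 is the default): single-index frame, one row per security.
one_date = demo.xs('2017-01-03')

# One security across all dates: single-index frame, one row per date.
one_security = demo.xs('ARNC', level=1)
```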
It's sometimes convenient to name the indexes. Not all methods accept named indexes, but it can make things more readable. One can use the set_names method. Remember to use this method on the index and not on the dataframe itself (i.e., results.index.set_names and not results.set_names).
results.index.set_names(['date', 'security'], inplace=True)
results.head()
For more complex filtering of dates and securities, I like using the query method. One can use the index names in the query to make selection easier. Here we want to get data for two securities ('ARNC' and 'AKRX') on a single date ('2017-01-03'). The loc method could also be used, but I find the query method much more powerful.
selected_securities = symbols(['ARNC', 'AKRX'])
results.query("date == '2017-01-03' and security in @selected_securities").head()
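The same selection works on a toy frame with named index levels (names and values illustrative); query can reference level names directly and local variables via @:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2017-01-03', '2017-01-04']), ['ARNC', 'AKRX', 'XYZ']],
    names=['date', 'security'])
demo = pd.DataFrame({'Daily Return': [0.01, -0.02, 0.03, 0.02, 0.00, 0.01]},
                    index=idx)

selected = ['ARNC', 'AKRX']
# Index level names ('date', 'security') act like columns inside query.
subset = demo.query("date == '2017-01-03' and security in @selected")
```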
Now, instead of looking at one specific date or security, maybe you want to look at aggregates across dates or securities. The pandas groupby method works well for this. For example, to get the mean variance for each security, first group by level_1, then take the mean. Since we are fetching a single value for each security, the result will be a pandas series indexed by security.
results.groupby(level=1).Variance.mean().head()
If we don't specify a specific column for the calculation (e.g., 'Variance'), the calculation will be applied to all columns and the result will be a dataframe. One can also apply a different calculation to each column in a single step by using the agg method.
results.groupby(level=1).mean().head()
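A per-column aggregation with agg might look like this on a toy frame (columns and statistics chosen for illustration):

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2017-01-03', '2017-01-04']), ['ARNC', 'AKRX']],
    names=['date', 'security'])
demo = pd.DataFrame({'Daily Return': [0.01, -0.02, 0.03, 0.02],
                     'Variance':     [1e-4, 4e-4, 2e-4, 3e-4]},
                    index=idx)

# One groupby, a different aggregation for each column.
stats = demo.groupby(level='security').agg(
    {'Daily Return': 'mean', 'Variance': 'max'})
```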
Many pandas methods expect values to be presented as rows and columns. For example, to plot 'Daily Return' using the plot method, pandas expects dates to be the index and securities to be the columns. The unstack method will turn one of the index levels into columns. Think of it as taking a tall 'stack' of securities and shortening it (while making it wider). Let's look at the 'Daily Return' for our two stocks, 'ARNC' and 'AKRX'. Since we are selecting just a single column, the result will initially be a multi-index series.
results.query("security in @selected_securities")['Daily Return']
If we want a single-index dataframe with dates as the rows and securities as the columns, then use unstack.
results.query("security in @selected_securities")['Daily Return'].unstack()
This is the format in which many pandas methods expect the data; plot is a good example. Let's plot the 'Daily Return' values for these two securities.
results.query("security in @selected_securities")['Daily Return'].unstack().plot()
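The reshape itself can be sketched on a toy frame (made-up values, string tickers); by default unstack moves the innermost index level into the columns:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2017-01-03', '2017-01-04']), ['ARNC', 'AKRX']],
    names=['date', 'security'])
demo = pd.DataFrame({'Daily Return': [0.01, -0.02, 0.03, 0.02]}, index=idx)

# Tall multi-index series -> wide frame: dates as rows, securities as columns.
wide = demo['Daily Return'].unstack()
# wide.plot() would now draw one line per security.
```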
There are many directions to go from here, but these are the basics. One last note: generally, don't sort dataframes, and especially not their indexes, unless there is a good reason. If one wants to find the largest or smallest values, a great pair of methods is nlargest and nsmallest. No need to sort and then select; it's all done in a single step. Let's get the 5 securities with the largest and smallest 'Daily Return'.
results.nlargest(5, 'Daily Return')
results.nsmallest(5, 'Daily Return')
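On a toy frame (made-up values), the two methods return rows already ordered by value, no explicit sort needed:

```python
import pandas as pd

demo = pd.DataFrame(
    {'Daily Return': [0.05, -0.03, 0.10, 0.01, -0.07, 0.02]},
    index=['A', 'B', 'C', 'D', 'E', 'F'])

top = demo.nlargest(2, 'Daily Return')      # C (0.10), then A (0.05)
bottom = demo.nsmallest(2, 'Daily Return')  # E (-0.07), then B (-0.03)
```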