Notebook
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.pipeline.filters import QTradableStocksUS
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.factors import SimpleMovingAverage,AnnualizedVolatility,Returns,DailyReturns,CustomFactor
from statsmodels import regression
import statsmodels.api as sm
import pandas as pd
In [2]:
class Var(CustomFactor):
    
    inputs = [USEquityPricing.close]
    window_length = 252
    
    def compute(self, today, asset_ids, out, prices):
        sums = 0;sum = 0
        for i in range(1,252):
            temp = (prices[-i]-prices[-i-1])/252
            sum = sum+temp
        for i in range(1,252):
            temp = ((prices[-i]-prices[-i-1])-sum)**2
            sums = sums + temp
        sums = sums/252
        out[:] = sums
In [3]:
class MeanRet(CustomFactor):
    
    inputs = [USEquityPricing.close]
    window_length = 252
    
    def compute(self, today, asset_ids, out, prices):
        sum = 0
        for i in range(1,252):
            temp = (prices[-i]-prices[-i-1])
            sum = sum+temp
        sum = sum/252
        out[:] = sum
In [4]:
class CumReturn(CustomFactor):
    # Default inputs.
    inputs = [USEquityPricing.close]
    window_length = 252
    def compute(self, today, asset_ids, out,prices):
        out[:] = (prices[-1]-prices[0])/prices[0] 
In [5]:
class RadjRet(CustomFactor):
    
    inputs = [USEquityPricing.close]
    window_length = 252
    
    def compute(self, today, asset_ids, out, prices):
        radj = 0
        sum=0;sums=0
        for i in range(1,252):
            temp = (prices[-i]-prices[-i-1])/252
            sum = sum+temp
        for i in range(1,252):
            temp = (((prices[-i]-prices[-i-1])-sum)**2)/252
            sums = sums + temp    
        radj = sum/sums**0.5    
        out[:] = radj
In [6]:
def make_pipeline():
    
     
     #Set Universe
    univ = QTradableStocksUS() 
    
    #Factors
    rdel = DailyReturns(mask = univ)
    rcum = CumReturn(mask = univ)
    mean = MeanRet(mask = univ)
    var = Var(mask = univ)
    radj_ret = RadjRet(mask = univ)
    close1 = USEquityPricing.close.latest
    
    #Filters
    fil1 = rcum.percentile_between(0.0,10.0)
    fil2 = rcum.percentile_between(90.0,100.0)
    fil3 = radj_ret>rcum
    fil = (fil1 & fil3) | (fil2 & fil3)
    return Pipeline(columns = 
                   {'Daily Return':rdel,
                    'Mean Return':mean,
                    'Cumulative Ret':rcum,
                    'Variance': var,
                    'Risk Adj Ret':radj_ret}
                    ,screen = fil)
    
In [7]:
results = run_pipeline(make_pipeline(),start_date = '2017-01-03',end_date = '2017-01-05')

Pipeline Execution Time: 15.39 Seconds
In [8]:
results.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 600 entries, (2017-01-03 00:00:00+00:00, Equity(2 [ARNC])) to (2017-01-05 00:00:00+00:00, Equity(49501 [LIVN]))
Data columns (total 5 columns):
Cumulative Ret    600 non-null float64
Daily Return      600 non-null float64
Mean Return       600 non-null float64
Risk Adj Ret      600 non-null float64
Variance          600 non-null float64
dtypes: float64(5)
memory usage: 28.1+ KB

Notice that pipeline returns a multi-index dataframe.

The level_0 index are dates. The level_1 index are the security objects. The columns are whatever factors were defined in the pipeline definition.

In an algo (ie the IDE environment), a pipeline returns only a single indexed dataframe. The index is the securities. There is no date index. During an algo simulation, pipeline only returns data for the current simulation day. Therefore, the date isn't included in the index and is implied to be the simulation date.

In a notebook, a pipeline returns data for multiple dates. Each date is the dataframe which would be returned for that date in an algo. One can think of a notebook pipeline dataframe as multiple algo dataframes 'stacked' up. One for each date.

So, how to manipulate this multi-index dataframe?

First, generally do not reset the index. Leave it 'as is'.

If one wants to look at a specific date use the pandas xs method. This will return a single index slice of the dataframe just for one date. This is convenient because the reult will be identical to the dataframe returned by a pipeline in the IDE. One can manipulate the dataframe the same as one would do in their algo.

In [9]:
results.xs('2017-01-03').head()
Out[9]:
Cumulative Ret Daily Return Mean Return Risk Adj Ret Variance
Equity(2 [ARNC]) -0.352692 -0.008552 -0.040108 -0.041176 0.948796
Equity(270 [AKRX]) -0.427034 0.032395 -0.064563 -0.061548 1.100402
Equity(301 [ALKS]) -0.256684 0.004698 -0.076190 -0.034239 4.951874
Equity(508 [AIRM]) -0.202354 0.003150 -0.032063 -0.038064 0.709571
Equity(1244 [CAMP]) -0.272590 -0.014956 -0.021548 -0.056893 0.143446

If one wants to analyze, or look at, a single security over many dates use the same xs method but specify the security and level=1 (remember the level_1 index contains the securities). Again, the reult will be a single indexed dataframe.

In [11]:
results.xs(symbols('ARNC'), level=1).head()
Out[11]:
Cumulative Ret Daily Return Mean Return Risk Adj Ret Variance
2017-01-03 00:00:00+00:00 -0.352692 -0.008552 -0.040108 -0.041176 0.948796
2017-01-04 00:00:00+00:00 -0.298540 0.034501 -0.032410 -0.033352 0.944277
2017-01-05 00:00:00+00:00 -0.225028 0.026055 -0.022688 -0.023517 0.930719

It's sometimes convenient to name the indexes. Not all methods accept named indexes but it can make things more readable. One can use the set_names method. Remember to use this method on the index and not the dataframe itself (ie results.index.set_names and not results.set_names).

In [12]:
results.index.set_names(['date', 'security'], inplace=True)
results.head()
Out[12]:
Cumulative Ret Daily Return Mean Return Risk Adj Ret Variance
date security
2017-01-03 00:00:00+00:00 Equity(2 [ARNC]) -0.352692 -0.008552 -0.040108 -0.041176 0.948796
Equity(270 [AKRX]) -0.427034 0.032395 -0.064563 -0.061548 1.100402
Equity(301 [ALKS]) -0.256684 0.004698 -0.076190 -0.034239 4.951874
Equity(508 [AIRM]) -0.202354 0.003150 -0.032063 -0.038064 0.709571
Equity(1244 [CAMP]) -0.272590 -0.014956 -0.021548 -0.056893 0.143446

For more complex filtering of dates and securities I like using the query method. One can use the index names in the query to make selection easier. Here we want to get data for two securities ('ARNC', 'AKRX') for a single date ('2017-01-03'). The loc method can also be used but I find the query method much more powerful.

In [27]:
selected_securities = symbols(['ARNC', 'AKRX'])
results.query("date == '2017-01-03' and security in @selected_securities").head()
Out[27]:
Cumulative Ret Daily Return Mean Return Risk Adj Ret Variance
date security
2017-01-03 00:00:00+00:00 Equity(2 [ARNC]) -0.352692 -0.008552 -0.040108 -0.041176 0.948796
Equity(270 [AKRX]) -0.427034 0.032395 -0.064563 -0.061548 1.100402

Now, instead of looking at one specific date, or a specific security, maybe you want to look at aggregates of dates or securities. The pandas groupby method works well for this. For example, to get the mean variance for each security, first group by level_1, then take the mean. Since we are fetching a single value for each security, the result will be a pandas series indexed by security.

In [28]:
results.groupby(level=1).Variance.mean().head()
Out[28]:
security
Equity(2 [ARNC])       0.941264
Equity(270 [AKRX])     1.102792
Equity(301 [ALKS])     4.971454
Equity(508 [AIRM])     0.709571
Equity(1244 [CAMP])    0.144394
Name: Variance, dtype: float64

If we don't specify a specific column to perform a calculation (eg 'Variance') the calculation will be applied to all columns and the result will be a dataframe. One can also apply different calculations to each column in a single step by using the agg method.

In [17]:
results.groupby(level=1).mean().head()
Out[17]:
Cumulative Ret Daily Return Mean Return Risk Adj Ret Variance
security
Equity(2 [ARNC]) -0.292086 0.017335 -0.031735 -0.032682 0.941264
Equity(270 [AKRX]) -0.414876 0.034845 -0.063135 -0.060124 1.102792
Equity(301 [ALKS]) -0.237845 0.026964 -0.070820 -0.031772 4.971454
Equity(508 [AIRM]) -0.202354 0.003150 -0.032063 -0.038064 0.709571
Equity(1244 [CAMP]) -0.259754 0.017284 -0.020648 -0.054369 0.144394

Many of the pandas methods expect value to be presented as rows and columns. For example, to plot 'Daily Return' useing the plot method, pandas expects dates to be the index and the securities to be columns. The unstack method will turn one of the multi-indexes into columns. Think of it as taking a tall 'stack' of securities and shortening it (and making the columns wider). Let's look at the 'Daily Return' for our two stocks 'ARNC' and 'AKRX'. Since we are looking at just a single value, the result will initially be a multi-index series.

In [32]:
results.query("security in @selected_securities")['Daily Return']
Out[32]:
date                       security          
2017-01-03 00:00:00+00:00  Equity(2 [ARNC])     -0.008552
                           Equity(270 [AKRX])    0.032395
2017-01-04 00:00:00+00:00  Equity(2 [ARNC])      0.034501
                           Equity(270 [AKRX])    0.011452
2017-01-05 00:00:00+00:00  Equity(2 [ARNC])      0.026055
                           Equity(270 [AKRX])    0.060688
Name: Daily Return, dtype: float64

If we want a single index series with dates as the rows and securities at the columns, then use unstack.

In [33]:
results.query("security in @selected_securities")['Daily Return'].unstack()
Out[33]:
security Equity(2 [ARNC]) Equity(270 [AKRX])
date
2017-01-03 00:00:00+00:00 -0.008552 0.032395
2017-01-04 00:00:00+00:00 0.034501 0.011452
2017-01-05 00:00:00+00:00 0.026055 0.060688

This is the format many pandas methods expect the data. plot is a good example. Let's plot the 'Daily Return' values for these two securities.

In [34]:
results.query("security in @selected_securities")['Daily Return'].unstack().plot()
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3a4a712c90>

Many directions to go from here but this is the basics. But, one last note. Generally, don't sort dataframes and especially not the indexes unless there is a good reason. If one wants to find the largest or smallest values, a great couple of methods are nlargest and nsmallest. No need to sort and select. It's all done in a single step. Let's get the 5 securities with the largest and smallest 'Daily Return'.

In [35]:
results.nlargest(5, 'Daily Return')
Out[35]:
Cumulative Ret Daily Return Mean Return Risk Adj Ret Variance
date security
2017-01-05 00:00:00+00:00 Equity(23269 [WW]) -0.360992 0.210479 -0.030040 -0.046936 0.409621
2017-01-04 00:00:00+00:00 Equity(40319 [PACB]) -0.649236 0.145861 -0.032024 -0.070660 0.205402
2017-01-05 00:00:00+00:00 Equity(49335 [GBT]) -0.333751 0.117338 -0.031706 -0.033462 0.897805
Equity(48823 [SEDG]) -0.503726 0.098859 -0.058202 -0.080816 0.518661
Equity(49195 [MCRB]) -0.664325 0.097902 -0.086310 -0.046870 3.390976
In [36]:
results.nsmallest(5, 'Daily Return')
Out[36]:
Cumulative Ret Daily Return Mean Return Risk Adj Ret Variance
date security
2017-01-04 00:00:00+00:00 Equity(45735 [EIGI]) -0.239766 -0.161290 -0.009762 -0.031144 0.098247
Equity(26708 [HURN]) -0.207768 -0.090730 -0.047976 -0.043302 1.227529
2017-01-03 00:00:00+00:00 Equity(22166 [SHEN]) -0.332315 -0.056995 -0.053919 -0.036245 2.213010
2017-01-05 00:00:00+00:00 Equity(43713 [PBF]) -0.225337 -0.048155 -0.031258 -0.035433 0.778266
2017-01-04 00:00:00+00:00 Equity(23769 [ELGX]) -0.448345 -0.038462 -0.017738 -0.052782 0.112939
In [ ]: