Calculating rolling regression coefficients of a DataFrame

Hi - I'm new to Python. I've managed to successfully code an algo in ipython notebook and now I'm working on converting it to Quantopian.

I'm getting an error when trying to compute a rolling 30 day calculation of regression coefficients for a table of volatility calculations. Here is what I have so far:

    vol_table=pd.DataFrame({'x':x_output,'y':y_output,'z':z_output}, index=ind)

    #OLS Regression Setup  
    X = vol_table  
    Y = inddx_output  
    model=pd.ols(y=Y, x=X, window=30)  

    #Rolling Volatility  
    beta_df=model.beta  
    d_beta=beta_df['x']  
    w_beta=beta_df['y']  
    m_beta=beta_df['z']  
    inter=beta_df['intercept']

Here is the error I'm getting on the model=pd.ols line:

Runtime exception: ValueError: failed to create intent(cache|hide)|optional array-- must have defined dimensions but got (0,)

I'm not sure if Quantopian supports pandas rolling regression? I have no idea what this error means. Any insight would be appreciated!

21 responses

Erick Gomez

Hmm, are you sure that you have data in x and y? You can check via the debugger. Comment out the lines below the pd.ols call just to be sure.

Mark Olivieri

Thanks for the tip. I was able to find the issue. I have two lines of code, but for some reason the daily_log output is the same as daily_log_mean, which results in a zero value later in my algorithm since I'm subtracting the two. Does the expanding_mean function work in the Quantopian environment? The np.log1p call is working correctly.

    daily_log=np.log1p(daily_ret).fillna(0)  
    daily_log_mean=pd.expanding_mean(daily_log)

I wonder if it has to do with the history container?

    daily_p = history(bar_count=2, frequency='1d', field='price')  
    daily= (daily_p.ix[-1] - daily_p.ix[0]) / daily_p.ix[0]

Is there a way to run an expanding mean across each date in the history container? In the above example, I'm using a 2 bar count which is probably why daily_log = daily_log_mean.

Any recommendations on how to perform an expanding_mean across all dates?
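A minimal sketch of the effect described above, outside Quantopian. The price values are made up, and the series stands in for a single column of what history() returns. It uses `Series.expanding().mean()`, the spelling that replaced `pd.expanding_mean` in later pandas releases. With only one observation, the expanding mean equals the observation itself, so the squared deviation is zero.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices standing in for one column of history(...).
prices = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0])

# Row-by-row returns keep one value per date instead of collapsing to a scalar.
daily_ret = prices.pct_change().fillna(0)
daily_log = np.log1p(daily_ret)

# Later-pandas spelling of pd.expanding_mean(daily_log).
daily_log_mean = daily_log.expanding().mean()
sqdev_daily = (daily_log - daily_log_mean) ** 2

# With a single observation the expanding mean is the observation itself,
# so the squared deviation is exactly zero.
single = pd.Series([0.01])
single_sqdev = (single - single.expanding().mean()) ** 2
```

Requesting more bars and computing returns across all rows gives the expanding mean something to expand over.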

Grant Kiehne

Hi Mark,

Note that Pandas supports a generic rolling_apply, which can be used. I can work up an example, if it'd be helpful. It turns out that one has to do some coding gyrations for the case of multiple inputs and outputs. If you're still stuck, just let me know.

Grant

Mark Olivieri

Thanks Grant. Here is where I'm stuck:

    daily_p = history(bar_count=2, frequency='1d', field='price')  
    daily_ret= (daily_p.ix[-1] - daily_p.ix[0]) / daily_p.ix[0]  
    daily_rets=daily_ret.fillna(0)  
    daily_log=np.log1p(daily_rets).fillna(0)  
    daily_log_mean=pd.expanding_mean(daily_log)  
    sqdev_daily=(daily_log-daily_log_mean)**2.

sqdev_daily = 0 in the Quantopian environment (but not in the iPython environment), so the rest of my algo is not working. I think it has to do with Quantopian's history containers. In this case, I want a daily average return, so I'm pulling in two 1-day bars for each date. I am then taking the return of daily_p, which outputs x pandas Series, where x = the length of the backtest. My thought from here was to join each Series using a for loop: for i in daily_log: list.append(i). The goal was to create a single DataFrame or Series from the for loop, but this did not work either. All outputs in the above are correct until I get to daily_log_mean. There is no single Series or DataFrame over which to expand the mean (I think this is the problem).

I have such an easier time developing in iPython but once I bring to Quantopian, everything falls apart!
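One way to get a single series out of one-value-per-call results is to keep a running list between calls. The sketch below is only an illustration under assumptions: `LogReturnAccumulator` is a hypothetical helper, the prices are invented, and in a Quantopian algorithm the instance would presumably live on `context` rather than in a module-level variable.

```python
import numpy as np
import pandas as pd

class LogReturnAccumulator:
    """Hypothetical helper that collects one log return per call."""

    def __init__(self):
        self.dates = []
        self.values = []

    def update(self, date, prev_price, price):
        # Store today's log return alongside all earlier ones.
        self.dates.append(pd.Timestamp(date))
        self.values.append(np.log1p((price - prev_price) / prev_price))
        series = pd.Series(self.values, index=self.dates)
        # Squared deviation from the expanding mean over everything seen so far.
        return (series - series.expanding().mean()) ** 2

acc = LogReturnAccumulator()
acc.update("2011-01-04", 100.0, 101.0)
sqdev = acc.update("2011-01-05", 101.0, 99.0)
```

The first entry is always zero because the expanding mean of one value is that value; later entries become non-zero once more than one return has been collected.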

Mark Olivieri

I should have noted, I'm not opposed to using a rolling_mean. I did make a few adjustments, but rolling_mean is not helping either. Here is what I tried:

    daily_p = history(bar_count=100, frequency='1d', field='price') #expanded bar_count to have bar history for calculating rolling_mean  
    daily= (daily_p.ix[-1] - daily_p.ix[-2]) / daily_p.ix[-2]  
    daily_ret=daily.fillna(0)  
    daily_log=np.log1p(daily_ret).fillna(0)  
    daily_log_mean=pd.rolling_mean(daily_log, 5)  
    print daily_log_mean  
    >>>2011-01-04PRINTSecurity(8554 [SPY])   NaN  
          dtype: float64  
          2011-01-05PRINTSecurity(8554 [SPY])   NaN  
          dtype: float64  
          ...

Grant Kiehne

Mark,

It's kinda quick and dirty but maybe you want something like this:

import numpy as np  
import pandas as pd

def initialize(context):  
    context.spy = sid(8554)

def handle_data(context, data):  
    daily_p = history(bar_count=100, frequency='1d', field='price')  
    daily_ret = daily_p.pct_change()  
    daily_log = np.log1p(daily_ret)  
    daily_log_mean = pd.rolling_mean(daily_log, 5)  
    print daily_log_mean.tail(5)

I think that in calculating 'daily' you were creating a scalar, so the rolling mean was appropriately giving a NaN.

By the way, if you post a follow-up, please attach a backtest, since it is much easier to troubleshoot this way.

Grant
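The scalar-versus-series point can be reproduced outside Quantopian. This is a sketch with invented SPY prices in place of history(); it uses `.iloc` and `.rolling(5).mean()`, the later-pandas equivalents of `.ix` and `pd.rolling_mean`.

```python
import numpy as np
import pandas as pd

# Hypothetical prices standing in for history(bar_count=8, ...).
dates = pd.date_range("2011-01-03", periods=8, freq="B")
daily_p = pd.DataFrame(
    {"SPY": [126.0, 127.0, 126.5, 128.0, 127.5, 129.0, 128.5, 130.0]},
    index=dates,
)

# Differencing the last two rows leaves ONE number per security,
# so a 5-period rolling mean has nothing to average and yields NaN.
daily = (daily_p.iloc[-1] - daily_p.iloc[-2]) / daily_p.iloc[-2]
collapsed = daily.rolling(5).mean()

# pct_change() keeps one return per date, so the rolling mean is defined
# once five valid returns are available.
daily_log = np.log1p(daily_p.pct_change())
daily_log_mean = daily_log.rolling(5).mean()
```

Here `collapsed` is a one-element Series indexed by security, which matches the repeated NaN lines in the printed output earlier in the thread.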

Mark Olivieri

Thanks again Grant! One other quick question: if I wanted to do the same but for rolling weekly returns (not Friday to Friday but weekly for each day of the week), how would I accomplish that without creating a scalar?

Grant Kiehne

Mark,

I'm not exactly sure what you mean by "weekly for each day of the week" but if you dig into pct_change() you'll see that you can set the periods parameter (see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.pct_change.html). So, I think you'd want pct_change(periods = 5), so for example, on a Monday, you'd be comparing to the prior Monday, on a Tuesday, to the prior Tuesday, etc. (with the exception of incomplete trading weeks, which are sorta rare).

Grant
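A small illustration of `pct_change(periods=5)` on two complete trading weeks. The prices are made up, and the business-day index is an assumption that ignores market holidays.

```python
import pandas as pd

# Two full Monday-to-Friday weeks (2011-01-03 is a Monday); prices are invented.
dates = pd.date_range("2011-01-03", periods=10, freq="B")
prices = pd.Series(
    [100, 101, 102, 103, 104, 110, 111, 112, 113, 114],
    index=dates,
    dtype=float,
)

# Each value compares a day with the same weekday one week earlier.
weekly_ret = prices.pct_change(periods=5)
```

The first five entries are NaN because there is no bar five rows back; the sixth compares the second Monday (110) with the first Monday (100).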

Mark Olivieri

Grant - can I collaborate with you separately? I have two dataframes: df_x, which has 5 independent variables, and df_y, which is the dependent variable. I'm trying to perform a pd.ols rolling regression, but I'm getting the following error on my model=pd.ols(y,x,window=30) line.

Runtime exception: AssertionError: Must have 2-level MultiIndex

Please let me know if I can share. Or if you have any ideas on what might be causing this error.

Thanks.

Grant Kiehne

Hi Mark,

We can work through things here on the forum, if you want. It might be a day or two before I can get to it, though.

Grant

Grant Kiehne

Hello Mark,

Are you trying to apply a model like this:

y = m0*x0 + m1*x1 + m2*x2+ m3*x3 + m4*x4 + b

where the m's and b are constants? Is this what you mean by "5 independent variables"?

Also, when you say "rolling regression" do you just need the m's and b at a single point in time (e.g. the current call to handle_data)? Or do you actually need to roll over a set of x's and y's versus time, so that you then have m's and b's versus time? Either way it's doable, but implementation of the latter will take a bit more effort.

Grant
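For the "m's and b versus time" case, one possible sketch fits an ordinary least-squares model on each trailing window with `numpy.linalg.lstsq`. This is not the `pd.ols` call used in the thread; `rolling_ols` is a hypothetical helper, and the data is synthetic and noise-free so that the fitted values are easy to check.

```python
import numpy as np
import pandas as pd

def rolling_ols(y, X, window):
    """Fit y = X @ m + b on each trailing window.

    Returns a DataFrame with one row of coefficients per date;
    rows before the first full window are left as NaN.
    """
    cols = list(X.columns) + ["intercept"]
    out = pd.DataFrame(np.nan, index=X.index, columns=cols)
    # Design matrix: the regressors plus a column of ones for the intercept.
    A = np.column_stack([X.to_numpy(dtype=float), np.ones(len(X))])
    yv = y.to_numpy(dtype=float)
    for end in range(window, len(X) + 1):
        coef, *_ = np.linalg.lstsq(A[end - window:end], yv[end - window:end], rcond=None)
        out.iloc[end - 1] = coef
    return out

# Synthetic data with known coefficients (2.0, -1.0, 0.5) and intercept 3.0.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(60, 3)), columns=["x0", "x1", "x2"])
y = 2.0 * X["x0"] - 1.0 * X["x1"] + 0.5 * X["x2"] + 3.0
betas = rolling_ols(y, X, window=30)
```

Each row of `betas` holds the coefficients estimated from the 30 rows ending at that date, which is the shape needed to predict the next day from today's m's and b.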

Mark Olivieri

Hi Grant,

That is correct, m's and b's versus time, which is what I meant by rolling regression. For the purposes of backtesting, the rolling regression needs to occur because, as I walk forward, I am predicting tomorrow's volatility based on today's m's and b's. I wish we could collaborate privately so that I can share my code with you, but it doesn't look like Quantopian supports this functionality yet?

Grant Kiehne

Hello Mark,

I think that Quantopian has some kind of code sharing thingy, but the forum seems like the right place if you are having coding problems. Just strip out whatever you consider "secret sauce" and share the rest.

Just so I understand, for a typical trading day, you'd end up computing 390 sets of m's and 390 b's. Then, at opening the next day, you could use those m's and b's to make a prediction?

If you are willing to give up one day of trading, this could be done by storing the m's and b's in context. Otherwise, you'd have to use the history API and compute on a rolling basis at the start of the backtest, rolling over the trailing window. Both could be done, but the former is a little easier (and more efficient, too), I think.

No time now, but tomorrow or Friday, I might be able to give it a go.

Grant

Mark Olivieri

Correct on the 390 sets of m's and b's to predict for the next day. Below is the code up until the regression so that you can see the error:

import pandas as pd
import numpy as np
import math as m
from itertools import repeat
from datetime import datetime
import statsmodels.api as sm

x=2
y=3
z=4
rw=30 #Regression Rolling Window

def initialize(context):
    context.stocks = symbol('SPY')
    context.position_closed = {sid(8554) : True}
    context.order_size = 25
    context.position_cost = 0
    set_commission(commission.PerTrade(cost=0.0))
    set_slippage(slippage.FixedSlippage(spread=0.00))

def handle_data(context, data):
    #for stock in context.stocks:
    close_price = data[context.stocks].close_price
    current_price = data[context.stocks].price

    a_p = history(bar_count=x*30, frequency='1d', field='price')
    a = a_p.pct_change()
    a.columns=['a']
    b_p = history(bar_count=y*30, frequency='1d', field='price')
    b = b_p.pct_change(periods=y)
    b.columns=['b']
    c_p = history(bar_count=z*30, frequency='1d', field='price')
    c = c_p.pct_change(periods=z)
    c.columns=['c']

    #RV-a
    def avol(a):
        a_ret=a.fillna(0)
        a_log=np.log1p(a_ret).fillna(0)
        a_log_mean=pd.rolling_mean(a_log, 30).fillna(0)
        sqdev_a=(a_log-a_log_mean)**2.
        avg_sqdev_a=pd.rolling_sum(sqdev_a, window=x)/x
        a_vol=np.sqrt(avg_sqdev_a).shift().fillna(0)
        return a_vol

    # RV-a, 1 day ahead - dependent variable for regression ols
    def indavol(a):
        ia_ret=a.fillna(0)
        ia_log=np.log1p(ia_ret).fillna(0)
        ia_log_mean=pd.rolling_mean(ia_log, 30).fillna(0)
        sqdev_ia=(ia_log-ia_log_mean)**2.
        avg_sqdev_ia=pd.rolling_sum(sqdev_ia, window=x)/x
        ind_a_vol=np.sqrt(avg_sqdev_ia).fillna(0)
        return ind_a_vol

    #RV-b
    def bvol(b):
        b_ret=b.fillna(0)
        b_log=np.log1p(b_ret).fillna(0)
        b_log_mean=pd.rolling_mean(b_log, 30).fillna(0)
        sqdev_b=(b_log-b_log_mean)**2.
        avg_sqdev_b=pd.rolling_sum(sqdev_b, window=y)/y
        b_vol=np.sqrt(avg_sqdev_b).shift().fillna(0)
        return b_vol

    #RV-c
    def cvol(c):
        c_ret=c.fillna(0)
        c_log=np.log1p(c_ret).fillna(0)
        c_log_mean=pd.rolling_mean(c_log, 30).fillna(0)
        sqdev_c=(c_log-c_log_mean)**2.
        avg_sqdev_c=pd.rolling_sum(sqdev_c, window=z)/z
        c_vol=np.sqrt(avg_sqdev_c).shift().fillna(0)
        return c_vol

    # Only the last row of each volatility series is kept below
    a_output=avol(a).tail(1)
    a_output=pd.Series(a_output['a'][-1])
    inda_output=indavol(a).tail(1)
    inda_output1=pd.Series(inda_output['a'][-1])
    ind=inda_output.index
    b_output=bvol(b).tail(1)
    b_output=pd.Series(b_output['b'][-1])
    c_output=cvol(c).tail(1)
    c_output=pd.Series(c_output['c'][-1])
    vol_table=pd.DataFrame({'a':a_output,'b':b_output,'c':c_output}, index=ind)
    ind_table=pd.DataFrame({'ind':inda_output1}, index=ind)

    #OLS Regression Setup
    X = vol_table
    Y = ind_table

    model=pd.ols(y=Y, x=X, window=rw)

Grant Kiehne

Thanks Mark,

I suspect the problem is due either to incompatible x's and y's, or maybe to a window that is larger than the number of rows.

Can you post the code above as an attached, running backtest, with the lines that cause the error commented out? Then it'll be straightforward to see if what's going into pd.ols(y=Y, x=X, window=rw) makes sense.

Grant

Mark Olivieri

Here you go

Grant Kiehne

I added some print statements (see attached), and the output was (run attached backtest and see log output):

2011-01-04PRINT                                  a         b         c  
2011-01-04 00:00:00+00:00  0.005775  0.004482  0.004548  
2011-01-04PRINT                                ind  
2011-01-04 00:00:00+00:00  0.005963

So, you are trying to do multiple regression on a single row of data. You'll need to figure out what's going on in your code, so that 'a', 'b', and 'c' are columns and 'ind' is a column, with the same number of rows.

Also, I'm confused. Above you said you want to fit to this model:

y = m0*x0 + m1*x1 + m2*x2+ m3*x3 + m4*x4 + b

However, you only have 3 independent variables (a, b, & c). You'll need 5, right?

Grant
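A quick shape check before calling the regression can surface this kind of problem early. The sketch below is an assumption-laden illustration: `check_regression_inputs` is a hypothetical helper, and the single row reuses the values printed in the log above.

```python
import pandas as pd

def check_regression_inputs(X, Y, window):
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    problems = []
    n_params = X.shape[1] + 1  # one slope per column plus the intercept
    if len(X) != len(Y):
        problems.append("X has %d rows, Y has %d" % (len(X), len(Y)))
    if len(X) < window:
        problems.append("only %d rows for a window of %d" % (len(X), window))
    if window < n_params:
        problems.append("window of %d cannot determine %d parameters" % (window, n_params))
    return problems

# The single row printed in the log output.
idx = pd.DatetimeIndex(["2011-01-04"], tz="UTC")
X = pd.DataFrame({"a": [0.005775], "b": [0.004482], "c": [0.004548]}, index=idx)
Y = pd.DataFrame({"ind": [0.005963]}, index=idx)
problems = check_regression_inputs(X, Y, window=30)
```

With one row and a 30-row window, the check reports too few rows; a fit with three slopes and an intercept needs at least four rows per window, and in practice many more.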

Mark Olivieri

Grant - I thought that creating lines 86 & 87 (dataframes) would put the a, b, c, and ind variables in column format? Your print statement shows that a, b, and c are in the columns. I have 3 independent variables and 1 dependent variable, so the regression is:

y = m0*x0 + m1*x1 + m2*x2 + b

Grant Kiehne

Hi Mark,

Sorry, I need to leave it to you to sort out how to get the data in shape for the fit. Maybe the Quantopian help desk or another user can lend a hand?

Grant

Mark Olivieri

No problem. I managed to figure out the problem. I have one final question about the trading logic. I have attached an example version of my trading logic using a simple SMA example. For some reason, my trading logic is generating buy signals every day and not generating any sell signals. There are two problems with this: 1) it shouldn't be generating buy signals every day, and 2) something is preventing a sell signal from triggering.

Would you mind taking a look to provide your feedback?

Grant Kiehne

Hi Mark,

I glanced over the code, but nothing obvious pops out. You might try the debugger. Adding various print statements and using 'record' can be handy, too.

Grant
