How to find the R^2 values of multiple stocks

I want to find a regression line and R^2 value for each stock in the given notebook but cannot figure it out. Can someone please help me?

3 responses

The method you should use is linear_regression (https://www.quantopian.com/docs/api-reference/pipeline-api-reference#zipline.pipeline.Factor.linear_regression). It performs a linear least-squares regression between two sets of values. The target, or independent values, can be either a single 1D set of values (e.g. SPY returns) or a 2D set of values, which the method then pairs asset by asset. The method returns a factor with 5 outputs:

  • alpha (intercept)
  • beta (slope)
  • r_value
  • p_value
  • stderr

The method is equivalent to the scipy.stats.linregress method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html#scipy.stats.linregress). The outputs are named a bit differently, though, to align with conventional investing terms.
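To make the naming concrete, here is a minimal sketch that calls scipy.stats.linregress directly and relabels its results with the pipeline output names listed above. The return numbers are made up for illustration, and the `regression_outputs` dictionary is just a hypothetical way to show the mapping; it is not part of either library.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical daily returns, invented for illustration only.
benchmark_returns = np.array([0.010, -0.005, 0.007, 0.002, -0.012, 0.004])
noise = np.array([0.0010, -0.0010, 0.0005, -0.0005, 0.0010, -0.0010])
asset_returns = 0.001 + 1.5 * benchmark_returns + noise

# x is the independent variable (the benchmark), y is the dependent one (the asset).
result = linregress(benchmark_returns, asset_returns)

# scipy result attribute -> name used by the pipeline factor
regression_outputs = {
    'alpha': result.intercept,
    'beta': result.slope,
    'r_value': result.rvalue,
    'p_value': result.pvalue,
    'stderr': result.stderr,
}

# R squared is simply the square of the correlation coefficient.
r_squared = result.rvalue ** 2
```

Since the asset returns were built as roughly 1.5 times the benchmark returns, beta should come out close to 1.5 and R squared close to 1.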

I couldn't tell from the notebook above which values you were trying to regress, so let's use the example of regressing each stock's returns against the returns of SPY. This gives the alpha, beta, and other attributes of the regression. A pipeline like the one below would work for that:


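# Imports for a Quantopian research notebook
from quantopian.pipeline import Pipeline
from quantopian.pipeline.factors import Returns
from quantopian.pipeline.filters import Q500US, StaticAssets
from quantopian.research import symbols

def make_spy_correlation_pipeline():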
    # Universe we wish to trade  
    my_universe = Q500US()  
    spy = symbols('SPY')

    # Ensure we include SPY in our universe  
    total_universe = my_universe | StaticAssets([spy])

    # Create any needed factors.  
    # Get the daily returns for each asset (including SPY).
    # window_length=2 compares the latest close with the previous day's close.
    returns = Returns(window_length=2)

    # Now make a 'slice' of data representing just the returns of SPY  
    spy_returns = returns[spy]

    # Use the 'linear_regression' method to get all the regression attributes  
    # for each asset returns vs SPY returns.  
    # Check the regression over the past quarter (about 63 trading days)  
    # We don't really need to use a mask but it sometimes speeds things up  
    regression = returns.linear_regression(target=spy_returns, regression_length=63, mask=total_universe)

    # Create our pipeline  
    # The regression factor has multiple outputs. Use dot notation to access each separately  
    # Also, square r_value to get R squared
    pipe = Pipeline(  
        columns={  
            'returns': returns,  
            'alpha': regression.alpha,  
            'beta': regression.beta,  
            'r_value': regression.r_value,  
            'p_value': regression.p_value,  
            'stderr': regression.stderr,  
            'r_squared': regression.r_value ** 2,  
        },  
        screen=total_universe,  
    )

    return pipe 

See the attached notebook for this in action. Good luck.

Dan,

Thank you so much for the timely response. I think I misspoke when I said linear regression. I have a dataframe of 165 stocks, and I would like to fit a line of best fit to each of them, with dates on the x-axis and price on the y-axis. From this, I would like to obtain an R^2 value for each stock to see how linear its trend is. Please let me know if this is possible and, if so, could you show me how? I have attached an updated notebook. Also, is it possible to create and save custom universes? Thank you so much, I appreciate your help.

The first place to check is always the built-in pipeline factors. While there is the linear_regression method noted above, it really expects two datasets to regress against each other. It isn't set up for 'best fit' line analysis. So, the next place to look is pandas (https://pandas.pydata.org/pandas-docs/version/0.18/api.html). The nice thing about pandas methods is that they often work automatically across multiple columns, or stocks in this case. Unfortunately, there isn't a built-in pandas linregress method. So, now check numpy, scipy, or statsmodels. Those are the 'go to' modules for statistical methods, and each has one or more ways to get an r squared value. I'll choose linregress from scipy.stats (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html) because I'm most familiar with it.

Now that we have found a function which gives us r squared for a single line, the next step is to use the pandas apply method to apply this function to every column in a dataframe (https://pandas.pydata.org/pandas-docs/version/0.18/generated/pandas.DataFrame.apply.html). The one issue is that the inputs a function expects often don't match what apply passes to it. The apply method passes a single pandas series, representing one column of the dataframe, to the function. The linregress function expects separate x values and y values. Moreover, the pandas series has a datetime index, which linregress doesn't know how to handle. The solution? Wrap linregress in a custom function. Something like this:

from scipy.stats import linregress

def get_r_squared(data_series):

    # Use the scipy linregress method. It expects X values and Y values to be stated explicitly
    # The x values also can't be timestamps. So just reset the index to get integers 0, 1, 2, ...
    # Set drop=True to discard the old datetime index. Use the returned copy rather than
    # inplace=True so the series passed in by `apply` isn't modified.
    data_series = data_series.reset_index(drop=True)
    x_values = data_series.index.values
    y_values = data_series.values

    # rvalue is the correlation coefficient. Square it to get r squared.
    r_squared = linregress(x_values, y_values).rvalue ** 2

    return r_squared

Now, apply that function to a dataframe of prices. Something like this:

stock_prices = get_pricing(['AAPL', 'CAT', 'IBM'], fields='price')

# Now simply use the `apply` method to get the r squared value for each stock  
r_squared_values = stock_prices.apply(get_r_squared)

That's it. Check out the attached notebook for more explanation. There are also some cells at the end for how to turn this into a custom factor to get r squared using pipeline and/or use in an algo. Good luck.
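For anyone trying this outside the Quantopian research environment, where get_pricing isn't available, here is a minimal self-contained sketch of the same idea on made-up prices. The column names 'STEADY' and 'CHOPPY', the synthetic data, and the helper name `trend_r_squared` are all hypothetical. It also assumes you want to skip missing prices while keeping their original positions on the x-axis, which is one reasonable choice but not the only one.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

def trend_r_squared(prices):
    # Keep only the rows that have a price, but keep their original
    # integer positions so a gap in the data doesn't compress the x-axis.
    valid = prices.notna().values
    if valid.sum() < 3:
        return np.nan  # too few points for a meaningful fit
    x_values = np.arange(len(prices))[valid]
    y_values = prices.values[valid]
    return linregress(x_values, y_values).rvalue ** 2

# Synthetic business-day price history, invented for illustration.
dates = pd.date_range('2020-01-01', periods=50, freq='B')
steps = np.arange(50)
synthetic_prices = pd.DataFrame({
    'STEADY': 100.0 + 0.5 * steps,          # straight-line uptrend
    'CHOPPY': 100.0 + 5.0 * np.sin(steps),  # oscillates with no real trend
}, index=dates)
synthetic_prices.iloc[10, 0] = np.nan       # simulate one missing price

# `apply` passes each column to the function as a pandas Series.
r_squared_values = synthetic_prices.apply(trend_r_squared)
print(r_squared_values)
```

The straight-line series should score very close to 1 and the oscillating one close to 0, which is the sort of separation you'd use to rank stocks by how linear their trend is.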