How to properly do PCA?

Hey Quantopian community

I'm new to the quant world (actually, to trading/investing in general). The sheer intellectual stimulation I get from reading and learning is quite awesome.

I know it's old, but I heard Delaney speak on the Chat With Traders podcast, and one thing got me sort of stumped. He mentioned various smart uses of ML, one of which is dimensionality reduction using principal component analysis.

The way I understood it - PCA helps to eliminate factors that have zero (or little) influence on asset prices.

I had a hypothesis that a certain asset, say Home Depot, was heavily affected by interest rates, inflation, unemployment, CPI, and other "FRED" (Federal Reserve Economic Data) series... kind of 360 degrees, like the US currency. So I wanted to use PCA to figure out which of these factors were irrelevant for my modeling.

Now, if my understanding is correct so far... I wrote up something which seems to "not work".

Here's the code (using quandl on my desktop, outside quantopian notebook... thus the quandl code):

import os
import sys  # To find out the script name (in argv[0])
import argparse
import logging
import pickle
import time
import datetime
from os import listdir
from os.path import isfile, join
from pprint import pprint, pformat

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import quandl
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

# quandl.ApiConfig.api_key = 'QUANDL_KEY' #uncomment if calling more than 50/day

if __name__ == '__main__':  
    df     = quandl.get("EOD/HD")

    df = df[["Adj_Close"]] # look at price only for now

    factors = [  
        #"FRED/BASE", # St. Louis Adjusted Monetary Base  
        "FRED/DFF", # Effective Federal Funds Rate  
        "FRED/UNRATE", # Civilian Unemployment Rate  
        "FRED/CPIAUCSL", # Consumer Price Index for All Urban Consumers: All Items  
        # ... so on.. you get the idea  
    ]

    for col in factors:
        # quandl.get returns a DataFrame; keep its first column as the series
        df[col] = quandl.get(col).iloc[:, 0]

    del df["Adj_Close"]  # we don't want correlation with itself
    df = df.resample("1D").asfreq()  # daily grid, since some data is monthly / weekly
    df = df.interpolate(method='linear')  # Good practice?
    df.dropna(inplace=True)

    #scale data for PCA  
    from sklearn.preprocessing import StandardScaler,scale  
    df[ df.columns ] = StandardScaler().fit_transform(df[df.columns])  

    pca = PCA()  
    pca.fit_transform(df)

    variance_ratios = pca.explained_variance_ratio_  
    factors_and_pca = dict(zip(df.columns, variance_ratios ))

    pprint (factors_and_pca)

    # list(...) so the dict views are plain sequences under Python 3
    plt.bar(range(len(factors_and_pca)), list(factors_and_pca.values()), align='center')
    plt.xticks(range(len(factors_and_pca)), list(factors_and_pca.keys()), rotation='vertical')
    plt.tight_layout()  
    plt.show()

Here are a couple of problems and questions:

1) Data periods - Some series are daily, and some are weekly / monthly. I am interpolating linearly.

2) I noticed that the 1st column almost always has the highest variance ratio. If I remove this column (FRED/BASE) via del(df["FRED/BASE"]), the next column, regardless of what it is, becomes the highest variance ratio column.

Here's the original graph: https://s2.postimg.org/uzfaj0xjt/plotted-1.png

After I remove the first column (FRED/BASE), I get https://s2.postimg.org/vdgmimhnd/plotted-2.png

In other words, every time I shift the array over by one, the graph always says that the 1st column has the highest variance ratio (which naturally sounds wrong).

Have I misunderstood? Is this not how PCA is used?

3) I am still confused - how can PCA figure out the variance ratio with respect to the price if the price column is NOT passed into PCA? I've been following this guide (http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html), which uses the Iris data packaged with scikit-learn.

2 responses
  1. PCA identifies factors (principal components) that explain maximum variance, ranked from highest to lowest.
  2. PCA can be used to reduce the dimension of a dataset. So in your case, you could run PCA to identify uncorrelated factors that explain maximum variance.
  3. Lastly, you have to regress returns on these factors. Check for stationarity before running the regression.
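The first point above can be illustrated with a minimal sketch. It assumes scikit-learn's PCA and uses synthetic random data with made-up column names (rate, unemployment, cpi) in place of the FRED series. It shows that explained_variance_ratio_ holds one value per principal component, sorted from largest to smallest, so the first bar is the tallest no matter how the input columns are ordered. To see how the original variables contribute to each component, look at the loadings in components_.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for macro series (hypothetical names, random data).
rng = np.random.default_rng(0)
common = rng.normal(size=500)
data = pd.DataFrame({
    "rate": common + 0.1 * rng.normal(size=500),
    "unemployment": -common + 0.1 * rng.normal(size=500),
    "cpi": rng.normal(size=500),
})


def fit_pca(frame):
    """Standardize the columns and fit a full PCA."""
    scaled = StandardScaler().fit_transform(frame)
    return PCA().fit(scaled)


pca = fit_pca(data)
pca_reordered = fit_pca(data[["cpi", "unemployment", "rate"]])

# One ratio per principal component, sorted from largest to smallest.
ratios = pca.explained_variance_ratio_
ratios_reordered = pca_reordered.explained_variance_ratio_

# Loadings: rows are components, columns are the original variables.
loadings = pd.DataFrame(
    pca.components_,
    index=["PC%d" % (i + 1) for i in range(pca.n_components_)],
    columns=data.columns,
)
print(ratios)
print(loadings.round(2))
```

Zipping the column names with explained_variance_ratio_ therefore pairs each name with an unrelated component, which is consistent with the "first column always wins" behaviour described in the question.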

Pravin pretty much got it right. If you have a bunch of potential data sources that could affect the quantity you're trying to model (https://www.quantopian.com/data), then you want to reduce the variables to a smaller number before you build a model. Models that rely on many variables can be bad for many reasons including overfitting and inconsistency.

https://www.quantopian.com/lectures#Instability-of-Estimates
https://www.quantopian.com/lectures#Model-Misspecification

Reducing the number of variables is known as dimensionality reduction, or filtering. PCA is one method for bringing down the total number of variables in your final model, but the final model should still be a simple model at the end of the day. Remember that you want to do most of this research in the research environment and not the backtester. There are two general approaches. One is to do all this analysis offline, then come up with a few factors based on it that you hardcode into your model. The other is to run the PCA regularly in a live fashion. This can actually be more dangerous, as you take on additional overfitting and estimation risk and your model is constantly changing.
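A rough sketch of the offline approach, assuming scikit-learn and synthetic random data (the column names, the split point, and the choice of two components are all hypothetical). Scaling, PCA, and the regression are estimated once on a research window and then applied unchanged to later data. Differencing the levels and using returns is one common way to address the stationarity concern raised above, though not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 600
# Synthetic, hypothetical data: three trending "macro" levels and a price.
levels = pd.DataFrame(
    np.cumsum(rng.normal(size=(n, 3)), axis=0),
    columns=["factor_a", "factor_b", "factor_c"],
)
price = pd.Series(100 * np.exp(np.cumsum(0.01 * rng.normal(size=n))))

# Work with changes and returns rather than levels, since trending levels
# are typically non-stationary.
X = levels.diff().dropna()
y = price.pct_change().dropna()

split = 400  # hypothetical boundary between research and later data
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# Scaler, PCA and regression are all estimated on the research window only.
model = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
model.fit(X_train, y_train)

# The frozen pipeline is then applied, unchanged, to the later data.
predictions = model.predict(X_test)
print("out-of-sample R^2:", model.score(X_test, y_test))
```

Because the data here is pure noise, the out-of-sample R^2 should be close to zero or negative. The point is the workflow, not the fit.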

Lastly, predicting the price of any one asset consistently is very difficult. These models are usually applied cross-sectionally, across many assets, so that a better signal comes out of the averaging.

https://www.quantopian.com/lectures#Factor-Analysis
