Hey Quantopian community
I'm new to the quant world (actually, to trading/investing in general), and the sheer intellectual stimulation I get from reading and learning here is awesome.
I know it's an old episode, but I heard Delaney speak on the Chat With Traders podcast, and one thing he said has me stumped. He mentioned various smart uses of ML, one of which is dimensionality reduction using principal component analysis (PCA).
The way I understood it, PCA helps to eliminate factors that have zero (or little) influence on asset prices.
I had a hypothesis that a certain asset, say Home Depot, is heavily affected by interest rates, inflation, unemployment, CPI, and other FRED (Federal Reserve economic data) series... kind of 360 degrees, like the US currency. So I wanted to use PCA to figure out which of these factors were irrelevant to my modeling.
Now, if my understanding so far is correct... I wrote up something which seems to "not work".
Here's the code (I'm using Quandl on my desktop, outside the Quantopian notebook, hence the quandl calls):
import matplotlib.pyplot as plt
import pandas as pd
import quandl
from pprint import pprint
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# quandl.ApiConfig.api_key = 'QUANDL_KEY'  # uncomment if calling more than 50/day

if __name__ == '__main__':
    df = quandl.get("EOD/HD")
    df = df[["Adj_Close"]]  # look at price only for now

    factors = [
        # "FRED/BASE",    # St. Louis Adjusted Monetary Base
        "FRED/DFF",       # Effective Federal Funds Rate
        "FRED/UNRATE",    # Civilian Unemployment Rate
        "FRED/CPIAUCSL",  # Consumer Price Index for All Urban Consumers: All Items
        # ... and so on, you get the idea
    ]
    for col in factors:
        # FRED datasets come back with a single "Value" column
        df[col] = quandl.get(col)["Value"]
    del df["Adj_Close"]  # we don't want correlation with itself

    df = df.resample("1D").asfreq()       # since some data is monthly / weekly
    df = df.interpolate(method='linear')  # good practice?
    df.dropna(inplace=True)

    # scale data for PCA
    df[df.columns] = StandardScaler().fit_transform(df[df.columns])

    pca = PCA()
    pca.fit(df)
    variance_ratios = pca.explained_variance_ratio_
    factors_and_pca = dict(zip(df.columns, variance_ratios))
    pprint(factors_and_pca)

    plt.bar(range(len(factors_and_pca)), list(factors_and_pca.values()), align='center')
    plt.xticks(range(len(factors_and_pca)), list(factors_and_pca.keys()), rotation='vertical')
    plt.tight_layout()
    plt.show()
Now here are a couple of problems and questions:
1) Data periods - some series are daily, and some are weekly or monthly, so I'm interpolating linearly to get everything onto a daily grid.
2) I noticed that the first column almost always has the highest variance ratio. If I remove that column (FRED/BASE) via del df["FRED/BASE"], the next column, regardless of what it is, becomes the one with the highest variance ratio.
Here's the original graph: https://s2.postimg.org/uzfaj0xjt/plotted-1.png
After I remove the first column (FRED/BASE), I get https://s2.postimg.org/vdgmimhnd/plotted-2.png
In other words, every time I shift the columns over by one, the graph still says the first column has the highest variance ratio (which naturally sounds wrong).
Am I misunderstanding something? Is this not how PCA is used?
3) I'm also still confused about this: how can PCA figure out variance ratios with respect to the price if the price column is never passed into PCA? I've been following this guide (http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html), which uses the Iris data packaged with scikit-learn.
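To make question 1 concrete, here's a toy version of my upsampling step on a completely made-up monthly series (the numbers are fake and just stand in for something like CPI; the point is to show what the resample/interpolate lines do):

```python
import pandas as pd

# A fake monthly series standing in for a FRED dataset (made-up values).
monthly = pd.Series(
    [100.0, 103.0, 109.0],
    index=pd.date_range("2017-01-31", periods=3, freq="M"),
)

# Upsample to daily and fill the gaps with straight lines,
# the same resample/interpolate step as in my script.
daily = monthly.resample("1D").asfreq().interpolate(method="linear")

print(daily.head())
print(len(daily))  # 60 daily points, 2017-01-31 through 2017-03-31
```

Is this linear fill a sane way to mix daily prices with monthly macro series, or does it smuggle in fake information between the real observations?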
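And for question 2, here's a sanity check I ran on pure random data (no FRED series at all). The ratios always come out sorted largest-to-smallest no matter how I order the columns, which makes me wonder whether explained_variance_ratio_ even lines up with my input columns at all:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(500, 4)  # 4 fake "factors", completely random

pca = PCA()
pca.fit(X)
ratios = pca.explained_variance_ratio_

print(ratios)        # always sorted in descending order...
print(ratios.sum())  # ...and they sum to 1

# Reversing the column order changes nothing about the output:
pca2 = PCA().fit(X[:, ::-1])
print(pca2.explained_variance_ratio_)
```

If the ratios are identical regardless of column order, what exactly am I plotting when I zip them against df.columns?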