Hey Quantopian community
I'm new to the quant world (actually, to trading/investing in general), and the sheer intellectual stimulation I get from reading and learning here is awesome.
I know it's an old episode, but I heard Delaney speak on the Chat With Traders podcast, and one thing he said has me stumped. He mentioned various smart uses of ML, one of which is dimensionality reduction using principal component analysis (PCA).
The way I understood it, PCA helps to eliminate factors that have zero (or little) influence on asset prices.
I had a hypothesis that a certain asset, say Home Depot, is heavily affected by interest rates, inflation, unemployment, CPI, and other FRED (Federal Reserve economic data) series... kind of 360 degrees, like the US currency. So I wanted to use PCA to figure out which of these factors were irrelevant to my modeling.
Now, if my understanding so far is correct... I wrote up something which seems to "not work".
Here's the code (I'm using Quandl on my desktop, outside the Quantopian notebook, hence the quandl calls):
import matplotlib.pyplot as plt
import pandas as pd
import quandl
from pprint import pprint
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# quandl.ApiConfig.api_key = 'QUANDL_KEY'  # uncomment if calling more than 50/day

if __name__ == '__main__':
    df = quandl.get("EOD/HD")
    df = df[["Adj_Close"]]  # look at price only for now

    factors = [
        # "FRED/BASE",    # St. Louis Adjusted Monetary Base
        "FRED/DFF",       # Effective Federal Funds Rate
        "FRED/UNRATE",    # Civilian Unemployment Rate
        "FRED/CPIAUCSL",  # Consumer Price Index for All Urban Consumers: All Items
        # ... and so on, you get the idea
    ]
    for col in factors:
        # FRED datasets come back with a single "Value" column
        df[col] = quandl.get(col)["Value"]
    del df["Adj_Close"]  # we don't want correlation with itself

    df = df.resample("1D").asfreq()       # since some data is monthly / weekly
    df = df.interpolate(method='linear')  # good practice?
    df.dropna(inplace=True)

    # scale data for PCA
    df[df.columns] = StandardScaler().fit_transform(df[df.columns])

    pca = PCA()
    pca.fit(df)
    variance_ratios = pca.explained_variance_ratio_
    factors_and_pca = dict(zip(df.columns, variance_ratios))
    pprint(factors_and_pca)

    plt.bar(range(len(factors_and_pca)), list(factors_and_pca.values()), align='center')
    plt.xticks(range(len(factors_and_pca)), list(factors_and_pca.keys()), rotation='vertical')
    plt.tight_layout()
    plt.show()
Now here are a couple of problems and questions:
1) Data periods - some series are daily, and some are weekly or monthly, so I'm interpolating linearly to get everything onto a daily grid.
2) I noticed that the first column almost always has the highest variance ratio. If I remove that column (FRED/BASE) via del df["FRED/BASE"], the next column, regardless of what it is, becomes the one with the highest variance ratio.
Here's the original graph: https://s2.postimg.org/uzfaj0xjt/plotted-1.png
After I remove the first column (FRED/BASE), I get https://s2.postimg.org/vdgmimhnd/plotted-2.png
In other words, every time I shift the columns over by one, the graph still says the first column has the highest variance ratio (which naturally sounds wrong).
Am I misunderstanding something? Is this not how PCA is used?
3) I'm also still confused about this: how can PCA figure out variance ratios with respect to the price if the price column is never passed into PCA? I've been following this guide (http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html), which uses the Iris data packaged with scikit-learn.
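To make question 1 concrete, here's a toy version of my upsampling step on a completely made-up monthly series (the numbers are fake and just stand in for something like CPI; the point is to show what the resample/interpolate lines do):

```python
import pandas as pd

# A fake monthly series standing in for a FRED dataset (made-up values).
monthly = pd.Series(
    [100.0, 103.0, 109.0],
    index=pd.date_range("2017-01-31", periods=3, freq="M"),
)

# Upsample to daily and fill the gaps with straight lines,
# the same resample/interpolate step as in my script.
daily = monthly.resample("1D").asfreq().interpolate(method="linear")

print(daily.head())
print(len(daily))  # 60 daily points, 2017-01-31 through 2017-03-31
```

Is this linear fill a sane way to mix daily prices with monthly macro series, or does it smuggle in fake information between the real observations?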
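And for question 2, here's a sanity check I ran on pure random data (no FRED series at all). The ratios always come out sorted largest-to-smallest no matter how I order the columns, which makes me wonder whether explained_variance_ratio_ even lines up with my input columns at all:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(500, 4)  # 4 fake "factors", completely random

pca = PCA()
pca.fit(X)
ratios = pca.explained_variance_ratio_

print(ratios)        # always sorted in descending order...
print(ratios.sum())  # ...and they sum to 1

# Reversing the column order changes nothing about the output:
pca2 = PCA().fit(X[:, ::-1])
print(pca2.explained_variance_ratio_)
```

If the ratios are identical regardless of column order, what exactly am I plotting when I zip them against df.columns?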