PCA explained variance problem

Hi all!
I am fairly new to Quantopian. I am trying to use PCA to get statistical risk factors for the returns of the S&P 500. Attached is some very simple code which works, but the results puzzle me. Isn't 9% explained variance for the first PC far too low? CAPM generally has an R^2 of around 50-60%, which, as far as I understand, is basically the share of variance explained. So shouldn't the first PC explain AT LEAST as much, given that by construction it is the linear combination of the underlying returns in the direction of greatest variance in the data?

Any feedback would be deeply appreciated!

1 response

Hi Gregor,

I am not an expert on PCA, but I was curious myself and looked into your question, so here are my thoughts. If you look at the formula for calculating explained variance in PCA, it is very similar to that of R-squared: both are a ratio of explained variance to total variance. However, the setup of the problem is a bit different. In linear regression, there is a dependent variable whose variance we are trying to explain with the given input features. In PCA, there is no dependent variable. We simply have a bunch of features/variables (in your example, the variables are individual stock returns). PCA is a form of dimensionality reduction. In other words, it tries to reduce the number of variables we have while retaining as much "information" (aka variance) as possible. (You may already know this, so I apologize if I gave too much information.)
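To make the difference in setup concrete, here is a minimal sketch on synthetic data (not the notebook's data). It assumes a hypothetical one-factor world in which every stock is `beta * market + noise`; all names and parameters are made up for illustration. It shows that the explained variance ratio of the first PC equals the variance-weighted average of the per-stock R-squared values obtained by regressing each stock on the PC1 scores, so it is a statement about all stocks at once rather than about any single stock.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_stocks = 500, 50

# Assumption: synthetic one-factor returns, each stock = beta * market + noise.
market = rng.normal(0.0, 0.01, size=n_days)
betas = rng.uniform(0.5, 1.5, size=n_stocks)
noise = rng.normal(0.0, 0.02, size=(n_days, n_stocks))
returns = np.outer(market, betas) + noise

# PCA via eigendecomposition of the sample covariance matrix.
X = returns - returns.mean(axis=0)
cov = X.T @ X / (n_days - 1)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
explained_ratio = eigvals[-1] / eigvals.sum()  # share of TOTAL variance on PC1
pc1_scores = X @ eigvecs[:, -1]

# Regress every stock on the PC1 scores (data is centered, so no intercept).
slopes = X.T @ pc1_scores / (pc1_scores @ pc1_scores)
resid = X - np.outer(pc1_scores, slopes)
total_ss = (X ** 2).sum(axis=0)
r2_per_stock = 1.0 - (resid ** 2).sum(axis=0) / total_ss

# Weight each stock's R^2 by its share of the total variance.
weighted_r2 = np.average(r2_per_stock, weights=total_ss)

print(f"PC1 explained variance ratio: {explained_ratio:.4f}")
print(f"Variance-weighted mean R^2:   {weighted_r2:.4f}")
print(f"Highest single-stock R^2:     {r2_per_stock.max():.4f}")
```

The two printed ratios agree up to floating-point error, while individual stocks can sit well above or below that average.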

I tried experimenting by modifying your notebook a little bit. The first thing I noticed was that you were only using 63 days of data because you dropped any rows where stocks did not have returns data available. I added more stocks and more days while filling in NaNs with 0 values. (I don't know if this is the best way to deal with this, but it seemed to work for now).
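A small sketch of why dropping rows can leave so few days, using made-up data in plain NumPy rather than the notebook's actual returns. It assumes, purely for illustration, that a handful of stocks have no data for most of the window.

```python
import numpy as np

rng = np.random.default_rng(3)
n_days, n_stocks = 250, 40
returns = rng.normal(0.0, 0.01, size=(n_days, n_stocks))

# Assumption: five hypothetical stocks have no returns for the first 200 days.
returns[:200, :5] = np.nan

# Option 1: drop every day on which ANY stock is missing.
complete_rows = ~np.isnan(returns).any(axis=1)
dropped = returns[complete_rows]

# Option 2: keep every day and replace missing returns with 0.
filled = np.where(np.isnan(returns), 0.0, returns)

print("days kept when dropping rows:", dropped.shape[0])
print("days kept when filling zeros:", filled.shape[0])
```

Filling with zeros keeps the full history, but it also understates the variance of the stocks that had gaps, so it is a trade-off rather than a free fix.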

To begin with, I calculated the first principal component. The explained variance is a bit higher (21%) than the 9% in your example (probably just due to having more data). I then calculated factor returns using the first principal component. Next, I ran some regressions of a random stock's returns on the resulting factor returns (you can easily change the stock if you'd like). Note: the r-squared value of the regression depends on the stock you choose.
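The steps above could be sketched roughly as follows. This is not the notebook's code: the data is synthetic, the helper `r_squared` is a hypothetical name, and an equal-weighted average of all stocks stands in for a market index.

```python
import numpy as np

rng = np.random.default_rng(1)
n_days, n_stocks = 750, 100

# Assumption: synthetic returns matrix (rows = days, columns = stocks).
market = rng.normal(0.0, 0.01, size=n_days)
betas = rng.uniform(0.5, 1.5, size=n_stocks)
returns = np.outer(market, betas) + rng.normal(0.0, 0.015, size=(n_days, n_stocks))

# Step 1: first principal component of the covariance matrix.
centered = returns - returns.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
pc1 = eigvecs[:, -1]
if pc1.sum() < 0:          # eigenvector sign is arbitrary; make it long-only-ish
    pc1 = -pc1
explained = eigvals[-1] / eigvals.sum()

# Step 2: factor returns = returns of the portfolio with PC1 loadings as weights.
factor_returns = returns @ pc1

# Step 3: regress a single series on the factor returns.
def r_squared(y, x):
    """R^2 of an OLS regression of y on x with an intercept."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dev = y - y.mean()
    return 1.0 - (resid @ resid) / (dev @ dev)

stock_idx = 0                               # change to try another stock
r2_stock = r_squared(returns[:, stock_idx], factor_returns)
r2_index = r_squared(returns.mean(axis=1), factor_returns)

print(f"PC1 explained variance ratio: {explained:.3f}")
print(f"R^2, stock {stock_idx} vs factor:    {r2_stock:.3f}")
print(f"R^2, index proxy vs factor:   {r2_index:.3f}")
```

Changing `stock_idx` shows how much the single-stock r-squared moves from one stock to the next, while the index proxy stays closely tied to the factor.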

In my example, I used ticker symbol "GS". The r-squared using the PCA factor was 48.7%. The r-squared from regressing "GS" vs. "SPY" was about 54%. Also, when I regressed SPY returns vs. the PCA factor returns, the r-squared was about 74% (i.e. a correlation of about 0.86 between SPY and the PCA factor). Therefore, it does appear that the PCA methodology is picking up on this "market factor".
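The step from an r-squared of 74% to a correlation of about 0.86 relies on the fact that, in a regression with a single explanatory variable and an intercept, the r-squared equals the squared correlation. A quick check on made-up data (the series `x` and `y` are arbitrary, not SPY or the factor):

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumption: two arbitrary correlated series standing in for factor and index.
x = rng.normal(size=1000)
y = 0.8 * x + rng.normal(scale=0.5, size=1000)

# Simple OLS regression of y on x with an intercept.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
r2 = 1.0 - resid.var() / y.var()

# Pearson correlation between the two series.
corr = np.corrcoef(x, y)[0, 1]
print(f"R^2 = {r2:.4f}, corr^2 = {corr ** 2:.4f}")

# Applying the same identity to the 74% figure quoted above.
implied_corr = np.sqrt(0.74)
print(f"sqrt(0.74) = {implied_corr:.2f}")
```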

That is just my 2 cents. I would be interested to hear from others more experienced in the matter.