
Analyzing Alpha in 10-Ks and 10-Qs (Alphalens Study)

THESIS:

Major text changes in 10-K and 10-Q filings over time indicate significant decreases in future returns. We find alpha in shorting the companies with the largest text changes in their filings and buying the companies with the smallest text changes in their filings.

Publicly listed companies in the U.S. are required by law to file "10-K" and "10-Q" reports with the Securities and Exchange Commission (SEC). These reports provide both qualitative and quantitative descriptions of the company's performance, from revenue numbers to qualitative risk factors.

When companies file 10-Ks and 10-Qs, they are required to disclose certain pieces of information. For example, companies are required to report information about "significant pending lawsuits or other legal proceedings". As such, 10-Ks and 10-Qs often hold valuable insights into a company's performance.

These insights, however, can be difficult to access. The average 10-K was 42,000 words long in 2013; to put that in perspective, that's roughly one-fifth the length of Moby-Dick. Beyond sheer length, dense language and heavy boilerplate can further obscure the filings' true meaning for many investors.

The good news? We might not need to read companies' 10-Ks and 10-Qs from cover to cover in order to derive value from the information they contain. Specifically, Lauren Cohen, Christopher Malloy, and Quoc Nguyen argue in their recent paper, Lazy Prices, that we can simply analyze textual changes in 10-Ks and 10-Qs to predict companies' future stock returns. For an overview of this paper from one of the authors, see the Lazy Prices interview from QuantCon 2018.

To understand how the dataset used in this post was created, be sure to see the Data Processing notebook.
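
As a rough sketch of the two similarity measures behind the dataset, here's a simplified illustration (it assumes plain whitespace tokenization; the Data Processing notebook handles the real parsing and cleaning of the raw filings):

from collections import Counter
import math

def jaccard_similarity(doc_a, doc_b):
    # Jaccard: size of the word-set intersection over the size of the union
    a, b = set(doc_a.split()), set(doc_b.split())
    return len(a & b) / float(len(a | b))

def cosine_similarity(doc_a, doc_b):
    # Cosine: dot product of the word-count vectors over the product of their norms
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = float(sum(a[w] * b[w] for w in a))
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

Both measures equal 1.0 for identical documents and fall toward 0 as the texts diverge, so a low score flags a filing that changed substantially from its predecessor.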

In [3]:
from quantopian.research import run_pipeline
from quantopian.pipeline import Pipeline
from quantopian.pipeline.filters import QTradableStocksUS
import alphalens

1. Loading Data from Self-Serve Data

In this step, we import the data from a local .csv file via the Self-Serve Data feature.

To do this, we begin with the local .csv file generated by the Data Processing notebook. We then upload it under the Self-Serve Data tab on the Account > Data page; this makes it available for import into a research notebook or pipeline.

For more on importing data using Self-Serve, check out the examples in this forum post.
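
For reference, the uploaded file is a long-format CSV with one row per asset per date. A hypothetical excerpt (the column names mirror the dataset fields used below; the rows are purely illustrative):

date,symbol,jaccard_score,cosine_score
2014-02-03,AAPL,0.8412,0.9103
2014-02-03,MSFT,0.7954,0.8821
2014-02-04,IBM,0.7231,0.8515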

In [4]:
from quantopian.pipeline.data.user_5b102ae91141120040958556 import lazyprices3_90d

2. Formatting Factor Values

The data we uploaded is in a tabular form, with one row per asset per day. However, Alphalens requires that we provide data in a specific format with specific labels. Fortunately, Pipeline will do all the "dirty work" for us.

In this step, we'll use Pipeline to put our data in a form that can be ingested by Alphalens.

In [5]:
def make_pipeline():
    # Most recent similarity scores from the uploaded dataset
    jaccard_score = lazyprices3_90d.jaccard_score.latest
    cosine_score = lazyprices3_90d.cosine_score.latest

    # Restrict to the QTradableStocksUS universe, dropping assets with missing scores
    screen = (QTradableStocksUS() & jaccard_score.notnull() & cosine_score.notnull())

    return Pipeline(
        columns={'jaccard_score': jaccard_score, 'cosine_score': cosine_score},
        screen=screen,
    )
In [6]:
data = run_pipeline(make_pipeline(), '2013-01-01', '2018-05-01')

3. Get Pricing Data

Since an alpha factor is supposed to predict the returns of an asset, we need a record of actual prices to evaluate how well our factor performed. In this step, we get pricing data for the assets in our dataset.

In [11]:
# Get list of relevant assets
assets = data.index.levels[1]
In [12]:
# Get pricing data for those assets
pricing_end_date = '2018-08-01' # Pricing end date should be later so we can get forward returns
prices = get_pricing(assets,
                     start_date='2013-01-01',
                     end_date=pricing_end_date,
                     fields='open_price')

4. Run Alphalens

Now that we have both our alpha factor and pricing datasets, we're ready to run our Alphalens study.

Since we have both Jaccard and cosine similarity scores, we'll run two separate Alphalens tearsheets.

4a. Jaccard Similarity Factor

Before creating a tearsheet, we'll use get_clean_factor_and_forward_returns to get our data in the correct format to be ingested by Alphalens.

Note on parameters:

The periods parameter in get_clean_factor_and_forward_returns sets the horizons (in trading days) over which we assess the performance of our alpha factor. Here, we'll look at longer horizons as well as shorter ones, since new 10-Ks and 10-Qs only arrive quarterly or annually and the original paper evaluates a three-month holding period.

The quantiles parameter allows us to set the number of bins into which we divide our assets based on their factor values. Since the original paper uses 5 quantiles to estimate portfolio performance, we'll also use 5 quantiles.

In [8]:
jaccard_factor = data[['jaccard_score']]

Shorter periods (1, 5, 10 days)

In [32]:
factor_data_j1 = alphalens.utils.get_clean_factor_and_forward_returns(
    jaccard_factor,
    prices=prices,
    quantiles=5,
    periods=(1, 5, 10),
)
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
In [33]:
alphalens.tears.create_full_tear_sheet(factor_data_j1, by_group=False);
Quantiles Statistics
min max mean std count count %
factor_quantile
1 0.225926 0.745363 0.625227 0.058582 294577 20.023097
2 0.659586 0.797164 0.704151 0.030120 294080 19.989315
3 0.705064 0.831345 0.745011 0.029979 294068 19.988499
4 0.745190 0.861789 0.784330 0.028935 294075 19.988975
5 0.785714 1.000000 0.840791 0.031515 294386 20.010114
Returns Analysis
1D 5D 10D
Ann. alpha 0.023 0.023 0.022
beta -0.004 -0.023 -0.027
Mean Period Wise Return Top Quantile (bps) 1.338 1.161 1.015
Mean Period Wise Return Bottom Quantile (bps) -0.791 -0.695 -0.704
Mean Period Wise Spread (bps) 2.170 1.873 1.723
Information Analysis
1D 5D 10D
IC Mean 0.004 0.007 0.009
IC Std. 0.041 0.042 0.042
Risk-Adjusted IC 0.106 0.161 0.208
t-stat(IC) 3.058 4.667 6.032
p-value(IC) 0.002 0.000 0.000
IC Skew -0.054 0.018 -0.109
IC Kurtosis 0.180 -0.193 -0.262
Turnover Analysis
1D 5D 10D
Quantile 1 Mean Turnover 0.013 0.054 0.096
Quantile 2 Mean Turnover 0.022 0.089 0.151
Quantile 3 Mean Turnover 0.025 0.101 0.164
Quantile 4 Mean Turnover 0.023 0.094 0.154
Quantile 5 Mean Turnover 0.015 0.062 0.108
1D 5D 10D
Mean Factor Rank Autocorrelation 0.992 0.963 0.93

Midrange periods (1, 2, 3 months)

In [11]:
factor_data_j2 = alphalens.utils.get_clean_factor_and_forward_returns(
    jaccard_factor,
    prices=prices,
    quantiles=5,
    periods=(20, 40, 60),
)
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
In [12]:
alphalens.tears.create_full_tear_sheet(factor_data_j2, by_group=False);
Quantiles Statistics
min max mean std count count %
factor_quantile
1 0.225926 0.745363 0.625249 0.058598 294497 20.021905
2 0.659586 0.797164 0.704165 0.030122 294021 19.989544
3 0.705731 0.831345 0.745027 0.029980 294014 19.989068
4 0.745230 0.861789 0.784344 0.028932 294017 19.989272
5 0.785714 1.000000 0.840801 0.031511 294325 20.010212
Returns Analysis
20D 40D 60D
Ann. alpha 0.025 0.026 0.026
beta -0.040 -0.057 -0.060
Mean Period Wise Return Top Quantile (bps) 18.899 19.663 18.713
Mean Period Wise Return Bottom Quantile (bps) -15.490 -14.392 -14.657
Mean Period Wise Spread (bps) 34.396 33.724 33.079
Information Analysis
20D 40D 60D
IC Mean 0.012 0.018 0.021
IC Std. 0.043 0.039 0.039
Risk-Adjusted IC 0.285 0.446 0.540
t-stat(IC) 8.263 12.909 15.636
p-value(IC) 0.000 0.000 0.000
IC Skew -0.100 0.217 -0.176
IC Kurtosis -0.324 -0.155 -0.116
Turnover Analysis
20D 40D 60D
Quantile 1 Mean Turnover 0.170 0.298 0.378
Quantile 2 Mean Turnover 0.250 0.413 0.516
Quantile 3 Mean Turnover 0.265 0.437 0.547
Quantile 4 Mean Turnover 0.249 0.409 0.507
Quantile 5 Mean Turnover 0.188 0.328 0.411
20D 40D 60D
Mean Factor Rank Autocorrelation 0.867 0.746 0.676

Longest periods (1.5, 3, 4.5 months)

In [34]:
factor_data_j3 = alphalens.utils.get_clean_factor_and_forward_returns(
    jaccard_factor,
    prices=prices,
    quantiles=5,
    periods=(30, 60, 90),
)
Dropped 3.4% entries from factor data: 3.4% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
In [35]:
alphalens.tears.create_full_tear_sheet(factor_data_j3, by_group=False);
Quantiles Statistics
min max mean std count count %
factor_quantile
1 0.225926 0.745363 0.625607 0.058730 284462 20.022637
2 0.659586 0.797164 0.704506 0.030495 283987 19.989203
3 0.705731 0.831345 0.745427 0.030366 283983 19.988921
4 0.745230 0.861789 0.784718 0.029259 283983 19.988921
5 0.785714 1.000000 0.841041 0.031631 284287 20.010319
Returns Analysis
30D 60D 90D
Ann. alpha 0.025 0.025 0.022
beta -0.049 -0.061 -0.062
Mean Period Wise Return Top Quantile (bps) 27.898 27.322 24.926
Mean Period Wise Return Bottom Quantile (bps) -25.347 -24.141 -19.310
Mean Period Wise Spread (bps) 52.827 50.880 43.530
Information Analysis
30D 60D 90D
IC Mean 0.015 0.021 0.021
IC Std. 0.043 0.039 0.040
Risk-Adjusted IC 0.347 0.528 0.520
t-stat(IC) 9.879 15.051 14.831
p-value(IC) 0.000 0.000 0.000
IC Skew 0.136 -0.155 -0.274
IC Kurtosis -0.188 -0.149 -0.359
Turnover Analysis
30D 60D 90D
Quantile 1 Mean Turnover 0.245 0.390 0.455
Quantile 2 Mean Turnover 0.347 0.533 0.601
Quantile 3 Mean Turnover 0.368 0.564 0.629
Quantile 4 Mean Turnover 0.345 0.523 0.587
Quantile 5 Mean Turnover 0.270 0.424 0.475
30D 60D 90D
Mean Factor Rank Autocorrelation 0.799 0.665 0.612

4b. Cosine Similarity Factor

We'll put our cosine score factor through the same process.

In [9]:
cosine_factor = data[['cosine_score']]
Shorter periods (1, 5, 10 days)

In [16]:
factor_data_c1 = alphalens.utils.get_clean_factor_and_forward_returns(
    cosine_factor,
    prices=prices,
    quantiles=5,
    periods=(1, 5, 10),
)
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
In [17]:
alphalens.tears.create_full_tear_sheet(factor_data_c1, by_group=False);
Quantiles Statistics
min max mean std count count %
factor_quantile
1 0.430380 0.854624 0.769615 0.044902 294577 20.023097
2 0.796210 0.887623 0.826529 0.020148 294075 19.988975
3 0.827838 0.907965 0.853852 0.019162 294075 19.988975
4 0.854240 0.925837 0.879042 0.017803 294073 19.988839
5 0.880126 1.000000 0.913292 0.018437 294386 20.010114
Returns Analysis
1D 5D 10D
Ann. alpha 0.023 0.023 0.023
beta -0.006 -0.026 -0.029
Mean Period Wise Return Top Quantile (bps) 1.311 1.129 0.984
Mean Period Wise Return Bottom Quantile (bps) -0.886 -0.773 -0.768
Mean Period Wise Spread (bps) 2.242 1.927 1.763
Information Analysis
1D 5D 10D
IC Mean 0.004 0.007 0.009
IC Std. 0.041 0.042 0.042
Risk-Adjusted IC 0.105 0.160 0.207
t-stat(IC) 3.036 4.640 5.997
p-value(IC) 0.002 0.000 0.000
IC Skew -0.055 0.019 -0.108
IC Kurtosis 0.185 -0.190 -0.263
Turnover Analysis
1D 5D 10D
Quantile 1 Mean Turnover 0.013 0.054 0.095
Quantile 2 Mean Turnover 0.022 0.089 0.151
Quantile 3 Mean Turnover 0.025 0.101 0.164
Quantile 4 Mean Turnover 0.023 0.094 0.154
Quantile 5 Mean Turnover 0.015 0.062 0.108
1D 5D 10D
Mean Factor Rank Autocorrelation 0.992 0.963 0.93

Midrange periods (1, 2, 3 months)

In [18]:
factor_data_c2 = alphalens.utils.get_clean_factor_and_forward_returns(
    cosine_factor,
    prices=prices,
    quantiles=5,
    periods=(20, 40, 60),
)
Dropped 0.0% entries from factor data: 0.0% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
In [19]:
alphalens.tears.create_full_tear_sheet(factor_data_c2, by_group=False);
Quantiles Statistics
min max mean std count count %
factor_quantile
1 0.430380 0.854624 0.769630 0.044914 294497 20.021905
2 0.796210 0.887623 0.826539 0.020149 294017 19.989272
3 0.827915 0.907965 0.853862 0.019163 294020 19.989476
4 0.854252 0.925837 0.879051 0.017802 294015 19.989136
5 0.880126 1.000000 0.913298 0.018435 294325 20.010212
Returns Analysis
20D 40D 60D
Ann. alpha 0.026 0.027 0.026
beta -0.042 -0.058 -0.061
Mean Period Wise Return Top Quantile (bps) 18.365 19.264 18.333
Mean Period Wise Return Bottom Quantile (bps) -16.370 -14.340 -14.254
Mean Period Wise Spread (bps) 34.889 33.359 32.353
Information Analysis
20D 40D 60D
IC Mean 0.012 0.018 0.021
IC Std. 0.043 0.040 0.039
Risk-Adjusted IC 0.284 0.443 0.538
t-stat(IC) 8.207 12.836 15.573
p-value(IC) 0.000 0.000 0.000
IC Skew -0.097 0.220 -0.172
IC Kurtosis -0.331 -0.155 -0.123
Turnover Analysis
20D 40D 60D
Quantile 1 Mean Turnover 0.170 0.298 0.378
Quantile 2 Mean Turnover 0.249 0.412 0.515
Quantile 3 Mean Turnover 0.265 0.437 0.547
Quantile 4 Mean Turnover 0.248 0.409 0.506
Quantile 5 Mean Turnover 0.188 0.328 0.411
20D 40D 60D
Mean Factor Rank Autocorrelation 0.867 0.746 0.676

Longest periods (1.5, 3, 4.5 months)

In [13]:
factor_data_c3 = alphalens.utils.get_clean_factor_and_forward_returns(
    cosine_factor,
    prices=prices,
    quantiles=5,
    periods=(30, 60, 90),
)
Dropped 3.4% entries from factor data: 3.4% in forward returns computation and 0.0% in binning phase (set max_loss=0 to see potentially suppressed Exceptions).
max_loss is 35.0%, not exceeded: OK!
In [14]:
alphalens.tears.create_full_tear_sheet(factor_data_c3, by_group=False);
Quantiles Statistics
min max mean std count count %
factor_quantile
1 0.430380 0.854624 0.769891 0.045011 284462 20.022637
2 0.796210 0.887623 0.826763 0.020395 283983 19.988921
3 0.827915 0.907965 0.854116 0.019408 283989 19.989343
4 0.854252 0.925837 0.879279 0.018000 283981 19.988780
5 0.880126 1.000000 0.913437 0.018503 284287 20.010319
Returns Analysis
30D 60D 90D
Ann. alpha 0.026 0.026 0.023
beta -0.050 -0.062 -0.064
Mean Period Wise Return Top Quantile (bps) 27.245 26.756 24.398
Mean Period Wise Return Bottom Quantile (bps) -25.482 -23.400 -19.011
Mean Period Wise Spread (bps) 52.488 49.667 42.807
Information Analysis
30D 60D 90D
IC Mean 0.015 0.021 0.021
IC Std. 0.043 0.039 0.040
Risk-Adjusted IC 0.344 0.526 0.519
t-stat(IC) 9.811 14.979 14.783
p-value(IC) 0.000 0.000 0.000
IC Skew 0.140 -0.150 -0.265
IC Kurtosis -0.189 -0.156 -0.366
Turnover Analysis
30D 60D 90D
Quantile 1 Mean Turnover 0.244 0.390 0.455
Quantile 2 Mean Turnover 0.347 0.531 0.602
Quantile 3 Mean Turnover 0.368 0.564 0.629
Quantile 4 Mean Turnover 0.345 0.523 0.587
Quantile 5 Mean Turnover 0.270 0.424 0.475
30D 60D 90D
Mean Factor Rank Autocorrelation 0.799 0.665 0.613

A few notes about these tearsheets:

  • In the "Cumulative Return by Quantile" plots, we want to see the top and bottom quantile "fingers" move across the plot without crossing. They stay clearly separated over all periods, indicating that our factor does a good job of separating high- and low-returning stocks.
  • In the "IC Normal Dist Q-Q" plots, we want to see an S-shaped curve, indicating a distribution with fatter tails than a Normal (desirable, since the stocks with extreme factor values are the ones we want to long/short). We do see reasonably S-shaped curves over all periods.
  • For the top and bottom quantiles, mean turnover looks reasonable -- hovering around roughly 30-40% at the longer horizons, which is well within the contest guideline of 5-65%.

How does this compare to the paper's findings? The original paper found a spread of 31 bps in excess return between the 1st and 5th quantile for the cosine similarity score over a three-month holding period, and a spread of 53 bps for the Jaccard similarity score.

Keep in mind that the mean period-wise return calculated by Alphalens is a rate of return, so it's difficult to compare the Alphalens numbers exactly with the original results. That said, we see spreads of roughly 20-50 bps between the top and bottom quantiles (depending on the factor and period), so our results seem generally in line with the paper's findings.

The next step? Put it in an algorithm and see how it performs in real-world conditions.
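
As a starting point, a minimal sketch of such an algorithm on Quantopian might look like the following (it assumes the same make_pipeline() defined above is copied into the algorithm; the weekly schedule and constraint bounds are illustrative, not tuned):

import quantopian.optimize as opt
from quantopian.algorithm import attach_pipeline, pipeline_output, order_optimal_portfolio

def initialize(context):
    attach_pipeline(make_pipeline(), 'lazy_prices')
    # New scores only arrive with new filings, so a weekly rebalance should suffice
    schedule_function(rebalance, date_rules.week_start(), time_rules.market_open())

def rebalance(context, data):
    scores = pipeline_output('lazy_prices')['jaccard_score']
    # Demean so we long high-similarity names (small text changes)
    # and short low-similarity names (large text changes)
    objective = opt.MaximizeAlpha(scores - scores.mean())
    constraints = [
        opt.MaxGrossExposure(1.0),
        opt.DollarNeutral(),
        opt.PositionConcentration.with_equal_bounds(-0.01, 0.01),
    ]
    order_optimal_portfolio(objective, constraints)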