Major text changes in 10-K and 10-Q filings over time indicate significant decreases in future returns. We find alpha in shorting the companies with the largest text changes in their filings and buying the companies with the smallest text changes in their filings.
Publicly listed companies in the U.S. are required by law to file "10-K" and "10-Q" reports with the Securities and Exchange Commission (SEC). These reports provide both qualitative and quantitative descriptions of the company's performance, from revenue numbers to qualitative risk factors.
When companies file 10-Ks and 10-Qs, they are required to disclose certain pieces of information. For example, companies are required to report information about "significant pending lawsuits or other legal proceedings". As such, 10-Ks and 10-Qs often hold valuable insights into a company's performance.
These insights, however, can be difficult to access. The average 10-K was 42,000 words long in 2013; put in perspective, that's roughly one-fifth of the length of Moby-Dick. Beyond the sheer length, dense language and lots of boilerplate can further obfuscate true meaning for many investors.
The good news? We might not need to read companies' 10-Ks and 10-Qs from cover to cover in order to derive value from the information they contain. Specifically, Lauren Cohen, Christopher Malloy and Quoc Nguyen argue in their recent paper that we can simply analyze textual changes in 10-Ks and 10-Qs to predict companies' future stock returns. For an overview of this paper from one of the authors, see the Lazy Prices interview from QuantCon 2018.
To understand how the dataset used in this post was created, be sure to see the Data Processing notebook.
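For intuition about the two similarity measures the dataset is built on, here is a minimal, self-contained sketch of Jaccard and cosine similarity on two toy filing excerpts. The actual scoring in the Data Processing notebook operates on full filing text with its own tokenization, so treat this as illustrative only:

```python
import math
from collections import Counter

def jaccard_similarity(doc_a, doc_b):
    # Jaccard: size of the intersection of the two word sets
    # divided by the size of their union.
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b)

def cosine_similarity(doc_a, doc_b):
    # Cosine: dot product of the term-frequency vectors divided
    # by the product of their Euclidean norms.
    ta, tb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(ta[w] * tb[w] for w in ta)
    norm_a = math.sqrt(sum(c * c for c in ta.values()))
    norm_b = math.sqrt(sum(c * c for c in tb.values()))
    return dot / (norm_a * norm_b)

old_filing = "the company faces risks from competition"
new_filing = "the company faces risks from competition and pending litigation"

print(jaccard_similarity(old_filing, new_filing))  # ~0.667
print(cosine_similarity(old_filing, new_filing))   # ~0.816
```

In both measures, a high score means the two filings are similar (a small text change), and a low score means the filing changed substantially.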
from quantopian.research import run_pipeline
from quantopian.pipeline import Pipeline
from quantopian.pipeline.filters import QTradableStocksUS
import alphalens
In this step, we import the data from a local .csv file via the Self-Serve Data feature.
To do this, we begin with the local .csv file generated by the Data Processing notebook. We then upload it under the Self-Serve Data tab on the Account > Data page; this makes it available for import into a research notebook or pipeline.
For more on importing data using Self-Serve, check out the examples in this forum post.
from quantopian.pipeline.data.user_5b102ae91141120040958556 import lazyprices3_90d
The data we uploaded is in a tabular form, with one row per asset per day. However, Alphalens requires that we provide data in a specific format with specific labels. Fortunately, Pipeline will do all the "dirty work" for us.
In this step, we'll use Pipeline to put our data in a form that can be ingested by Alphalens.
def make_pipeline():
    jaccard_score = lazyprices3_90d.jaccard_score.latest
    cosine_score = lazyprices3_90d.cosine_score.latest

    # Limit to tradable stocks that have both similarity scores.
    screen = (
        QTradableStocksUS()
        & jaccard_score.notnull()
        & cosine_score.notnull()
    )

    return Pipeline(
        columns={
            'jaccard_score': jaccard_score,
            'cosine_score': cosine_score,
        },
        screen=screen,
    )
data = run_pipeline(make_pipeline(), '2013-01-01', '2018-05-01')
Since an alpha factor is supposed to predict the returns of an asset, we'll need to get records of the actual price of the asset in order to examine the performance of our alpha factor. In this step, we get pricing data for the assets in our dataset.
# Get list of relevant assets
assets = data.index.levels[1]
# Get pricing data for those assets
pricing_end_date = '2018-08-01' # Pricing end date should be later so we can get forward returns
prices = get_pricing(
    assets,
    start_date='2013-01-01',
    end_date=pricing_end_date,
    fields='open_price',
)
Now that we have both our alpha factor and pricing datasets, we're ready to run our Alphalens study.
Since we have both Jaccard and cosine similarity scores, we'll run two separate Alphalens tearsheets.
Before creating a tearsheet, we'll use get_clean_factor_and_forward_returns to get our data into the format Alphalens expects.

A note on parameters:

The periods parameter in get_clean_factor_and_forward_returns sets the horizons (in days) over which we assess the performance of our alpha factor. Here, we'll include longer periods, since the original paper finds that the effect of text changes in filings plays out over months rather than days.

The quantiles parameter sets the number of bins into which we divide our assets based on their factor values. Since the original paper uses 5 quantiles to estimate portfolio performance, we'll also use 5 quantiles.
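To illustrate what quantiles=5 means, here is a small pandas sketch (with made-up similarity scores) of the quantile bucketing Alphalens performs; the real bucketing happens per day across the whole universe:

```python
import pandas as pd

# Hypothetical similarity scores for ten assets on one day.
scores = pd.Series([0.91, 0.85, 0.99, 0.72, 0.95, 0.88, 0.80, 0.97, 0.93, 0.76])

# Split the assets into 5 equal-sized bins by factor value. Quantile 1
# holds the smallest scores (largest text changes), quantile 5 the
# largest scores (smallest text changes).
quantile = pd.qcut(scores, 5, labels=False) + 1
print(quantile.tolist())  # [3, 2, 5, 1, 4, 3, 2, 5, 4, 1]
```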
jaccard_factor = data[['jaccard_score']]
factor_data_j1 = alphalens.utils.get_clean_factor_and_forward_returns(
    jaccard_factor,
    prices=prices,
    quantiles=5,
    periods=(1, 5, 10),
)
alphalens.tears.create_full_tear_sheet(factor_data_j1, by_group=False);
factor_data_j2 = alphalens.utils.get_clean_factor_and_forward_returns(
    jaccard_factor,
    prices=prices,
    quantiles=5,
    periods=(20, 40, 60),
)
alphalens.tears.create_full_tear_sheet(factor_data_j2, by_group=False);
factor_data_j3 = alphalens.utils.get_clean_factor_and_forward_returns(
    jaccard_factor,
    prices=prices,
    quantiles=5,
    periods=(30, 60, 90),
)
alphalens.tears.create_full_tear_sheet(factor_data_j3, by_group=False);
We'll put our cosine score factor through the same process.
cosine_factor = data[['cosine_score']]
factor_data_c1 = alphalens.utils.get_clean_factor_and_forward_returns(
    cosine_factor,
    prices=prices,
    quantiles=5,
    periods=(1, 5, 10),
)
alphalens.tears.create_full_tear_sheet(factor_data_c1, by_group=False);
factor_data_c2 = alphalens.utils.get_clean_factor_and_forward_returns(
    cosine_factor,
    prices=prices,
    quantiles=5,
    periods=(20, 40, 60),
)
alphalens.tears.create_full_tear_sheet(factor_data_c2, by_group=False);
factor_data_c3 = alphalens.utils.get_clean_factor_and_forward_returns(
    cosine_factor,
    prices=prices,
    quantiles=5,
    periods=(30, 60, 90),
)
alphalens.tears.create_full_tear_sheet(factor_data_c3, by_group=False);
A few notes about this tearsheet:
How does this compare to the paper's findings? The original paper found a spread of 31 bps in excess return between the 1st and 5th quantile for the cosine similarity score over a three-month holding period, and a spread of 53 bps for the Jaccard similarity score.
Keep in mind that the mean period-wise return calculated by Alphalens is a rate of return, so it's difficult to compare the Alphalens result directly with the original result. However, we do see a spread of roughly 20-50 bps between the top and bottom quantiles (depending on the factor and period), so our results seem broadly in line with the paper's findings.
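As a rough back-of-the-envelope conversion between the two conventions, a small per-day rate compounds over a quarter as shown below. The 0.8 bps/day figure is purely hypothetical and is not taken from the tearsheets above:

```python
# Alphalens reports the mean period-wise *rate* of return, while the
# paper reports cumulative spreads over a quarterly holding period.
# To compare the two, compound a daily-rate spread over a quarter's
# worth of trading days.
daily_rate_spread = 0.00008      # 0.8 bps per day (hypothetical)
trading_days = 63                # roughly one quarter

quarterly_spread = (1 + daily_rate_spread) ** trading_days - 1
print(round(quarterly_spread * 1e4, 1), 'bps per quarter')
```

Under this hypothetical rate, the compounded quarterly spread lands near 50 bps, which is the order of magnitude the paper reports.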
The next step? Put it in an algorithm and see how it performs in real-world conditions.