Major textual changes in 10-K and 10-Q filings over time predict significant decreases in future returns. We find alpha in shorting the companies with the largest textual changes in their filings and buying the companies with the smallest textual changes in their filings.
Publicly listed companies in the U.S. are required by law to file "10-K" and "10-Q" reports with the Securities and Exchange Commission (SEC). These reports provide both qualitative and quantitative descriptions of the company's performance, from revenue numbers to qualitative risk factors.
When companies file 10-Ks and 10-Qs, they are required to disclose certain pieces of information. For example, companies are required to report information about "significant pending lawsuits or other legal proceedings". As such, 10-Ks and 10-Qs often hold valuable insights into a company's performance.
These insights, however, can be difficult to access. The average 10-K was 42,000 words long in 2013; to put that in perspective, that's roughly one-fifth the length of Moby-Dick. Beyond the sheer length, dense language and heavy boilerplate can further obscure a filing's true meaning for many investors.
The good news? We might not need to read companies' 10-Ks and 10-Qs from cover to cover in order to derive value from the information they contain. Specifically, Lauren Cohen, Christopher Malloy, and Quoc Nguyen argue in their recent paper that we can simply analyze textual changes in 10-Ks and 10-Qs to predict companies' future stock returns.
In this investigation, we attempt to replicate their results on Quantopian.
(For an overview of this paper from Lauren Cohen himself, see the Lazy Prices interview from QuantCon 2018.)
Companies make major textual changes to their 10-Ks and 10-Qs when major things happen to their business. Thus, we expect that textual changes to 10-Ks and 10-Qs are a signal of future stock price movement.
Since the vast majority (86%) of textual changes have negative sentiment, we generally expect that major textual changes signal a decrease in stock price (Cohen et al. 2018).
Thus, we expect to find alpha by shorting companies with large textual changes in their 10-Ks and 10-Qs.
This notebook covers steps 1-3. For step 4, see the Alphalens study notebook.
This notebook is intended to be run locally (on your own computer), not within the Quantopian Research environment. We run it locally in order to generate the .csv file for upload into the Self-Serve Data feature.
In order to run this notebook, you will need Python 3 and the following packages installed: jupyter, pandas, numpy, requests, scikit-learn, BeautifulSoup (bs4), lxml, and tqdm.
All of these packages can be installed using conda or pip. For detailed installation instructions, see the installation documentation for each package.
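For example, assuming you use pip, a single command along these lines should install everything at once: pip install jupyter pandas numpy requests scikit-learn beautifulsoup4 lxml tqdm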
To run this notebook:
- Run cd in the command line to navigate to the directory containing the notebook file.
- Run jupyter notebook in the command line to start a Jupyter notebook session.
# Importing built-in libraries (no need to install these)
import re
import os
from time import gmtime, strftime
from datetime import datetime, timedelta
import unicodedata
# Importing libraries you need to install
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import requests
import bs4 as bs
from lxml import html
from tqdm import tqdm
We need to know what we want to scrape, so we'll begin by compiling a complete* list of U.S. stock tickers.
*for our purposes, "complete" = everything traded on NASDAQ, NYSE, or AMEX.
# Get lists of tickers from NASDAQ, NYSE, AMEX
nasdaq_tickers = pd.read_csv('https://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nasdaq&render=download')
nyse_tickers = pd.read_csv('https://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download')
amex_tickers = pd.read_csv('https://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=amex&render=download')
# Drop irrelevant cols
nasdaq_tickers.drop(labels='Unnamed: 8', axis='columns', inplace=True)
nyse_tickers.drop(labels='Unnamed: 8', axis='columns', inplace=True)
amex_tickers.drop(labels='Unnamed: 8', axis='columns', inplace=True)
# Create full list of tickers/names across all 3 exchanges
tickers = list(set(list(nasdaq_tickers['Symbol']) + list(nyse_tickers['Symbol']) + list(amex_tickers['Symbol'])))
Unfortunately, the SEC indexes company filings by its own internal identifier, the "Central Index Key" (CIK). We'll need to translate tickers into CIKs in order to search for company filings on EDGAR.
(The code below is an edited version of this gist.)
def MapTickerToCik(tickers):
url = 'http://www.sec.gov/cgi-bin/browse-edgar?CIK={}&Find=Search&owner=exclude&action=getcompany'
cik_re = re.compile(r'.*CIK=(\d{10}).*')
cik_dict = {}
for ticker in tqdm(tickers): # Use tqdm lib for progress bar
results = cik_re.findall(requests.get(url.format(ticker)).text)
if len(results):
cik_dict[str(ticker).lower()] = str(results[0])
return cik_dict
cik_dict = MapTickerToCik(tickers)
# Clean up the ticker-CIK mapping as a DataFrame
ticker_cik_df = pd.DataFrame.from_dict(data=cik_dict, orient='index')
ticker_cik_df.reset_index(inplace=True)
ticker_cik_df.columns = ['ticker', 'cik']
ticker_cik_df['cik'] = [str(cik) for cik in ticker_cik_df['cik']]
Our ultimate goal is to link each ticker to a unique CIK.
However, some CIKs might be linked to multiple tickers. For example, different share classes within the same company would all be linked to the same CIK. Let's get rid of these duplicate mappings.
# Check for duplicated tickers/CIKs
print("Number of ticker-cik pairings:", len(ticker_cik_df))
print("Number of unique tickers:", len(set(ticker_cik_df['ticker'])))
print("Number of unique CIKs:", len(set(ticker_cik_df['cik'])))
It looks like about 200 (4.5%) CIKs are linked to multiple tickers. To eliminate the duplicate mappings, we'll simply keep the ticker that comes first in the alphabet. In most cases, this means we'll keep the class A shares of the stock.
It's certainly possible to eliminate duplicates using other methods; for the sake of simplicity, we'll stick with alphabetizing for now. As long as we apply it uniformly across all stocks, it shouldn't introduce any bias.
# Keep first ticker alphabetically for duplicated CIKs
ticker_cik_df = ticker_cik_df.sort_values(by='ticker')
ticker_cik_df.drop_duplicates(subset='cik', keep='first', inplace=True)
# Check that we've eliminated duplicate tickers/CIKs
print("Number of ticker-cik pairings:", len(ticker_cik_df))
print("Number of unique tickers:", len(set(ticker_cik_df['ticker'])))
print("Number of unique CIKs:", len(set(ticker_cik_df['cik'])))
At this point, we have a list of the CIKs for which we want to obtain 10-Ks and 10-Qs. We can now begin scraping from EDGAR.
As with many web scraping projects, we'll need to keep some technical considerations in mind. In particular, we'll log all notes and error messages from each scraping session so that we can revisit any failed requests later:
def WriteLogFile(log_file_name, text):
'''
Helper function.
Writes a log file with all notes and
error messages from a scraping "session".
Parameters
----------
log_file_name : str
Name of the log file (should be a .txt file).
text : str
Text to write to the log file.
Returns
-------
None.
'''
with open(log_file_name, "a") as log_file:
log_file.write(text)
return
The function below scrapes all 10-Ks and 10-K405s for one particular CIK. Our web scraper primarily depends on the requests and BeautifulSoup libraries.
Note that the scraper creates a different directory for each CIK, and puts all the filings for that CIK within that directory. After scraping, your file structure should look like this:
- 10Ks
- CIK1
- 10K #1
- 10K #2
...
- CIK2
- 10K #1
- 10K #2
...
- CIK3
- 10K #1
- 10K #2
...
...
- 10Qs
- CIK1
- 10Q #1
- 10Q #2
...
- CIK2
- 10Q #1
- 10Q #2
...
- CIK3
- 10Q #1
- 10Q #2
...
...
The scraper will create the directory for each CIK. However, we first need to create separate parent directories to hold our 10-K and 10-Q files. The exact pathnames depend on your local setup, so you'll need to fill in the correct pathname_10k and pathname_10q for your machine.
pathname_10k = '< YOUR 10-K PATHNAME HERE>'
pathname_10q = '< YOUR 10-Q PATHNAME HERE>'
def Scrape10K(browse_url_base, filing_url_base, doc_url_base, cik, log_file_name):
'''
Scrapes all 10-Ks and 10-K405s for a particular
CIK from EDGAR.
Parameters
----------
browse_url_base : str
Base URL for browsing EDGAR.
filing_url_base : str
Base URL for filings listings on EDGAR.
doc_url_base : str
Base URL for one filing's document tables
page on EDGAR.
cik : str
Central Index Key.
log_file_name : str
Name of the log file (should be a .txt file).
Returns
-------
None.
'''
# Check if we've already scraped this CIK
try:
os.mkdir(cik)
except OSError:
print("Already scraped CIK", cik)
return
# If we haven't, go into the directory for that CIK
os.chdir(cik)
print('Scraping CIK', cik)
# Request list of 10-K filings
res = requests.get(browse_url_base % cik)
# If the request failed, log the failure and exit
if res.status_code != 200:
os.chdir('..')
os.rmdir(cik) # remove empty dir
text = "Request failed with error code " + str(res.status_code) + \
"\nFailed URL: " + (browse_url_base % cik) + '\n'
WriteLogFile(log_file_name, text)
return
# If the request doesn't fail, continue...
# Parse the response HTML using BeautifulSoup
soup = bs.BeautifulSoup(res.text, "lxml")
# Extract all tables from the response
html_tables = soup.find_all('table')
# Check that the table we're looking for exists
# If it doesn't, exit
if len(html_tables)<3:
os.chdir('..')
return
# Parse the Filings table
filings_table = pd.read_html(str(html_tables[2]), header=0)[0]
filings_table['Filings'] = [str(x) for x in filings_table['Filings']]
# Get only 10-K and 10-K405 document filings
filings_table = filings_table[(filings_table['Filings'] == '10-K') | (filings_table['Filings'] == '10-K405')]
# If filings table doesn't have any
# 10-Ks or 10-K405s, exit
if len(filings_table)==0:
os.chdir('..')
return
# Get accession number for each 10-K and 10-K405 filing
filings_table['Acc_No'] = [x.replace('\xa0',' ')
.split('Acc-no: ')[1]
.split(' ')[0] for x in filings_table['Description']]
# Iterate through each filing and
# scrape the corresponding document...
for index, row in filings_table.iterrows():
# Get the accession number for the filing
acc_no = str(row['Acc_No'])
# Navigate to the page for the filing
docs_page = requests.get(filing_url_base % (cik, acc_no))
# If request fails, log the failure
# and skip to the next filing
if docs_page.status_code != 200:
os.chdir('..')
text = "Request failed with error code " + str(docs_page.status_code) + \
"\nFailed URL: " + (filing_url_base % (cik, acc_no)) + '\n'
WriteLogFile(log_file_name, text)
os.chdir(cik)
continue
# If request succeeds, keep going...
# Parse the table of documents for the filing
docs_page_soup = bs.BeautifulSoup(docs_page.text, 'lxml')
docs_html_tables = docs_page_soup.find_all('table')
if len(docs_html_tables)==0:
continue
docs_table = pd.read_html(str(docs_html_tables[0]), header=0)[0]
docs_table['Type'] = [str(x) for x in docs_table['Type']]
# Get the 10-K and 10-K405 entries for the filing
docs_table = docs_table[(docs_table['Type'] == '10-K') | (docs_table['Type'] == '10-K405')]
# If there aren't any 10-K or 10-K405 entries,
# skip to the next filing
if len(docs_table)==0:
continue
# If there are 10-K or 10-K405 entries,
# grab the first document
elif len(docs_table)>0:
docs_table = docs_table.iloc[0]
docname = docs_table['Document']
# If that first entry is unavailable,
# log the failure and skip to the next filing
if str(docname) == 'nan':
os.chdir('..')
text = 'File with CIK: %s and Acc_No: %s is unavailable' % (cik, acc_no) + '\n'
WriteLogFile(log_file_name, text)
os.chdir(cik)
continue
# If it is available, continue...
# Request the file
file = requests.get(doc_url_base % (cik, acc_no.replace('-', ''), docname))
# If the request fails, log the failure and skip to the next filing
if file.status_code != 200:
os.chdir('..')
text = "Request failed with error code " + str(file.status_code) + \
"\nFailed URL: " + (doc_url_base % (cik, acc_no.replace('-', ''), docname)) + '\n'
WriteLogFile(log_file_name, text)
os.chdir(cik)
continue
# If it succeeds, keep going...
# Save the file in appropriate format
if '.txt' in docname:
# Save text as TXT
date = str(row['Filing Date'])
filename = cik + '_' + date + '.txt'
html_file = open(filename, 'a')
html_file.write(file.text)
html_file.close()
else:
# Save text as HTML
date = str(row['Filing Date'])
filename = cik + '_' + date + '.html'
html_file = open(filename, 'a')
html_file.write(file.text)
html_file.close()
# Move back to the main 10-K directory
os.chdir('..')
return
def Scrape10Q(browse_url_base, filing_url_base, doc_url_base, cik, log_file_name):
'''
Scrapes all 10-Qs for a particular CIK from EDGAR.
Parameters
----------
browse_url_base : str
Base URL for browsing EDGAR.
filing_url_base : str
Base URL for filings listings on EDGAR.
doc_url_base : str
Base URL for one filing's document tables
page on EDGAR.
cik : str
Central Index Key.
log_file_name : str
Name of the log file (should be a .txt file).
Returns
-------
None.
'''
# Check if we've already scraped this CIK
try:
os.mkdir(cik)
except OSError:
print("Already scraped CIK", cik)
return
# If we haven't, go into the directory for that CIK
os.chdir(cik)
print('Scraping CIK', cik)
# Request list of 10-Q filings
res = requests.get(browse_url_base % cik)
# If the request failed, log the failure and exit
if res.status_code != 200:
os.chdir('..')
os.rmdir(cik) # remove empty dir
text = "Request failed with error code " + str(res.status_code) + \
"\nFailed URL: " + (browse_url_base % cik) + '\n'
WriteLogFile(log_file_name, text)
return
# If the request doesn't fail, continue...
# Parse the response HTML using BeautifulSoup
soup = bs.BeautifulSoup(res.text, "lxml")
# Extract all tables from the response
html_tables = soup.find_all('table')
# Check that the table we're looking for exists
# If it doesn't, exit
if len(html_tables)<3:
print("table too short")
os.chdir('..')
return
# Parse the Filings table
filings_table = pd.read_html(str(html_tables[2]), header=0)[0]
filings_table['Filings'] = [str(x) for x in filings_table['Filings']]
# Get only 10-Q document filings
filings_table = filings_table[filings_table['Filings'] == '10-Q']
# If filings table doesn't have any
# 10-Qs, exit
if len(filings_table)==0:
os.chdir('..')
return
# Get accession number for each 10-Q filing
filings_table['Acc_No'] = [x.replace('\xa0',' ')
.split('Acc-no: ')[1]
.split(' ')[0] for x in filings_table['Description']]
# Iterate through each filing and
# scrape the corresponding document...
for index, row in filings_table.iterrows():
# Get the accession number for the filing
acc_no = str(row['Acc_No'])
# Navigate to the page for the filing
docs_page = requests.get(filing_url_base % (cik, acc_no))
# If request fails, log the failure
# and skip to the next filing
if docs_page.status_code != 200:
os.chdir('..')
text = "Request failed with error code " + str(docs_page.status_code) + \
"\nFailed URL: " + (filing_url_base % (cik, acc_no)) + '\n'
WriteLogFile(log_file_name, text)
os.chdir(cik)
continue
# If request succeeds, keep going...
# Parse the table of documents for the filing
docs_page_soup = bs.BeautifulSoup(docs_page.text, 'lxml')
docs_html_tables = docs_page_soup.find_all('table')
if len(docs_html_tables)==0:
continue
docs_table = pd.read_html(str(docs_html_tables[0]), header=0)[0]
docs_table['Type'] = [str(x) for x in docs_table['Type']]
# Get the 10-Q entries for the filing
docs_table = docs_table[docs_table['Type'] == '10-Q']
# If there aren't any 10-Q entries,
# skip to the next filing
if len(docs_table)==0:
continue
# If there are 10-Q entries,
# grab the first document
elif len(docs_table)>0:
docs_table = docs_table.iloc[0]
docname = docs_table['Document']
# If that first entry is unavailable,
# log the failure and skip to the next filing
if str(docname) == 'nan':
os.chdir('..')
text = 'File with CIK: %s and Acc_No: %s is unavailable' % (cik, acc_no) + '\n'
WriteLogFile(log_file_name, text)
os.chdir(cik)
continue
# If it is available, continue...
# Request the file
file = requests.get(doc_url_base % (cik, acc_no.replace('-', ''), docname))
# If the request fails, log the failure and skip to the next filing
if file.status_code != 200:
os.chdir('..')
text = "Request failed with error code " + str(file.status_code) + \
"\nFailed URL: " + (doc_url_base % (cik, acc_no.replace('-', ''), docname)) + '\n'
WriteLogFile(log_file_name, text)
os.chdir(cik)
continue
# If it succeeds, keep going...
# Save the file in appropriate format
if '.txt' in docname:
# Save text as TXT
date = str(row['Filing Date'])
filename = cik + '_' + date + '.txt'
html_file = open(filename, 'a')
html_file.write(file.text)
html_file.close()
else:
# Save text as HTML
date = str(row['Filing Date'])
filename = cik + '_' + date + '.html'
html_file = open(filename, 'a')
html_file.write(file.text)
html_file.close()
# Move back to the main 10-Q directory
os.chdir('..')
return
Now that we've defined our scraper functions, let's scrape.
(A note from the future: we're scraping a lot of data, which takes time and space. For reference, these functions ultimately scraped 170 GB of 10-Qs and 125 GB of 10-Ks; the scraping took roughly 20 hours total.)
# Run the function to scrape 10-Ks
# Define parameters
browse_url_base_10k = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=%s&type=10-K'
filing_url_base_10k = 'http://www.sec.gov/Archives/edgar/data/%s/%s-index.html'
doc_url_base_10k = 'http://www.sec.gov/Archives/edgar/data/%s/%s/%s'
# Set correct directory
os.chdir(pathname_10k)
# Initialize log file
# (log file name = the time we initiate scraping session)
time = strftime("%Y-%m-%d %Hh%Mm%Ss", gmtime())
log_file_name = 'log '+time+'.txt'
with open(log_file_name, 'a') as log_file:
log_file.close()
# Iterate over CIKs and scrape 10-Ks
for cik in tqdm(ticker_cik_df['cik']):
Scrape10K(browse_url_base=browse_url_base_10k,
filing_url_base=filing_url_base_10k,
doc_url_base=doc_url_base_10k,
cik=cik,
log_file_name=log_file_name)
# Run the function to scrape 10-Qs
# Define parameters
browse_url_base_10q = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=%s&type=10-Q&count=1000'
filing_url_base_10q = 'http://www.sec.gov/Archives/edgar/data/%s/%s-index.html'
doc_url_base_10q = 'http://www.sec.gov/Archives/edgar/data/%s/%s/%s'
# Set correct directory (fill this out yourself!)
os.chdir(pathname_10q)
# Initialize log file
# (log file name = the time we initiate scraping session)
time = strftime("%Y-%m-%d %Hh%Mm%Ss", gmtime())
log_file_name = 'log '+time+'.txt'
log_file = open(log_file_name, 'a')
log_file.close()
# Iterate over CIKs and scrape 10-Qs
for cik in tqdm(ticker_cik_df['cik']):
Scrape10Q(browse_url_base=browse_url_base_10q,
filing_url_base=filing_url_base_10q,
doc_url_base=doc_url_base_10q,
cik=cik,
log_file_name=log_file_name)
We now have 10-Ks and 10-Qs in HTML or plaintext (.txt) format for each CIK. Before computing our similarity scores, however, we need to clean the files up a bit.
As outlined in the paper, we will:
... remove all tables (if their numeric character content is greater than 15%), HTML tags, XBRL tables, exhibits, ASCII-encoded PDFs, graphics, XLS, and other binary files.
def RemoveNumericalTables(soup):
'''
Removes tables with >15% numerical characters.
Parameters
----------
soup : BeautifulSoup object
Parsed result from BeautifulSoup.
Returns
-------
soup : BeautifulSoup object
Parsed result from BeautifulSoup
with numerical tables removed.
'''
# Determines percentage of numerical characters
# in a table
def GetDigitPercentage(tablestring):
if len(tablestring)>0.0:
numbers = sum([char.isdigit() for char in tablestring])
length = len(tablestring)
return numbers/length
else:
return 1
# Evaluates numerical character % for each table
# and removes the table if the percentage is > 15%
[x.extract() for x in soup.find_all('table') if GetDigitPercentage(x.get_text())>0.15]
return soup
def RemoveTags(soup):
'''
Drops HTML tags, newlines and unicode text from
filing text.
Parameters
----------
soup : BeautifulSoup object
Parsed result from BeautifulSoup.
Returns
-------
text : str
Filing text.
'''
# Remove HTML tags with get_text
text = soup.get_text()
# Remove newline characters
text = text.replace('\n', ' ')
# Replace unicode characters with their
# "normal" representations
text = unicodedata.normalize('NFKD', text)
return text
def ConvertHTML(cik):
'''
Removes numerical tables, HTML tags,
newlines, unicode text, and XBRL tables.
Parameters
----------
cik : str
Central Index Key used to scrape files.
Returns
-------
None.
'''
# Look for files scraped for that CIK
try:
os.chdir(cik)
# ...if we didn't scrape any files for that CIK, exit
except FileNotFoundError:
print("Could not find directory for CIK", cik)
return
print("Parsing CIK %s..." % cik)
parsed = False # flag to tell if we've parsed anything
# Try to make a new directory within the CIK directory
# to store the text representations of the filings
try:
os.mkdir('rawtext')
# If it already exists, continue
# We can't exit at this point because we might be
# partially through parsing text files, so we need to continue
except OSError:
pass
# Get list of scraped files
# excluding hidden files and directories
file_list = [fname for fname in os.listdir() if not (fname.startswith('.') | os.path.isdir(fname))]
# Iterate over scraped files and clean
for filename in file_list:
# Check if file has already been cleaned
new_filename = filename.replace('.html', '.txt')
text_file_list = os.listdir('rawtext')
if new_filename in text_file_list:
continue
# If it hasn't been cleaned already, keep going...
# Clean file
with open(filename, 'r') as file:
parsed = True
soup = bs.BeautifulSoup(file.read(), "lxml")
soup = RemoveNumericalTables(soup)
text = RemoveTags(soup)
with open('rawtext/'+new_filename, 'w') as newfile:
newfile.write(text)
# If all files in the CIK directory have been parsed
# then log that
if parsed==False:
print("Already parsed CIK", cik)
os.chdir('..')
return
We can now apply this function to each of our 10-K and 10-Q files.
# For 10-Ks...
os.chdir(pathname_10k)
# Iterate over CIKs and clean HTML filings
for cik in tqdm(ticker_cik_df['cik']):
ConvertHTML(cik)
# For 10-Qs...
os.chdir(pathname_10q)
# Iterate over CIKs and clean HTML filings
for cik in tqdm(ticker_cik_df['cik']):
ConvertHTML(cik)
After running the two cells above, we have cleaned plaintext 10-K and 10-Q filings for each CIK. At this point, our file structure looks like this:
- 10Ks
- CIK1
- 10K #1
- 10K #2
...
- rawtext
- CIK2
- 10K #1
- 10K #2
...
- rawtext
- CIK3
- 10K #1
- 10K #2
...
- rawtext
...
- 10Qs
- CIK1
- 10Q #1
- 10Q #2
...
- rawtext
- CIK2
- 10Q #1
- 10Q #2
...
- rawtext
- CIK3
- 10Q #1
- 10Q #2
...
- rawtext
...
We can now begin computing our alpha factor (similarity scores).
We'll use cosine similarity and Jaccard similarity to compare documents.
(The original paper also uses two other, simpler similarity measures, but cosine and Jaccard appeared to result in the best alpha factor performance -- and are much less computationally intensive.)
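For reference, since we treat each document as a set of words and build binary indicator vectors over the combined vocabulary (as the functions below do), the two measures reduce to:

$$\text{CosineSim}(A, B) = \frac{|A \cap B|}{\sqrt{|A|\,|B|}}, \qquad \text{JaccardSim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ are the sets of words in documents A and B.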
def ComputeCosineSimilarity(words_A, words_B):
'''
Compute cosine similarity between document A and
document B.
Parameters
----------
words_A : set
Words in document A.
words_B : set
Words in document B
Returns
-------
cosine_score : float
Cosine similarity between document
A and document B.
'''
# Compile complete set of words in A or B
words = list(words_A.union(words_B))
# Determine which words are in A
vector_A = [1 if x in words_A else 0 for x in words]
# Determine which words are in B
vector_B = [1 if x in words_B else 0 for x in words]
# Compute cosine score using scikit-learn
array_A = np.array(vector_A).reshape(1, -1)
array_B = np.array(vector_B).reshape(1, -1)
cosine_score = cosine_similarity(array_A, array_B)[0,0]
return cosine_score
def ComputeJaccardSimilarity(words_A, words_B):
'''
Compute Jaccard similarity between document A and
document B.
Parameters
----------
words_A : set
Words in document A.
words_B : set
Words in document B
Returns
-------
jaccard_score : float
Jaccard similarity between document
A and document B.
'''
# Count number of words in both A and B
words_intersect = len(words_A.intersection(words_B))
# Count number of words in A or B
words_union = len(words_A.union(words_B))
# Compute Jaccard similarity score
jaccard_score = words_intersect / words_union
return jaccard_score
Before continuing, let's double-check that these functions are working properly.
The paper gives the following sample sentences to compare:
$D_A$: We expect demand to increase.
$D_B$: We expect worldwide demand to increase.
$D_C$: We expect weakness in sales.
As noted in the paper, the cosine similarity between $D_A$ and $D_B$ should be $0.91$, and the cosine similarity between $D_A$ and $D_C$ should be $0.40$.
Meanwhile, the Jaccard similarity between $D_A$ and $D_B$ should be $0.83$, and the Jaccard similarity between $D_A$ and $D_C$ should be $0.25$.
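As a quick arithmetic check using the set sizes from the sentences above: $|D_A| = 5$, $|D_B| = 6$, $|D_C| = 5$, $|D_A \cap D_B| = 5$, $|D_A \cup D_B| = 6$, $|D_A \cap D_C| = 2$, and $|D_A \cup D_C| = 8$, so

$$\frac{5}{\sqrt{5 \cdot 6}} \approx 0.91, \quad \frac{5}{6} \approx 0.83, \quad \frac{2}{\sqrt{5 \cdot 5}} = 0.40, \quad \frac{2}{8} = 0.25.$$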
Let's double-check that our functions return the correct results.
d_a = set(['we', 'expect', 'demand', 'to', 'increase'])
d_b = set(['we', 'expect', 'worldwide', 'demand', 'to', 'increase'])
d_c = set(['we', 'expect', 'weakness', 'in', 'sales'])
print("Cosine similarity between A and B:", ComputeCosineSimilarity(d_a, d_b))
print("Cosine similarity between A and C:", ComputeCosineSimilarity(d_a, d_c))
print("Jaccard similarity between A and B:", ComputeJaccardSimilarity(d_a, d_b))
print("Jaccard similarity between A and C:", ComputeJaccardSimilarity(d_a, d_c))
Everything looks good! Now, let's begin applying these similarity computations to the scraped 10-Ks and 10-Qs.
We'll start with 10-Qs. This is slightly tricky, because we want to compare each 10-Q to the 10-Q from the same quarter of the previous year.
Keep in mind that 10-Qs are filed three times per year, while 10-Ks are filed once per year. Thus, we should simply be able to order the 10-Qs by filing date, then compare each 10-Q to the filing three positions earlier in the list. For example, if our sorted list of 10-Q files was [10Q-1, 10Q-2, 10Q-3, 10Q-4, 10Q-5, 10Q-6, ...], then we would iterate over the list and compare 10Q-4 to 10Q-1, 10Q-5 to 10Q-2, and so on.
Unfortunately, filings aren't so clean in the real world. Companies sometimes skip 10-Q filings for a quarter (or more) for a variety of reasons. We don't want to compare 10-Qs from different fiscal quarters, since ordinary quarter-to-quarter differences would add misleading noise to our factor values.
As such, we'll take each 10-Q and look for 10-Qs that are dated between 345 and 385 days earlier. If one exists, we'll compute the similarity scores; if no such file exists, we'll report the scores as NaN.
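To make the matching rule concrete, here is a minimal sketch of the date test that the function below applies (within_one_year is a hypothetical helper for illustration, not part of the pipeline):

from datetime import datetime, timedelta

def within_one_year(date_A, date_B,
                    year_short=timedelta(days=345),
                    year_long=timedelta(days=385)):
    # True if date_B falls roughly one year (345-385 days) after date_A
    return (date_A + year_short) <= date_B <= (date_A + year_long)

# Example: a 10-Q filed 2017-05-10 matches one filed 363 days earlier on 2016-05-12
print(within_one_year(datetime(2016, 5, 12), datetime(2017, 5, 10)))  # True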
def ComputeSimilarityScores10Q(cik):
'''
Computes cosine and Jaccard similarity scores
over 10-Qs for a particular CIK.
Compares each 10-Q to the 10-Q from the same
quarter of the previous year.
Parameters
----------
cik: str
Central Index Key used to scrape and name
files.
Returns
-------
None.
'''
# Define how stringent we want to be about
# "previous year"
year_short = timedelta(345)
year_long = timedelta(385)
# Open directory that holds plain 10-Q textfiles
# for the CIK
os.chdir(cik+'/rawtext')
print("Parsing CIK %s..." % cik)
# Get list of files to compare
file_list = [fname for fname in os.listdir() if not
(fname.startswith('.') | os.path.isdir(fname))]
file_list.sort()
# Check if scores have already been calculated
try:
os.mkdir('../metrics')
# ... if they have already been calculated, exit
except OSError:
print("Already parsed CIK %s..." % cik)
os.chdir('../..')
return
# Check if enough files exist to compare
# ... if there aren't enough files, exit
if len(file_list) < 4:
print("No files to compare for CIK", cik)
os.chdir('../..')
return
# Initialize dataframe to hold similarity scores
dates = [x[-14:-4] for x in file_list]
cosine_score = [0]*len(dates)
jaccard_score = [0]*len(dates)
data = pd.DataFrame(columns={'cosine_score': cosine_score,
'jaccard_score': jaccard_score},
index=dates)
# Iterate over each quarter...
for j in range(3):
# Get text and date of earliest filing from that quarter
file_name_A = file_list[j]
with open(file_name_A, 'r') as file:
file_text_A = file.read()
date_A = datetime.strptime(file_name_A[-14:-4], '%Y-%m-%d')
# Iterate over the rest of the filings from that quarter...
for i in range(j+3, len(file_list), 3):
# Get name and date of the later file
file_name_B = file_list[i]
date_B = datetime.strptime(file_name_B[-14:-4], '%Y-%m-%d')
# If B was not filed within ~1 year after A...
if (date_B > (date_A + year_long)) or (date_B < (date_A + year_short)):
print(date_B.strftime('%Y-%m-%d'), "is not within a year of", date_A.strftime('%Y-%m-%d'))
# Record values as NaN
data.at[date_B.strftime('%Y-%m-%d'), 'cosine_score'] = 'NaN'
data.at[date_B.strftime('%Y-%m-%d'), 'jaccard_score'] = 'NaN'
# Pretend as if we found new date_A in the next year
date_A = date_A.replace(year=date_B.year)
# Move to next filing
continue
# If B was filed within ~1 year of A...
# Get file text
with open(file_name_B, 'r') as file:
file_text_B = file.read()
# Get sets of words in A, B
words_A = set(re.findall(r"[\w']+", file_text_A))
words_B = set(re.findall(r"[\w']+", file_text_B))
# Calculate similarity score
cosine_score = ComputeCosineSimilarity(words_A, words_B)
jaccard_score = ComputeJaccardSimilarity(words_A, words_B)
# Store value (indexing by the date of document B)
data.at[date_B.strftime('%Y-%m-%d'), 'cosine_score'] = cosine_score
data.at[date_B.strftime('%Y-%m-%d'), 'jaccard_score'] = jaccard_score
# Reset value for next loop
# Don't re-read files, for efficiency
file_text_A = file_text_B
date_A = date_B
# Save scores
os.chdir('../metrics')
data.to_csv(cik+'_sim_scores.csv', index=True)
os.chdir('../..')
Fortunately, 10-Ks are easier. Though there can still be time-jumps in 10-K filings, we don't mind as much if we're comparing a 10-K from 2006 to a 10-K from 2002. This is because we don't have to worry about non-substantive quarter-on-quarter differences as we would with 10-Qs. In fact, it might actually be better if our data reflects textual changes in 10-Ks across "absent" years.
def ComputeSimilarityScores10K(cik):
'''
Computes cosine and Jaccard similarity scores
over 10-Ks for a particular CIK.
Parameters
----------
cik: str
Central Index Key used to scrape and name
files.
Returns
-------
None.
'''
# Open the directory that holds plaintext
# filings for the CIK
os.chdir(cik+'/rawtext')
print("Parsing CIK %s..." % cik)
# Get list of files over which to compute scores
# excluding hidden files and directories
file_list = [fname for fname in os.listdir() if not
(fname.startswith('.') | os.path.isdir(fname))]
file_list.sort()
# Check if scores have already been calculated...
try:
os.mkdir('../metrics')
# ... if they have been, exit
except OSError:
print("Already parsed CIK %s..." % cik)
os.chdir('../..')
return
# Check if enough files exist to compute sim scores...
# If not, exit
if len(file_list) < 2:
print("No files to compare for CIK", cik)
os.chdir('../..')
return
# Initialize dataframe to store sim scores
dates = [x[-14:-4] for x in file_list]
cosine_score = [0]*len(dates)
jaccard_score = [0]*len(dates)
data = pd.DataFrame(columns={'cosine_score': cosine_score,
'jaccard_score': jaccard_score},
index=dates)
# Open first file
file_name_A = file_list[0]
with open(file_name_A, 'r') as file:
file_text_A = file.read()
# Iterate over each 10-K file...
for i in range(1, len(file_list)):
file_name_B = file_list[i]
# Get file text B
with open(file_name_B, 'r') as file:
file_text_B = file.read()
# Get set of words in A, B
words_A = set(re.findall(r"[\w']+", file_text_A))
words_B = set(re.findall(r"[\w']+", file_text_B))
# Calculate similarity scores
cosine_score = ComputeCosineSimilarity(words_A, words_B)
jaccard_score = ComputeJaccardSimilarity(words_A, words_B)
# Store score values
date_B = file_name_B[-14:-4]
data.at[date_B, 'cosine_score'] = cosine_score
data.at[date_B, 'jaccard_score'] = jaccard_score
# Reset value for next loop
# (We don't open the file again, for efficiency)
file_text_A = file_text_B
# Save scores
os.chdir('../metrics')
data.to_csv(cik+'_sim_scores.csv', index=True) # keep the date index so it can be merged later
os.chdir('../..')
Note that we store factor values according to the date of the later document (document B). This is because we want our data to be point-in-time; we want to store the factor values according to the date that we would have known about them in the past.
In this case, our values (similarity scores) depend on two things: the content of document A and the content of document B. We would have become aware of the text of document A at date_A, and we would have become aware of the text of document B at date_B.
Remember that A is stipulated to be the earlier document. Since A precedes B, we wouldn't know the factor values at date_A; the content of B would not yet be available. However, we would know the factor values at date_B -- at that point, both the content of A and the content of B would be available. As such, we'll store our values according to date_B. For example, a similarity score comparing a company's 2016 and 2017 10-Ks is stamped with the 2017 filing date.
(The above applies to both 10-Qs and 10-Ks.)
Let's go ahead and apply these functions to our stored 10-Qs and 10-Ks.
# Computing scores for 10-Qs...
os.chdir(pathname_10q)
for cik in tqdm(ticker_cik_df['cik']):
ComputeSimilarityScores10Q(cik)
# Computing scores for 10-Ks...
os.chdir(pathname_10k)
for cik in tqdm(ticker_cik_df['cik']):
ComputeSimilarityScores10K(cik)
After computing the similarity scores, our file structure looks like this:
- 10Ks
- CIK1
- 10K #1
- 10K #2
...
- rawtext
- metrics
- CIK2
- 10K #1
- 10K #2
...
- rawtext
- metrics
- CIK3
- 10K #1
- 10K #2
...
- rawtext
- metrics
...
- 10Qs
- CIK1
- 10Q #1
- 10Q #2
...
- rawtext
- metrics
- CIK2
- 10Q #1
- 10Q #2
...
- rawtext
- metrics
- CIK3
- 10Q #1
- 10Q #2
...
- rawtext
- metrics
...
The similarity scores for each CIK are stored in the metrics directory as a .csv file.
Now that we've scraped the data and computed the similarity scores, we're almost done. The final step is to format our data properly for upload to Self-Serve Data.
We'll begin by consolidating the .csv files in the 10-K and 10-Q directories into a single DataFrame for each CIK.
def GetData(cik, pathname_10k, pathname_10q, pathname_data):
'''
Consolidate 10-K and 10-Q data into a single dataframe
for a CIK.
Parameters
----------
cik : str
Central Index Key used to scrape and
store data.
pathname_10k : str
Path to directory holding 10-K files.
pathname_10q : str
Path to directory holding 10-Q files.
pathname_data : str
Path to directory holding newly
generated data files.
Returns
-------
None.
'''
# Flags to determine what data we have
data_10k = True
data_10q = True
print("Gathering data for CIK %s..." % cik)
file_name = ('%s_sim_scores_full.csv' % cik)
# Check if data has already been gathered...
os.chdir(pathname_data)
file_list = [fname for fname in os.listdir() if not fname.startswith('.')]
# ... if it has been, exit
if file_name in file_list:
print("Already gathered data for CIK", cik)
return
# Try to get 10-K data...
os.chdir(pathname_10k+'/%s/metrics' % cik)
try:
sim_scores_10k = pd.read_csv(cik+'_sim_scores.csv')
# ... if it doesn't exist, set 10-K flag to False
except FileNotFoundError:
print("No data to gather.")
data_10k = False
# Try to get 10-Q data...
os.chdir(pathname_10q+'/%s/metrics' % cik)
try:
sim_scores_10q = pd.read_csv(cik+'_sim_scores.csv')
# ... if it doesn't exist, set 10-Q flag to False
except FileNotFoundError:
print("No data to gather.")
data_10q = False
# Merge depending on available data...
# ... if there's no 10-K or 10-Q data, exit
if not (data_10k or data_10q):
return
# ... if there's no 10-Q data (but there is 10-K data),
# only use the 10-K data
if not data_10q:
sim_scores = sim_scores_10k
# ... if the opposite is true, only use 10-Q data
elif not data_10k:
sim_scores = sim_scores_10q
# ... if there's both 10-K and 10-Q data, merge
elif (data_10q and data_10k):
sim_scores = pd.concat([sim_scores_10k, sim_scores_10q],
axis='index')
# Rename date column
sim_scores.rename(columns={'Unnamed: 0': 'date'}, inplace=True)
# Set CIK column
sim_scores['cik'] = cik
# Save file in the data dir
os.chdir(pathname_data)
sim_scores.to_csv('%s_sim_scores_full.csv' % cik, index=False)
return
pathname_data = '< YOUR DATA PATHNAME HERE >' # Fill this out
for cik in tqdm(ticker_cik_df['cik']):
GetData(cik, pathname_10k, pathname_10q, pathname_data)
Now, we have a "data" directory that looks like this:
- data
- CIK1_sim_scores_full.csv
- CIK2_sim_scores_full.csv
...
Of course, we need to consolidate each CIK's data into a single dataset.
def MakeDataset(file_list, pathname_full_data):
'''
Consolidates CIK datasets into a
single dataset.
Parameters
----------
file_list : list
List of .csv files to merge.
pathname_full_data : str
Path to directory to store
full dataset.
Returns
-------
None.
'''
# Initialize dataframe to store results
data = pd.DataFrame(columns=['date', 'cosine_score', 'jaccard_score', 'cik'])
# Iterate over files and merge all together
for file_name in tqdm(file_list):
new_data = pd.read_csv(file_name)
data = data.append(new_data, sort=True)
# Store result
os.chdir(pathname_full_data)
data.to_csv('all_sim_scores.csv', index=False)
return
pathname_full_data = '< YOUR FULL DATA PATHNAME HERE >'
os.chdir(pathname_data)
file_list = [fname for fname in os.listdir() if not fname.startswith('.')]
MakeDataset(file_list, pathname_full_data)
The final step is to transform the data into a format appropriate for Self-Serve Data. This means that we want a dataset with one set of factor values per ticker per day; in other words, each day-ticker pair should have a cosine_score and a jaccard_score value.
In this step, we'll need to:
- Map each CIK back to its ticker using ticker_cik_df.
- Reshape the data so that there is one row per ticker per day (inserting NaNs for missing values).
sim_scores_full = pd.read_csv('all_sim_scores.csv')
# Cast CIKs as strings
sim_scores_full['cik'] = [str(x) for x in sim_scores_full['cik']]
# Merge to map tickers to CIKs
sim_scores_ticker = sim_scores_full.merge(ticker_cik_df, how='left', on='cik')
# Drop CIK column
sim_scores_ticker.drop(labels=['cik'], axis='columns', inplace=True)
# Drop NaN values
sim_scores_ticker.dropna(axis='index', how='any', subset=['jaccard_score', 'cosine_score'], inplace=True)
The sim_scores_ticker data has one row for each filing, listing the set of factor values, ticker, and date. However, some day-ticker pairs have no associated factor values. We need to manipulate this data so that we have one row per ticker per day.
To do this, we'll begin with an empty dataframe that contains one row per ticker per day. We'll then join this formatted empty dataframe (empty_data) with our actual data (sim_scores_ticker) in such a way that preserves all the rows of empty_data. We'll end up with a dataframe that contains all the data from sim_scores_ticker, with NaNs inserted for day-ticker pairs where we're missing values.
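As a toy illustration of this join pattern (hypothetical values, not pipeline data), a left join on the empty frame keeps every day-ticker row and fills in scores only where they exist:

import pandas as pd
import numpy as np

# Hypothetical mini example of the join pattern used below (toy values only)
empty = pd.DataFrame({'date': ['2016-01-04', '2016-01-05'],
                      'ticker': ['abc', 'abc'],
                      'cosine_score': [np.nan, np.nan]}).set_index(['date', 'ticker'])
actual = pd.DataFrame({'date': ['2016-01-05'],
                       'ticker': ['abc'],
                       'cosine_score': [0.97]}).set_index(['date', 'ticker'])
joined = empty.join(actual, how='left', lsuffix='_empty')
# 2016-01-04 keeps NaN; 2016-01-05 picks up 0.97 from the actual data
print(joined)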
def InitializeEmptyDataframe(start_date, end_date, tickers):
'''
Initializes an empty DataFrame with all correct indices
(1 entry/ticker/day)
Parameters
----------
start_date : datetime.datetime
Start date of dataframe.
end_date : datetime.datetime
End date of dataframe.
tickers : list
List of tickers.
'''
window_length_days = int((end_date - start_date).days)
date_list = [start_date+timedelta(days=x) for x in range(0, window_length_days)]
long_date_list = date_list * len(tickers)
long_date_list = [x.strftime('%Y-%m-%d') for x in long_date_list]
list.sort(long_date_list)
empty = pd.DataFrame(data={'date': long_date_list,
'ticker': tickers*len(date_list),
'jaccard_score': [np.nan]*len(tickers)*len(date_list),
'cosine_score': [np.nan]*len(tickers)*len(date_list)})
empty = empty.groupby(['date', 'ticker']).sum()
empty['jaccard_score'] = np.nan
empty['cosine_score'] = np.nan
return empty
# Initialize empty dataframe
start_date = datetime(2013, 1, 1)
end_date = datetime(2018, 1, 1)
tickers = list(set(sim_scores_ticker['ticker']))
empty_data = InitializeEmptyDataframe(start_date, end_date, tickers)
(Note that we set start_date = datetime(2013, 1, 1)
. This is because Self-Serve Data has a maximum file size of 300 MB; too large of a date range will exceed that maximum. The dataset spanning 2013-2018 is 250 MB.)
empty_data.head()
# Format sim_scores data for merging
sim_scores_formatted = sim_scores_ticker.dropna(axis='index', how='any', subset=['jaccard_score', 'cosine_score'])
sim_scores_formatted = sim_scores_formatted.groupby(['date', 'ticker']).agg('mean')
Note that we use the .agg('mean') aggregator. This means that we'll take any rows that match each ticker-day pair and average the factor values.
In most cases, there should only be one row per ticker-day pair; however, there are some cases where 10-Ks and 10-Qs are filed on the same day, thus creating the need for .agg('mean').
sim_scores_formatted.head()
formatted_data = empty_data.join(sim_scores_formatted,
how='left',
on=['date', 'ticker'],
lsuffix='_empty')
formatted_data.drop(labels=['cosine_score_empty', 'jaccard_score_empty'], axis='columns', inplace=True)
formatted_data.head()
Our final step is to forward-fill the values by one quarter (approximately 90 calendar days). First, let's sort the data by ticker, then by calendar day:
forward_filled_data = formatted_data.reset_index().sort_values(by=['ticker', 'date'])
forward_filled_data.head()
forward_filled_data.fillna(method='ffill', limit=90, inplace=True)
forward_filled_data.head()
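One design note: because the rows are sorted by ticker and then by date, a plain fillna fills straight down each column, so in principle the last scores for one ticker could spill into the leading empty rows of the next ticker. A per-ticker variant (a minimal sketch of an alternative under that assumption, not what was run above) would group by ticker before filling:

# Sketch: forward-fill within each ticker only, again limited to ~90 days
score_cols = ['cosine_score', 'jaccard_score']
per_ticker_filled = formatted_data.reset_index().sort_values(by=['ticker', 'date'])
per_ticker_filled[score_cols] = (
    per_ticker_filled.groupby('ticker')[score_cols].ffill(limit=90)
)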
We now have one row per ticker per day and we've forward-filled the values by 90 days (roughly one quarter), so we're ready to save our data as a .csv file and upload it to Self-Serve.
forward_filled_data.to_csv('lazy_prices_data.csv', index=False)
We're done creating our dataset! For the next steps in our analysis, see the Alphalens notebook.