Major textual changes in 10-K and 10-Q filings over time predict significant decreases in future returns. We find alpha in shorting the companies with the largest textual changes in their filings and buying the companies with the smallest textual changes in their filings.
Publicly listed companies in the U.S. are required by law to file "10-K" and "10-Q" reports with the Securities and Exchange Commission (SEC). These reports provide both qualitative and quantitative descriptions of the company's performance, from revenue numbers to qualitative risk factors.
When companies file 10-Ks and 10-Qs, they are required to disclose certain pieces of information. For example, companies are required to report information about "significant pending lawsuits or other legal proceedings". As such, 10-Ks and 10-Qs often hold valuable insights into a company's performance.
These insights, however, can be difficult to access. The average 10-K was 42,000 words long in 2013; to put that in perspective, that's roughly one-fifth the length of Moby-Dick. Beyond the sheer length, dense language and heavy boilerplate can further obscure a filing's true meaning for many investors.
The good news? We might not need to read companies' 10-Ks and 10-Qs from cover to cover in order to derive value from the information they contain. Specifically, Lauren Cohen, Christopher Malloy, and Quoc Nguyen argue in their recent paper that we can simply analyze textual changes in 10-Ks and 10-Qs to predict companies' future stock returns.
In this investigation, we attempt to replicate their results on Quantopian.
(For an overview of this paper from Lauren Cohen himself, see the Lazy Prices interview from QuantCon 2018.)
Companies make major textual changes to their 10-Ks and 10-Qs when major things happen to their business. Thus, we expect that textual changes to 10-Ks and 10-Qs are a signal of future stock price movement.
Since the vast majority (86%) of textual changes have negative sentiment, we generally expect that major textual changes signal a decrease in stock price (Cohen et al. 2018).
Thus, we expect to find alpha by shorting companies with large textual changes in their 10-Ks and 10-Qs.
This notebook covers steps 1-3. For step 4, see the Alphalens study notebook.
This notebook is intended to be run locally (on your own computer), not within the Quantopian Research environment. We run it locally in order to generate the .csv file for upload into the Self-Serve Data feature.
In order to run this notebook, you will need Python 3 and the following packages installed: jupyter, pandas, numpy, requests, scikit-learn, BeautifulSoup (bs4), lxml, and tqdm.
All of these packages can be installed using conda or pip. For detailed installation instructions, see the installation documentation for each package.
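For example, assuming you use pip, a single command along these lines should install everything at once: pip install jupyter pandas numpy requests scikit-learn beautifulsoup4 lxml tqdm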
To run this notebook:
- Run cd in the command line to navigate to the directory containing the notebook file.
- Run jupyter notebook in the command line to start a Jupyter notebook session.
# Importing built-in libraries (no need to install these)
import re
import os
from time import gmtime, strftime
from datetime import datetime, timedelta
import unicodedata
# Importing libraries you need to install
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import requests
import bs4 as bs
from lxml import html
from tqdm import tqdm
We need to know what we want to scrape, so we'll begin by compiling a complete* list of U.S. stock tickers.
*for our purposes, "complete" = everything traded on NASDAQ, NYSE, or AMEX.
# Get lists of tickers from NASDAQ, NYSE, AMEX
nasdaq_tickers = pd.read_csv('https://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nasdaq&render=download')
nyse_tickers = pd.read_csv('https://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download')
amex_tickers = pd.read_csv('https://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=amex&render=download')
# Drop irrelevant cols
nasdaq_tickers.drop(labels='Unnamed: 8', axis='columns', inplace=True)
nyse_tickers.drop(labels='Unnamed: 8', axis='columns', inplace=True)
amex_tickers.drop(labels='Unnamed: 8', axis='columns', inplace=True)
# Create full list of tickers/names across all 3 exchanges
tickers = list(set(list(nasdaq_tickers['Symbol']) + list(nyse_tickers['Symbol']) + list(amex_tickers['Symbol'])))
Unfortunately, the SEC indexes company filings by its own internal identifier, the "Central Index Key" (CIK). We'll need to translate tickers into CIKs in order to search for company filings on EDGAR.
(The code below is an edited version of this gist.)
def MapTickerToCik(tickers):
url = 'http://www.sec.gov/cgi-bin/browse-edgar?CIK={}&Find=Search&owner=exclude&action=getcompany'
cik_re = re.compile(r'.*CIK=(\d{10}).*')
cik_dict = {}
for ticker in tqdm(tickers): # Use tqdm lib for progress bar
results = cik_re.findall(requests.get(url.format(ticker)).text)
if len(results):
cik_dict[str(ticker).lower()] = str(results[0])
return cik_dict
cik_dict = MapTickerToCik(tickers)
# Clean up the ticker-CIK mapping as a DataFrame
ticker_cik_df = pd.DataFrame.from_dict(data=cik_dict, orient='index')
ticker_cik_df.reset_index(inplace=True)
ticker_cik_df.columns = ['ticker', 'cik']
ticker_cik_df['cik'] = [str(cik) for cik in ticker_cik_df['cik']]
Our ultimate goal is to link each ticker to a unique CIK.
However, some CIKs might be linked to multiple tickers. For example, different share classes within the same company would all be linked to the same CIK. Let's get rid of these duplicate mappings.
# Check for duplicated tickers/CIKs
print("Number of ticker-cik pairings:", len(ticker_cik_df))
print("Number of unique tickers:", len(set(ticker_cik_df['ticker'])))
print("Number of unique CIKs:", len(set(ticker_cik_df['cik'])))
It looks like about 200 (4.5%) CIKs are linked to multiple tickers. To eliminate the duplicate mappings, we'll simply keep the ticker that comes first in the alphabet. In most cases, this means we'll keep the class A shares of the stock.
It's certainly possible to eliminate duplicates using other methods; for the sake of simplicity, we'll stick with alphabetizing for now. As long as we apply it uniformly across all stocks, it shouldn't introduce any bias.
# Keep first ticker alphabetically for duplicated CIKs
ticker_cik_df = ticker_cik_df.sort_values(by='ticker')
ticker_cik_df.drop_duplicates(subset='cik', keep='first', inplace=True)
# Check that we've eliminated duplicate tickers/CIKs
print("Number of ticker-cik pairings:", len(ticker_cik_df))
print("Number of unique tickers:", len(set(ticker_cik_df['ticker'])))
print("Number of unique CIKs:", len(set(ticker_cik_df['cik'])))
At this point, we have a list of the CIKs for which we want to obtain 10-Ks and 10-Qs. We can now begin scraping from EDGAR.
As with many web scraping projects, we'll need to keep some technical considerations in mind. In particular, we'll log all notes and error messages from each scraping session so that we can revisit any failed requests later:
def WriteLogFile(log_file_name, text):
'''
Helper function.
Writes a log file with all notes and
error messages from a scraping "session".
Parameters
----------
log_file_name : str
Name of the log file (should be a .txt file).
text : str
Text to write to the log file.
Returns
-------
None.
'''
with open(log_file_name, "a") as log_file:
log_file.write(text)
return
The function below scrapes all 10-Ks and 10-K405s for one particular CIK. Our web scraper primarily depends on the requests and BeautifulSoup libraries.
Note that the scraper creates a different directory for each CIK, and puts all the filings for that CIK within that directory. After scraping, your file structure should look like this:
- 10Ks
- CIK1
- 10K #1
- 10K #2
...
- CIK2
- 10K #1
- 10K #2
...
- CIK3
- 10K #1
- 10K #2
...
...
- 10Qs
- CIK1
- 10Q #1
- 10Q #2
...
- CIK2
- 10Q #1
- 10Q #2
...
- CIK3
- 10Q #1
- 10Q #2
...
...
The scraper will create the directory for each CIK. However, we first need to create separate parent directories to hold our 10-K and 10-Q files. The exact pathnames depend on your local setup, so you'll need to fill in the correct pathname_10k and pathname_10q for your machine.
pathname_10k = '< YOUR 10-K PATHNAME HERE>'
pathname_10q = '< YOUR 10-Q PATHNAME HERE>'
def Scrape10K(browse_url_base, filing_url_base, doc_url_base, cik, log_file_name):
'''
Scrapes all 10-Ks and 10-K405s for a particular
CIK from EDGAR.
Parameters
----------
browse_url_base : str
Base URL for browsing EDGAR.
filing_url_base : str
Base URL for filings listings on EDGAR.
doc_url_base : str
Base URL for one filing's document tables
page on EDGAR.
cik : str
Central Index Key.
log_file_name : str
Name of the log file (should be a .txt file).
Returns
-------
None.
'''
# Check if we've already scraped this CIK
try:
os.mkdir(cik)
except OSError:
print("Already scraped CIK", cik)
return
# If we haven't, go into the directory for that CIK
os.chdir(cik)
print('Scraping CIK', cik)
# Request list of 10-K filings
res = requests.get(browse_url_base % cik)
# If the request failed, log the failure and exit
if res.status_code != 200:
os.chdir('..')
os.rmdir(cik) # remove empty dir
text = "Request failed with error code " + str(res.status_code) + \
"\nFailed URL: " + (browse_url_base % cik) + '\n'
WriteLogFile(log_file_name, text)
return
# If the request doesn't fail, continue...
# Parse the response HTML using BeautifulSoup
soup = bs.BeautifulSoup(res.text, "lxml")
# Extract all tables from the response
html_tables = soup.find_all('table')
# Check that the table we're looking for exists
# If it doesn't, exit
if len(html_tables)<3:
os.chdir('..')
return
# Parse the Filings table
filings_table = pd.read_html(str(html_tables[2]), header=0)[0]
filings_table['Filings'] = [str(x) for x in filings_table['Filings']]
# Get only 10-K and 10-K405 document filings
filings_table = filings_table[(filings_table['Filings'] == '10-K') | (filings_table['Filings'] == '10-K405')]
# If filings table doesn't have any
# 10-Ks or 10-K405s, exit
if len(filings_table)==0:
os.chdir('..')
return
# Get accession number for each 10-K and 10-K405 filing
filings_table['Acc_No'] = [x.replace('\xa0',' ')
.split('Acc-no: ')[1]
.split(' ')[0] for x in filings_table['Description']]
# Iterate through each filing and
# scrape the corresponding document...
for index, row in filings_table.iterrows():
# Get the accession number for the filing
acc_no = str(row['Acc_No'])
# Navigate to the page for the filing
docs_page = requests.get(filing_url_base % (cik, acc_no))
# If request fails, log the failure
# and skip to the next filing
if docs_page.status_code != 200:
os.chdir('..')
text = "Request failed with error code " + str(docs_page.status_code) + \
"\nFailed URL: " + (filing_url_base % (cik, acc_no)) + '\n'
WriteLogFile(log_file_name, text)
os.chdir(cik)
continue
# If request succeeds, keep going...
# Parse the table of documents for the filing
docs_page_soup = bs.BeautifulSoup(docs_page.text, 'lxml')
docs_html_tables = docs_page_soup.find_all('table')
if len(docs_html_tables)==0:
continue
docs_table = pd.read_html(str(docs_html_tables[0]), header=0)[0]
docs_table['Type'] = [str(x) for x in docs_table['Type']]
# Get the 10-K and 10-K405 entries for the filing
docs_table = docs_table[(docs_table['Type'] == '10-K') | (docs_table['Type'] == '10-K405')]
# If there aren't any 10-K or 10-K405 entries,
# skip to the next filing
if len(docs_table)==0:
continue
# If there are 10-K or 10-K405 entries,
# grab the first document
elif len(docs_table)>0:
docs_table = docs_table.iloc[0]
docname = docs_table['Document']
# If that first entry is unavailable,
# log the failure and skip to the next filing
if str(docname) == 'nan':
os.chdir('..')
text = 'File with CIK: %s and Acc_No: %s is unavailable' % (cik, acc_no) + '\n'
WriteLogFile(log_file_name, text)
os.chdir(cik)
continue
# If it is available, continue...
# Request the file
file = requests.get(doc_url_base % (cik, acc_no.replace('-', ''), docname))
# If the request fails, log the failure and skip to the next filing
if file.status_code != 200:
os.chdir('..')
text = "Request failed with error code " + str(file.status_code) + \
"\nFailed URL: " + (doc_url_base % (cik, acc_no.replace('-', ''), docname)) + '\n'
WriteLogFile(log_file_name, text)
os.chdir(cik)
continue
# If it succeeds, keep going...
# Save the file in appropriate format
if '.txt' in docname:
# Save text as TXT
date = str(row['Filing Date'])
filename = cik + '_' + date + '.txt'
html_file = open(filename, 'a')
html_file.write(file.text)
html_file.close()
else:
# Save text as HTML
date = str(row['Filing Date'])
filename = cik + '_' + date + '.html'
html_file = open(filename, 'a')
html_file.write(file.text)
html_file.close()
# Move back to the main 10-K directory
os.chdir('..')
return
def Scrape10Q(browse_url_base, filing_url_base, doc_url_base, cik, log_file_name):
'''
Scrapes all 10-Qs for a particular CIK from EDGAR.
Parameters
----------
browse_url_base : str
Base URL for browsing EDGAR.
filing_url_base : str
Base URL for filings listings on EDGAR.
doc_url_base : str
Base URL for one filing's document tables
page on EDGAR.
cik : str
Central Index Key.
log_file_name : str
Name of the log file (should be a .txt file).
Returns
-------
None.
'''
# Check if we've already scraped this CIK
try:
os.mkdir(cik)
except OSError:
print("Already scraped CIK", cik)
return
# If we haven't, go into the directory for that CIK
os.chdir(cik)
print('Scraping CIK', cik)
# Request list of 10-Q filings
res = requests.get(browse_url_base % cik)
# If the request failed, log the failure and exit
if res.status_code != 200:
os.chdir('..')
os.rmdir(cik) # remove empty dir
text = "Request failed with error code " + str(res.status_code) + \
"\nFailed URL: " + (browse_url_base % cik) + '\n'
WriteLogFile(log_file_name, text)
return
# If the request doesn't fail, continue...
# Parse the response HTML using BeautifulSoup
soup = bs.BeautifulSoup(res.text, "lxml")
# Extract all tables from the response
html_tables = soup.find_all('table')
# Check that the table we're looking for exists
# If it doesn't, exit
if len(html_tables)<3:
print("table too short")
os.chdir('..')
return
# Parse the Filings table
filings_table = pd.read_html(str(html_tables[2]), header=0)[0]
filings_table['Filings'] = [str(x) for x in filings_table['Filings']]
# Get only 10-Q document filings
filings_table = filings_table[filings_table['Filings'] == '10-Q']
# If filings table doesn't have any
# 10-Qs, exit
if len(filings_table)==0:
os.chdir('..')
return
# Get accession number for each 10-Q filing
filings_table['Acc_No'] = [x.replace('\xa0',' ')
.split('Acc-no: ')[1]
.split(' ')[0] for x in filings_table['Description']]
# Iterate through each filing and
# scrape the corresponding document...
for index, row in filings_table.iterrows():
# Get the accession number for the filing
acc_no = str(row['Acc_No'])
# Navigate to the page for the filing
docs_page = requests.get(filing_url_base % (cik, acc_no))
# If request fails, log the failure
# and skip to the next filing
if docs_page.status_code != 200:
os.chdir('..')
text = "Request failed with error code " + str(docs_page.status_code) + \
"\nFailed URL: " + (filing_url_base % (cik, acc_no)) + '\n'
WriteLogFile(log_file_name, text)
os.chdir(cik)
continue
# If request succeeds, keep going...
# Parse the table of documents for the filing
docs_page_soup = bs.BeautifulSoup(docs_page.text, 'lxml')
docs_html_tables = docs_page_soup.find_all('table')
if len(docs_html_tables)==0:
continue
docs_table = pd.read_html(str(docs_html_tables[0]), header=0)[0]
docs_table['Type'] = [str(x) for x in docs_table['Type']]
# Get the 10-Q entries for the filing
docs_table = docs_table[docs_table['Type'] == '10-Q']
# If there aren't any 10-Q entries,
# skip to the next filing
if len(docs_table)==0:
continue
# If there are 10-Q entries,
# grab the first document
elif len(docs_table)>0:
docs_table = docs_table.iloc[0]
docname = docs_table['Document']
# If that first entry is unavailable,
# log the failure and skip to the next filing
if str(docname) == 'nan':
os.chdir('..')
text = 'File with CIK: %s and Acc_No: %s is unavailable' % (cik, acc_no) + '\n'
WriteLogFile(log_file_name, text)
os.chdir(cik)
continue
# If it is available, continue...
# Request the file
file = requests.get(doc_url_base % (cik, acc_no.replace('-', ''), docname))
# If the request fails, log the failure and skip to the next filing
if file.status_code != 200:
os.chdir('..')
text = "Request failed with error code " + str(file.status_code) + \
"\nFailed URL: " + (doc_url_base % (cik, acc_no.replace('-', ''), docname)) + '\n'
WriteLogFile(log_file_name, text)
os.chdir(cik)
continue
# If it succeeds, keep going...
# Save the file in appropriate format
if '.txt' in docname:
# Save text as TXT
date = str(row['Filing Date'])
filename = cik + '_' + date + '.txt'
html_file = open(filename, 'a')
html_file.write(file.text)
html_file.close()
else:
# Save text as HTML
date = str(row['Filing Date'])
filename = cik + '_' + date + '.html'
html_file = open(filename, 'a')
html_file.write(file.text)
html_file.close()
# Move back to the main 10-Q directory
os.chdir('..')
return
Now that we've defined our scraper functions, let's scrape.
(A note from the future: we're scraping a lot of data, which takes time and space. For reference, these functions ultimately scraped 170 GB of 10-Qs and 125 GB of 10-Ks; the scraping took roughly 20 hours total.)
# Run the function to scrape 10-Ks
# Define parameters
browse_url_base_10k = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=%s&type=10-K'
filing_url_base_10k = 'http://www.sec.gov/Archives/edgar/data/%s/%s-index.html'
doc_url_base_10k = 'http://www.sec.gov/Archives/edgar/data/%s/%s/%s'
# Set correct directory
os.chdir(pathname_10k)
# Initialize log file
# (log file name = the time we initiate scraping session)
time = strftime("%Y-%m-%d %Hh%Mm%Ss", gmtime())
log_file_name = 'log '+time+'.txt'
with open(log_file_name, 'a') as log_file:
log_file.close()
# Iterate over CIKs and scrape 10-Ks
for cik in tqdm(ticker_cik_df['cik']):
Scrape10K(browse_url_base=browse_url_base_10k,
filing_url_base=filing_url_base_10k,
doc_url_base=doc_url_base_10k,
cik=cik,
log_file_name=log_file_name)
# Run the function to scrape 10-Qs
# Define parameters
browse_url_base_10q = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=%s&type=10-Q&count=1000'
filing_url_base_10q = 'http://www.sec.gov/Archives/edgar/data/%s/%s-index.html'
doc_url_base_10q = 'http://www.sec.gov/Archives/edgar/data/%s/%s/%s'
# Set correct directory (fill this out yourself!)
os.chdir(pathname_10q)
# Initialize log file
# (log file name = the time we initiate scraping session)
time = strftime("%Y-%m-%d %Hh%Mm%Ss", gmtime())
log_file_name = 'log '+time+'.txt'
log_file = open(log_file_name, 'a')
log_file.close()
# Iterate over CIKs and scrape 10-Qs
for cik in tqdm(ticker_cik_df['cik']):
Scrape10Q(browse_url_base=browse_url_base_10q,
filing_url_base=filing_url_base_10q,
doc_url_base=doc_url_base_10q,
cik=cik,
log_file_name=log_file_name)
We now have 10-Ks and 10-Qs in HTML or plaintext (.txt) format for each CIK. Before computing our similarity scores, however, we need to clean the files up a bit.
As outlined in the paper, we will:
... remove all tables (if their numeric character content is greater than 15%), HTML tags, XBRL tables, exhibits, ASCII-encoded PDFs, graphics, XLS, and other binary files.
def RemoveNumericalTables(soup):
'''
Removes tables with >15% numerical characters.
Parameters
----------
soup : BeautifulSoup object
Parsed result from BeautifulSoup.
Returns
-------
soup : BeautifulSoup object
Parsed result from BeautifulSoup
with numerical tables removed.
'''
# Determines percentage of numerical characters
# in a table
def GetDigitPercentage(tablestring):
if len(tablestring)>0.0:
numbers = sum([char.isdigit() for char in tablestring])
length = len(tablestring)
return numbers/length
else:
return 1
# Evaluates numerical character % for each table
# and removes the table if the percentage is > 15%
[x.extract() for x in soup.find_all('table') if GetDigitPercentage(x.get_text())>0.15]
return soup
def RemoveTags(soup):
'''
Drops HTML tags, newlines and unicode text from
filing text.
Parameters
----------
soup : BeautifulSoup object
Parsed result from BeautifulSoup.
Returns
-------
text : str
Filing text.
'''
# Remove HTML tags with get_text
text = soup.get_text()
# Remove newline characters
text = text.replace('\n', ' ')
# Replace unicode characters with their
# "normal" representations
text = unicodedata.normalize('NFKD', text)
return text
def ConvertHTML(cik):
'''
Removes numerical tables, HTML tags,
newlines, unicode text, and XBRL tables.
Parameters
----------
cik : str
Central Index Key used to scrape files.
Returns
-------
None.
'''
# Look for files scraped for that CIK
try:
os.chdir(cik)
# ...if we didn't scrape any files for that CIK, exit
except FileNotFoundError:
print("Could not find directory for CIK", cik)
return
print("Parsing CIK %s..." % cik)
parsed = False # flag to tell if we've parsed anything
# Try to make a new directory within the CIK directory
# to store the text representations of the filings
try:
os.mkdir('rawtext')
# If it already exists, continue
# We can't exit at this point because we might be
# partially through parsing text files, so we need to continue
except OSError:
pass
# Get list of scraped files
# excluding hidden files and directories
file_list = [fname for fname in os.listdir() if not (fname.startswith('.') | os.path.isdir(fname))]
# Iterate over scraped files and clean
for filename in file_list:
# Check if file has already been cleaned
new_filename = filename.replace('.html', '.txt')
text_file_list = os.listdir('rawtext')
if new_filename in text_file_list:
continue
# If it hasn't been cleaned already, keep going...
# Clean file
with open(filename, 'r') as file:
parsed = True
soup = bs.BeautifulSoup(file.read(), "lxml")
soup = RemoveNumericalTables(soup)
text = RemoveTags(soup)
with open('rawtext/'+new_filename, 'w') as newfile:
newfile.write(text)
# If all files in the CIK directory have been parsed
# then log that
if parsed==False:
print("Already parsed CIK", cik)
os.chdir('..')
return
We can now apply this function to each of our 10-K and 10-Q files.
# For 10-Ks...
os.chdir(pathname_10k)
# Iterate over CIKs and clean HTML filings
for cik in tqdm(ticker_cik_df['cik']):
ConvertHTML(cik)
# For 10-Qs...
os.chdir(pathname_10q)
# Iterate over CIKs and clean HTML filings
for cik in tqdm(ticker_cik_df['cik']):
ConvertHTML(cik)
After running the two cells above, we have cleaned plaintext 10-K and 10-Q filings for each CIK. At this point, our file structure looks like this:
- 10Ks
- CIK1
- 10K #1
- 10K #2
...
- rawtext
- CIK2
- 10K #1
- 10K #2
...
- rawtext
- CIK3
- 10K #1
- 10K #2
...
- rawtext
...
- 10Qs
- CIK1
- 10Q #1
- 10Q #2
...
- rawtext
- CIK2
- 10Q #1
- 10Q #2
...
- rawtext
- CIK3
- 10Q #1
- 10Q #2
...
- rawtext
...
We can now begin computing our alpha factor (similarity scores).
We'll use cosine similarity and Jaccard similarity to compare documents.
(The original paper also uses two other, simpler similarity measures, but cosine and Jaccard appeared to result in the best alpha factor performance -- and are much less computationally intensive.)
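For reference, since we treat each document as a set of words and build binary indicator vectors over the combined vocabulary (as the functions below do), the two measures reduce to:

$$\text{CosineSim}(A, B) = \frac{|A \cap B|}{\sqrt{|A|\,|B|}}, \qquad \text{JaccardSim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ are the sets of words in documents A and B.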
def ComputeCosineSimilarity(words_A, words_B):
'''
Compute cosine similarity between document A and
document B.
Parameters
----------
words_A : set
Words in document A.
words_B : set
Words in document B
Returns
-------
cosine_score : float
Cosine similarity between document
A and document B.
'''
# Compile complete set of words in A or B
words = list(words_A.union(words_B))
# Determine which words are in A
vector_A = [1 if x in words_A else 0 for x in words]
# Determine which words are in B
vector_B = [1 if x in words_B else 0 for x in words]
# Compute cosine score using scikit-learn
array_A = np.array(vector_A).reshape(1, -1)
array_B = np.array(vector_B).reshape(1, -1)
cosine_score = cosine_similarity(array_A, array_B)[0,0]
return cosine_score
def ComputeJaccardSimilarity(words_A, words_B):
'''
Compute Jaccard similarity between document A and
document B.
Parameters
----------
words_A : set
Words in document A.
words_B : set
Words in document B
Returns
-------
jaccard_score : float
Jaccard similarity between document
A and document B.
'''
# Count number of words in both A and B
words_intersect = len(words_A.intersection(words_B))
# Count number of words in A or B
words_union = len(words_A.union(words_B))
# Compute Jaccard similarity score
jaccard_score = words_intersect / words_union
return jaccard_score
Before continuing, let's double-check that these functions are working properly.
The paper gives the following sample sentences to compare:
$D_A$: We expect demand to increase.
$D_B$: We expect worldwide demand to increase.
$D_C$: We expect weakness in sales.
As noted in the paper, the cosine similarity between $D_A$ and $D_B$ should be $0.91$, and the cosine similarity between $D_A$ and $D_C$ should be $0.40$.
Meanwhile, the Jaccard similarity between $D_A$ and $D_B$ should be $0.83$, and the Jaccard similarity between $D_A$ and $D_C$ should be $0.25$.
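As a quick arithmetic check using the set sizes from the sentences above: $|D_A| = 5$, $|D_B| = 6$, $|D_C| = 5$, $|D_A \cap D_B| = 5$, $|D_A \cup D_B| = 6$, $|D_A \cap D_C| = 2$, and $|D_A \cup D_C| = 8$, so

$$\frac{5}{\sqrt{5 \cdot 6}} \approx 0.91, \quad \frac{5}{6} \approx 0.83, \quad \frac{2}{\sqrt{5 \cdot 5}} = 0.40, \quad \frac{2}{8} = 0.25.$$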
Let's double-check that our functions return the correct results.
d_a = set(['we', 'expect', 'demand', 'to', 'increase'])
d_b = set(['we', 'expect', 'worldwide', 'demand', 'to', 'increase'])
d_c = set(['we', 'expect', 'weakness', 'in', 'sales'])
print("Cosine similarity between A and B:", ComputeCosineSimilarity(d_a, d_b))
print("Cosine similarity between A and C:", ComputeCosineSimilarity(d_a, d_c))
print("Jaccard similarity between A and B:", ComputeJaccardSimilarity(d_a, d_b))
print("Jaccard similarity between A and C:", ComputeJaccardSimilarity(d_a, d_c))
Everything looks good! Now, let's begin applying these similarity computations to the scraped 10-Ks and 10-Qs.
We'll start with 10-Qs. This is slightly tricky, because we want to compare each 10-Q to the 10-Q from the same quarter of the previous year.
Keep in mind that 10-Qs are filed three times per year, while 10-Ks are filed once per year. Thus, we should simply be able to order the 10-Qs by filing date, then compare each 10-Q to the filing three positions earlier in the list. For example, if our sorted list of 10-Q files was [10Q-1, 10Q-2, 10Q-3, 10Q-4, 10Q-5, 10Q-6, ...], then we would iterate over the list and compare 10Q-4 to 10Q-1, 10Q-5 to 10Q-2, and so on.
Unfortunately, filings aren't so clean in the real world. Companies sometimes skip 10-Q filings for a quarter (or more) for a variety of reasons. We don't want to compare 10-Qs from different fiscal quarters, since ordinary quarter-to-quarter differences would add misleading noise to our factor values.
As such, we'll take each 10-Q and look for 10-Qs that are dated between 345 and 385 days earlier. If one exists, we'll compute the similarity scores; if no such file exists, we'll report the scores as NaN.
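To make the matching rule concrete, here is a minimal sketch of the date test that the function below applies (within_one_year is a hypothetical helper for illustration, not part of the pipeline):

from datetime import datetime, timedelta

def within_one_year(date_A, date_B,
                    year_short=timedelta(days=345),
                    year_long=timedelta(days=385)):
    # True if date_B falls roughly one year (345-385 days) after date_A
    return (date_A + year_short) <= date_B <= (date_A + year_long)

# Example: a 10-Q filed 2017-05-10 matches one filed 363 days earlier on 2016-05-12
print(within_one_year(datetime(2016, 5, 12), datetime(2017, 5, 10)))  # True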
def ComputeSimilarityScores10Q(cik):
'''
Computes cosine and Jaccard similarity scores
over 10-Qs for a particular CIK.
Compares each 10-Q to the 10-Q from the same
quarter of the previous year.
Parameters
----------
cik: str
Central Index Key used to scrape and name
files.
Returns
-------
None.
'''
# Define how stringent we want to be about
# "previous year"
year_short = timedelta(345)
year_long = timedelta(385)
# Open directory that holds plain 10-Q textfiles
# for the CIK
os.chdir(cik+'/rawtext')
print("Parsing CIK %s..." % cik)
# Get list of files to compare
file_list = [fname for fname in os.listdir() if not
(fname.startswith('.') | os.path.isdir(fname))]
file_list.sort()
# Check if scores have already been calculated
try:
os.mkdir('../metrics')
# ... if they have already been calculated, exit
except OSError:
print("Already parsed CIK %s..." % cik)
os.chdir('../..')
return
# Check if enough files exist to compare
# ... if there aren't enough files, exit
if len(file_list) < 4:
print("No files to compare for CIK", cik)
os.chdir('../..')
return
# Initialize dataframe to hold similarity scores
dates = [x[-14:-4] for x in file_list]
cosine_score = [0]*len(dates)
jaccard_score = [0]*len(dates)
data = pd.DataFrame(columns={'cosine_score': cosine_score,
'jaccard_score': jaccard_score},
index=dates)
# Iterate over each quarter...
for j in range(3):
# Get text and date of earliest filing from that quarter
file_name_A = file_list[j]
with open(file_name_A, 'r') as file:
file_text_A = file.read()
date_A = datetime.strptime(file_name_A[-14:-4], '%Y-%m-%d')
# Iterate over the rest of the filings from that quarter...
for i in range(j+3, len(file_list), 3):
# Get name and date of the later file
file_name_B = file_list[i]
date_B = datetime.strptime(file_name_B[-14:-4], '%Y-%m-%d')
# If B was not filed within ~1 year after A...
if (date_B > (date_A + year_long)) or (date_B < (date_A + year_short)):
print(date_B.strftime('%Y-%m-%d'), "is not within a year of", date_A.strftime('%Y-%m-%d'))
# Record values as NaN
data.at[date_B.strftime('%Y-%m-%d'), 'cosine_score'] = 'NaN'
data.at[date_B.strftime('%Y-%m-%d'), 'jaccard_score'] = 'NaN'
# Pretend as if we found new date_A in the next year
date_A = date_A.replace(year=date_B.year)
# Move to next filing
continue
# If B was filed within ~1 year of A...
# Get file text
with open(file_name_B, 'r') as file:
file_text_B = file.read()
# Get sets of words in A, B
words_A = set(re.findall(r"[\w']+", file_text_A))
words_B = set(re.findall(r"[\w']+", file_text_B))
# Calculate similarity score
cosine_score = ComputeCosineSimilarity(words_A, words_B)
jaccard_score = ComputeJaccardSimilarity(words_A, words_B)
# Store value (indexing by the date of document B)
data.at[date_B.strftime('%Y-%m-%d'), 'cosine_score'] = cosine_score
data.at[date_B.strftime('%Y-%m-%d'), 'jaccard_score'] = jaccard_score
# Reset value for next loop
# Don't re-read files, for efficiency
file_text_A = file_text_B
date_A = date_B
# Save scores
os.chdir('../metrics')
data.to_csv(cik+'_sim_scores.csv', index=True)
os.chdir('../..')
Fortunately, 10-Ks are easier. Though there can still be time-jumps in 10-K filings, we don't mind as much if we're comparing a 10-K from 2006 to a 10-K from 2002. This is because we don't have to worry about non-substantive quarter-on-quarter differences as we would with 10-Qs. In fact, it might actually be better if our data reflects textual changes in 10-Ks across "absent" years.
def ComputeSimilarityScores10K(cik):
'''
Computes cosine and Jaccard similarity scores
over 10-Ks for a particular CIK.
Parameters
----------
cik: str
Central Index Key used to scrape and name
files.
Returns
-------
None.
'''
# Open the directory that holds plaintext
# filings for the CIK
os.chdir(cik+'/rawtext')
print("Parsing CIK %s..." % cik)
# Get list of files over which to compute scores
# excluding hidden files and directories
file_list = [fname for fname in os.listdir() if not
(fname.startswith('.') | os.path.isdir(fname))]
file_list.sort()
# Check if scores have already been calculated...
try:
os.mkdir('../metrics')
# ... if they have been, exit
except OSError:
print("Already parsed CIK %s..." % cik)
os.chdir('../..')
return
# Check if enough files exist to compute sim scores...
# If not, exit
if len(file_list) < 2:
print("No files to compare for CIK", cik)
os.chdir('../..')
return
# Initialize dataframe to store sim scores
dates = [x[-14:-4] for x in file_list]
cosine_score = [0]*len(dates)
jaccard_score = [0]*len(dates)
data = pd.DataFrame(columns={'cosine_score': cosine_score,
'jaccard_score': jaccard_score},
index=dates)
# Open first file
file_name_A = file_list[0]
with open(file_name_A, 'r') as file:
file_text_A = file.read()
# Iterate over each 10-K file...
for i in range(1, len(file_list)):
file_name_B = file_list[i]
# Get file text B
with open(file_name_B, 'r') as file:
file_text_B = file.read()
# Get set of words in A, B
words_A = set(re.findall(r"[\w']+", file_text_A))
words_B = set(re.findall(r"[\w']+", file_text_B))
# Calculate similarity scores
cosine_score = ComputeCosineSimilarity(words_A, words_B)
jaccard_score = ComputeJaccardSimilarity(words_A, words_B)
# Store score values
date_B = file_name_B[-14:-4]
data.at[date_B, 'cosine_score'] = cosine_score
data.at[date_B, 'jaccard_score'] = jaccard_score
# Reset value for next loop
# (We don't open the file again, for efficiency)
file_text_A = file_text_B
# Save scores
os.chdir('../metrics')
data.to_csv(cik+'_sim_scores.csv', index=True) # keep the date index so it can be merged later
os.chdir('../..')
Note that we store factor values according to the date of the later document (document B). This is because we want our data to be point-in-time; we want to store the factor values according to the date that we would have known about them in the past.
In this case, our values (similarity scores) depend on two things: the content of document A and the content of document B. We would have become aware of the text of document A at date_A, and we would have become aware of the text of document B at date_B.
Remember that A is stipulated to be the earlier document. Since A precedes B, we wouldn't know the factor values at date_A; the content of B would not yet be available. However, we would know the factor values at date_B -- at that point, both the content of A and the content of B would be available. As such, we'll store our values according to date_B. For example, a similarity score comparing a company's 2016 and 2017 10-Ks is stamped with the 2017 filing date.
(The above applies to both 10-Qs and 10-Ks.)
Let's go ahead and apply these functions to our stored 10-Qs and 10-Ks.
# Computing scores for 10-Qs...
os.chdir(pathname_10q)
for cik in tqdm(ticker_cik_df['cik']):
ComputeSimilarityScores10Q(cik)
# Computing scores for 10-Ks...
os.chdir(pathname_10k)
for cik in tqdm(ticker_cik_df['cik']):
ComputeSimilarityScores10K(cik)
After computing the similarity scores, our file structure looks like this:
- 10Ks
- CIK1
- 10K #1
- 10K #2
...
- rawtext
- metrics
- CIK2
- 10K #1
- 10K #2
...
- rawtext
- metrics
- CIK3
- 10K #1
- 10K #2
...
- rawtext
- metrics
...
- 10Qs
- CIK1
- 10Q #1
- 10Q #2
...
- rawtext
- metrics
- CIK2
- 10Q #1
- 10Q #2
...
- rawtext
- metrics
- CIK3
- 10Q #1
- 10Q #2
...
- rawtext
- metrics
...
The similarity scores for each CIK are stored in the metrics directory as a .csv file.
Now that we've scraped the data and computed the similarity scores, we're almost done. The final step is to format our data properly for upload to Self-Serve Data.
We'll begin by consolidating the .csv files in the 10-K and 10-Q directories into a single DataFrame for each CIK.
def GetData(cik, pathname_10k, pathname_10q, pathname_data):
'''
Consolidate 10-K and 10-Q data into a single dataframe
for a CIK.
Parameters
----------
cik : str
Central Index Key used to scrape and
store data.
pathname_10k : str
Path to directory holding 10-K files.
pathname_10q : str
Path to directory holding 10-Q files.
pathname_data : str
Path to directory holding newly
generated data files.
Returns
-------
None.
'''
# Flags to determine what data we have
data_10k = True
data_10q = True
print("Gathering data for CIK %s..." % cik)
file_name = ('%s_sim_scores_full.csv' % cik)
# Check if data has already been gathered...
os.chdir(pathname_data)
file_list = [fname for fname in os.listdir() if not fname.startswith('.')]
# ... if it has been, exit
if file_name in file_list:
print("Already gathered data for CIK", cik)
return
# Try to get 10-K data...
os.chdir(pathname_10k+'/%s/metrics' % cik)
try:
sim_scores_10k = pd.read_csv(cik+'_sim_scores.csv')
# ... if it doesn't exist, set 10-K flag to False
except FileNotFoundError:
print("No data to gather.")
data_10k = False
# Try to get 10-Q data...
os.chdir(pathname_10q+'/%s/metrics' % cik)
try:
sim_scores_10q = pd.read_csv(cik+'_sim_scores.csv')
# ... if it doesn't exist, set 10-Q flag to False
except FileNotFoundError:
print("No data to gather.")
data_10q = False
# Merge depending on available data...
# ... if there's no 10-K or 10-Q data, exit
if not (data_10k or data_10q):
return
# ... if there's no 10-Q data (but there is 10-K data),
# only use the 10-K data
if not data_10q:
sim_scores = sim_scores_10k
# ... if the opposite is true, only use 10-Q data
elif not data_10k:
sim_scores = sim_scores_10q
# ... if there's both 10-K and 10-Q data, merge
elif (data_10q and data_10k):
sim_scores = pd.concat([sim_scores_10k, sim_scores_10q],
axis='index')
# Rename date column
sim_scores.rename(columns={'Unnamed: 0': 'date'}, inplace=True)
# Set CIK column
sim_scores['cik'] = cik
# Save file in the data dir
os.chdir(pathname_data)
sim_scores.to_csv('%s_sim_scores_full.csv' % cik, index=False)
return
pathname_data = '< YOUR DATA PATHNAME HERE >' # Fill this out
for cik in tqdm(ticker_cik_df['cik']):
GetData(cik, pathname_10k, pathname_10q, pathname_data)
Now, we have a "data" directory that looks like this:
- data
- CIK1_sim_scores_full.csv
- CIK2_sim_scores_full.csv
...
Of course, we need to consolidate each CIK's data into a single dataset.
def MakeDataset(file_list, pathname_full_data):
'''
Consolidates CIK datasets into a
single dataset.
Parameters
----------
file_list : list
List of .csv files to merge.
pathname_full_data : str
Path to directory to store
full dataset.
Returns
-------
None.
'''
# Initialize dataframe to store results
data = pd.DataFrame(columns=['date', 'cosine_score', 'jaccard_score', 'cik'])
# Iterate over files and merge all together
for file_name in tqdm(file_list):
new_data = pd.read_csv(file_name)
data = data.append(new_data, sort=True)
# Store result
os.chdir(pathname_full_data)
data.to_csv('all_sim_scores.csv', index=False)
return
pathname_full_data = '< YOUR FULL DATA PATHNAME HERE >'
os.chdir(pathname_data)
file_list = [fname for fname in os.listdir() if not fname.startswith('.')]
MakeDataset(file_list, pathname_full_data)
The final step is to transform the data into a format appropriate for Self-Serve Data. This means that we want a dataset with one set of factor values per ticker per day; in other words, each day-ticker pair should have a cosine_score and a jaccard_score value.
In this step, we'll need to:
- Map each CIK back to its ticker using ticker_cik_df.
- Reshape the data so that there is one row per ticker per day (inserting NaNs for missing values).
sim_scores_full = pd.read_csv('all_sim_scores.csv')
# Cast CIKs as strings
sim_scores_full['cik'] = [str(x) for x in sim_scores_full['cik']]
# Merge to map tickers to CIKs
sim_scores_ticker = sim_scores_full.merge(ticker_cik_df, how='left', on='cik')
# Drop CIK column
sim_scores_ticker.drop(labels=['cik'], axis='columns', inplace=True)
# Drop NaN values
sim_scores_ticker.dropna(axis='index', how='any', subset=['jaccard_score', 'cosine_score'], inplace=True)
The sim_scores_ticker data has one row for each filing, listing the set of factor values, ticker, and date. However, some day-ticker pairs have no associated factor values. We need to manipulate this data so that we have one row per ticker per day.
To do this, we'll begin with an empty dataframe that contains one row per ticker per day. We'll then join this formatted empty dataframe (empty_data) with our actual data (sim_scores_ticker) in such a way that preserves all the rows of empty_data. We'll end up with a dataframe that contains all the data from sim_scores_ticker, with NaNs inserted for day-ticker pairs where we're missing values.
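As a toy illustration of this join pattern (hypothetical values, not pipeline data), a left join on the empty frame keeps every day-ticker row and fills in scores only where they exist:

import pandas as pd
import numpy as np

# Hypothetical mini example of the join pattern used below (toy values only)
empty = pd.DataFrame({'date': ['2016-01-04', '2016-01-05'],
                      'ticker': ['abc', 'abc'],
                      'cosine_score': [np.nan, np.nan]}).set_index(['date', 'ticker'])
actual = pd.DataFrame({'date': ['2016-01-05'],
                       'ticker': ['abc'],
                       'cosine_score': [0.97]}).set_index(['date', 'ticker'])
joined = empty.join(actual, how='left', lsuffix='_empty')
# 2016-01-04 keeps NaN; 2016-01-05 picks up 0.97 from the actual data
print(joined)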
def InitializeEmptyDataframe(start_date, end_date, tickers):
'''
Initializes an empty DataFrame with all correct indices
(1 entry/ticker/day)
Parameters
----------
start_date : datetime.datetime
Start date of dataframe.
end_date : datetime.datetime
End date of dataframe.
tickers : list
List of tickers.
'''
window_length_days = int((end_date - start_date).days)
date_list = [start_date+timedelta(days=x) for x in range(0, window_length_days)]
long_date_list = date_list * len(tickers)
long_date_list = [x.strftime('%Y-%m-%d') for x in long_date_list]
list.sort(long_date_list)
empty = pd.DataFrame(data={'date': long_date_list,
'ticker': tickers*len(date_list),
'jaccard_score': [np.nan]*len(tickers)*len(date_list),
'cosine_score': [np.nan]*len(tickers)*len(date_list)})
empty = empty.groupby(['date', 'ticker']).sum()
empty['jaccard_score'] = np.nan
empty['cosine_score'] = np.nan
return empty
# Initialize empty dataframe
start_date = datetime(2013, 1, 1)
end_date = datetime(2018, 1, 1)
tickers = list(set(sim_scores_ticker['ticker']))
empty_data = InitializeEmptyDataframe(start_date, end_date, tickers)
(Note that we set start_date = datetime(2013, 1, 1)
. This is because Self-Serve Data has a maximum file size of 300 MB; too large of a date range will exceed that maximum. The dataset spanning 2013-2018 is 250 MB.)
empty_data.head()
# Format sim_scores data for merging
sim_scores_formatted = sim_scores_ticker.dropna(axis='index', how='any', subset=['jaccard_score', 'cosine_score'])
sim_scores_formatted = sim_scores_formatted.groupby(['date', 'ticker']).agg('mean')
Note that we use the .agg('mean') aggregator. This means that we'll take any rows that match each ticker-day pair and average the factor values.
In most cases, there should only be one row per ticker-day pair; however, there are some cases where 10-Ks and 10-Qs are filed on the same day, thus creating the need for .agg('mean').
sim_scores_formatted.head()
formatted_data = empty_data.join(sim_scores_formatted,
how='left',
on=['date', 'ticker'],
lsuffix='_empty')
formatted_data.drop(labels=['cosine_score_empty', 'jaccard_score_empty'], axis='columns', inplace=True)
formatted_data.head()
Our final step is to forward-fill the values by one quarter (approximately 90 calendar days). First, let's sort the data by ticker, then by calendar day:
forward_filled_data = formatted_data.reset_index().sort_values(by=['ticker', 'date'])
forward_filled_data.head()
forward_filled_data.fillna(method='ffill', limit=90, inplace=True)
forward_filled_data.head()
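One design note: because the rows are sorted by ticker and then by date, a plain fillna fills straight down each column, so in principle the last scores for one ticker could spill into the leading empty rows of the next ticker. A per-ticker variant (a minimal sketch of an alternative under that assumption, not what was run above) would group by ticker before filling:

# Sketch: forward-fill within each ticker only, again limited to ~90 days
score_cols = ['cosine_score', 'jaccard_score']
per_ticker_filled = formatted_data.reset_index().sort_values(by=['ticker', 'date'])
per_ticker_filled[score_cols] = (
    per_ticker_filled.groupby('ticker')[score_cols].ffill(limit=90)
)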
We now have one row per ticker per day and we've forward-filled the values by 90 days (roughly one quarter), so we're ready to save our data as a .csv file and upload it to Self-Serve.
forward_filled_data.to_csv('lazy_prices_data.csv', index=False)
We're done creating our dataset! For the next steps in our analysis, see the Alphalens notebook.