This is the second post in a series on using Machine Learning in pairs trading (the first post is here). This post was motivated by the question:
Is it possible to find eligible pairs without using any price data at all?
This seems like an impossible task. After all, isn't a pairs trading strategy a price-driven strategy? In this post I suggest that it is indeed possible. Using a public dataset of text descriptions of companies, I train a Machine Learning model to read about companies and find company pairs with similar descriptions. A visual analysis of the discovered pairs suggests that their prices move together. This post uses basic concepts from Natural Language Processing and classes in scikit-learn, all available in Quantopian Research. In addition to being directly applicable to finding eligible pairs, this post is also a general example of a process for reducing unstructured data to structured form.
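To make the idea concrete before going into detail, here is a minimal sketch of finding similar companies from text alone. It assumes a TF-IDF representation compared by cosine similarity, which is one common choice and not necessarily the model used later in this post. The tickers and descriptions are made up for illustration, and `most_similar_pairs` is a hypothetical helper name.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical tickers and descriptions, invented for this sketch.
descriptions = {
    "AAA": "regional bank offering commercial loans and consumer deposits",
    "BBB": "bank holding company providing commercial loans and deposits",
    "CCC": "oil and gas exploration and production company",
    "DDD": "independent oil and natural gas exploration company",
}


def most_similar_pairs(descriptions, top_n=2):
    """Return the top_n (ticker, ticker, similarity) tuples, most similar first."""
    tickers = list(descriptions)
    docs = [descriptions[t] for t in tickers]

    # Turn each description into a TF-IDF vector (unstructured -> structured).
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

    # Pairwise cosine similarity between all description vectors.
    sim = cosine_similarity(tfidf)

    # Keep each unordered pair once by looking only above the diagonal.
    i_idx, j_idx = np.triu_indices(len(tickers), k=1)
    order = np.argsort(-sim[i_idx, j_idx])[:top_n]
    return [
        (tickers[i_idx[k]], tickers[j_idx[k]], float(sim[i_idx[k], j_idx[k]]))
        for k in order
    ]


pairs = most_similar_pairs(descriptions)
for a, b, score in pairs:
    print(a, b, round(score, 3))
```

No price data enters this calculation. Whether the prices of the resulting pairs actually move together is a separate question that has to be checked afterwards.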