String Columns Now Available in Pipeline

Hi All,

One of the biggest remaining holes in Pipeline since the launch of Classifiers has been the lack of support for string-typed data. Support for strings was merged into Zipline about a week ago, and as of today Quantopian supports loading string data in Pipelines.

There are two major use-cases for strings:

1. Converting them into booleans via string-matching predicates (e.g. "startswith").
2. Using them as grouping keys to transform numerical expressions (e.g. z-scoring asset returns by country code).

The groupby use-case works for strings exactly the way it does for integer columns like SectorCode. The Classifier announcement post provides an overview of grouping operations, and there's a new Working with Strings section in the Pipeline docs that provides another example with a string column.
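To make the grouping idea concrete, here is a minimal plain-Python sketch of z-scoring returns within each country code. It is not the Pipeline API: the function name zscore_by_group, the tickers, and the sample values are all hypothetical, and the use of the population standard deviation is an assumption of this sketch.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_group(values, groups):
    """Z-score each asset's value against the assets sharing its group label.

    Hypothetical helper for illustration only; not part of Pipeline.
    """
    # Collect the values that belong to each string label.
    members = defaultdict(list)
    for asset, label in groups.items():
        members[label].append(values[asset])

    result = {}
    for asset, label in groups.items():
        mu = mean(members[label])
        sigma = pstdev(members[label])  # population std: an assumption
        # A group with zero spread has no meaningful z-score.
        result[asset] = (values[asset] - mu) / sigma if sigma else float("nan")
    return result


# Made-up sample data: daily returns and a string-typed country code.
returns = {"AAA": 0.01, "BBB": 0.03, "CCC": -0.02, "DDD": 0.02}
country = {"AAA": "US", "BBB": "US", "CCC": "CA", "DDD": "CA"}

zscores = zscore_by_group(returns, country)
```

Each asset is compared only with assets that carry the same string label, which is the same role an integer label such as SectorCode plays in a grouped transform.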

The use-case of implementing filters based on string data is supported by a suite of new methods on Classifier.

More information on each of these methods is available in the Classifier API Reference.
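As a rough illustration of turning a string column into a boolean filter, the plain-Python sketch below mimics a "startswith" predicate. It does not use the Classifier methods themselves: the helper startswith_filter, the exchange values, and the choice to treat missing values as False are all assumptions made for this example.

```python
def startswith_filter(column, prefix):
    """Return a boolean mask that is True where the string starts with prefix.

    Hypothetical helper for illustration only; missing values (None) are
    treated as False, which is an assumption of this sketch.
    """
    return {
        asset: value is not None and value.startswith(prefix)
        for asset, value in column.items()
    }


# Made-up string column: an exchange name per asset, with one missing value.
exchange = {"AAA": "NYSE", "BBB": "NYSE ARCA", "CCC": "NASDAQ", "DDD": None}

on_nyse = startswith_filter(exchange, "NYSE")

# Keep only the assets that pass the filter.
universe = sorted(asset for asset, keep in on_nyse.items() if keep)
```

The resulting mask can be combined with other boolean criteria, which is how string predicates would feed into universe selection.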

To demonstrate the kinds of operations one might want to do with string-based filters, I've attached a notebook that implements 9 common universe selection criteria in Pipeline and analyzes their outputs.

This analysis is a step toward eventually providing recommended synthetic trading universes (e.g. a "Quanto 500" or "Quanto 3000") as efficient Pipeline built-ins, so I'm interested to hear if there are other interesting filtering criteria that could be included in the analysis.

Scott

> This analysis is a step toward eventually providing recommended synthetic trading universes (e.g. a "Quanto 500" or "Quanto 3000") as efficient Pipeline built-ins, so I'm interested to hear if there are other interesting filtering criteria that could be included in the analysis.

Bad data is a problem, and will continue to be one. It would be great if bad data could be added, as soon as it is encountered, to lists that Pipeline could then use for filtering. Possible? Presumably you are already thinking along these lines, since any recommended synthetic trading universes would have clean point-in-time minute-level OHLCV bar data, along with any associated auxiliary data provided by Quantopian. Another approach would be to provide canned routines, kept up to date by Quantopian, that screen for bad data using Pipeline: give the tool a list of securities, and it returns the good and bad ones, perhaps along with information about the problems. Possible?