How can I analyze categorical data in notebook?

I am trying to analyze the profitability grade from morningstar but it is a string like 'A', 'B', 'C', 'D', 'F'. How could I assign a number to each letter in pipeline?

3 responses

Great question! The profitability grades in Morningstar are Classifiers (see https://www.quantopian.com/docs/api-reference/pipeline-api-reference#zipline.pipeline.Classifier). Classifiers typically return string or integer values. To turn these values into real numbers and return them as a factor, you need to write a small custom factor. The custom factor takes the classifier as its input and replaces each value with whatever number you choose.

Of course, nothing is ever that simple. The one catch is that the classifier values passed to a custom factor have the type LabelArray, which doesn't support the usual numpy methods. To use numpy methods, you must first convert the LabelArray into a numpy string array. Zipline provides an as_string_array() method that does just that. So a custom factor like the following will work:

import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data import Fundamentals

class Profitability_Grade_As_Number(CustomFactor):
    inputs = [Fundamentals.profitability_grade.latest]
    window_length = 1

    def compute(self, today, assets, out, grade):
        # First convert grade, which is a LabelArray, into a string array
        grade = grade.as_string_array()

        # The values of grade can be A, B, C, D, F and then + or -, or None
        # None is left untouched and becomes nan when written to out
        grade = np.where(grade == 'A+', 5.25, grade)
        grade = np.where(grade == 'A', 5.0, grade)
        grade = np.where(grade == 'A-', 4.75, grade)
        grade = np.where(grade == 'B+', 4.25, grade)
        grade = np.where(grade == 'B', 4.0, grade)
        grade = np.where(grade == 'B-', 3.75, grade)
        grade = np.where(grade == 'C+', 3.25, grade)
        grade = np.where(grade == 'C', 3.0, grade)
        grade = np.where(grade == 'C-', 2.75, grade)
        grade = np.where(grade == 'D+', 2.25, grade)
        grade = np.where(grade == 'D', 2.0, grade)
        grade = np.where(grade == 'D-', 1.75, grade)
        grade = np.where(grade == 'F', 0.0, grade)
        out[:] = grade

You can of course assign whatever values you wish. Using the numpy where method is also just personal preference; there are a number of other approaches. The same thing can be accomplished with bracket notation, and the speed seems to be about the same. Something like this:

        # Using where as above could also be replaced with bracket notation and plain assignment
        # The two approaches below have the same result
        grade = np.where(grade == 'A+', 5.25, grade)
        grade[grade == 'A+'] = 5.25
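As one of those other approaches, here is a minimal sketch of a dictionary lookup in plain numpy. The names GRADE_TO_NUMBER and grades_to_numbers are made up for illustration, and the sample array is only a stand-in for what as_string_array() returns, not a real LabelArray.

```python
import numpy as np

# Hypothetical lookup table; the numbers are arbitrary and can be changed.
GRADE_TO_NUMBER = {
    'A+': 5.25, 'A': 5.0, 'A-': 4.75,
    'B+': 4.25, 'B': 4.0, 'B-': 3.75,
    'C+': 3.25, 'C': 3.0, 'C-': 2.75,
    'D+': 2.25, 'D': 2.0, 'D-': 1.75,
    'F': 0.0,
}

def grades_to_numbers(grades):
    """Map grade strings to floats; missing or unknown grades become nan."""
    flat = np.asarray(grades, dtype=object).ravel()
    numbers = np.array(
        [GRADE_TO_NUMBER.get(g, np.nan) for g in flat], dtype=float
    )
    return numbers.reshape(np.shape(grades))

# Stand-in for grade.as_string_array(): one row (window_length=1), four assets.
sample = np.array([['A+', 'B-', None, 'F']], dtype=object)
result = grades_to_numbers(sample)
```

Inside compute, this could replace the chain of np.where calls with something like out[:] = grades_to_numbers(grade.as_string_array()). Keeping the mapping in one dictionary makes a typo in a single value easier to spot.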

That's how you assign a number to each letter returned by a classifier. Hope that helped.


Thanks Dan. This works up to the point at which I use Q-grid to display the growth grade and integer for each stock. But it fails if I try to use it as a factor in Alphalens. I am getting a KeyError and it's pointing to the variable I am using to represent the growth grade integer.

factor = growth_int  

Can Alphalens analyze the integer as a factor? Do I need it to convert into some other data type?

This is the offending line:

factor=factor_data['factor'],  

in

# Decay Rate of the Alpha Factor  
longest_look_forward_period = 63 # week = 5, month = 21, quarter = 63, year = 252  
range_step = 5

merged_data = get_clean_factor_and_forward_returns(  
    factor=factor_data['factor'],  
    prices=pricing_data,  
    periods=range(1, longest_look_forward_period, range_step)  
)

mean_information_coefficient(merged_data).plot(title="IC Decay")  

Once you create a custom factor, that factor can be used just like any other factor. It can be sliced, diced, and passed to Alphalens for analysis if you wish. I'm not sure what is causing the error you are getting. Perhaps attach the notebook? In the meantime, here is a notebook that uses the integer factor version of profitability_grade in Alphalens. Check out the last several cells.
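One thing worth checking, offered only as a guess since the notebook isn't attached: a KeyError on factor_data['factor'] is what pandas raises when the dataframe has no column with that name. The sketch below uses a hand-built dataframe with made-up dates and tickers, and assumes the pipeline column was named 'growth_int'; none of this comes from the actual notebook.

```python
import pandas as pd

# Hypothetical pipeline output: a (date, asset) MultiIndex and one column
# named after the pipeline column, assumed here to be 'growth_int'.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(['2020-01-02', '2020-01-03']), ['AAA', 'BBB']],
    names=['date', 'asset'],
)
factor_data = pd.DataFrame({'growth_int': [5.0, 3.0, 4.75, 3.0]}, index=index)

# Asking for a column that does not exist raises KeyError.
try:
    factor_data['factor']
    missing = None
except KeyError as err:
    missing = err.args[0]

# Selecting the column under its real name gives the Series to analyze.
factor = factor_data['growth_int']
```

If that turns out to be the situation, printing factor_data.columns before calling Alphalens would show which name to use.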

One thing to watch out for: Alphalens can struggle with factors that have a small number of discrete values. In this case the values are a fixed set of numbers. In other cases there may simply be a lot of zeroes or ones. This sometimes doesn't work well when 'quantiles' are selected in get_clean_factor_and_forward_returns. The quantiles method tries to put an equal number of data points into each bin. With a small fixed number of factor values this may be impossible, and there will be an error during the 'binning' phase. Just something to watch for.
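The binning problem can be reproduced outside Alphalens with pandas alone. This is a rough sketch on made-up factor values, not the Alphalens code path itself: pd.qcut stands in for quantile binning and pd.cut for fixed-width binning.

```python
import pandas as pd

# Hypothetical factor values for one day: mostly zeros, few distinct values.
factor = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0])

# Quantile binning: five equal-sized bins need distinct bin edges, which
# this data cannot supply, so pandas raises a ValueError.
try:
    pd.qcut(factor, 5)
    quantile_error = None
except ValueError as err:
    quantile_error = str(err)

# Fixed-width binning: edges come from the value range, not from counts,
# so repeated values are not a problem.
fixed_bins = pd.cut(factor, 3, labels=False)
```

To my understanding, get_clean_factor_and_forward_returns also accepts a bins argument (used with quantiles=None) that bins by value range instead, which may be a better fit for a factor like this one.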