Identifying industry average values for fundamentals

Hi,

Suppose I am interested in the mean of ROA ratios for a particular sector that the stock trades in, does Quantopian allow this?


I haven't played around with Fundamentals, but the documentation can be found here:
https://www.quantopian.com/help/fundamentals

Yup, I'm betting you can accomplish this with get_fundamentals in Quantopian Research.

There's a tutorial for using fundamentals in the Tutorials folder in Research.

You can identify the sector by querying for the symbol or sid in get_fundamentals, and then you can query the ROA for that sector for a particular date or set of dates. The fields are described in the link provided by Andrew above.
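Once the query results are in hand, the aggregation itself is ordinary pandas. As a sketch (the column names and sector codes below are made up for illustration; they are not the actual Morningstar field names a real get_fundamentals query would return):

```python
import pandas as pd

# Stand-in for the kind of per-asset table a fundamentals query might
# yield; "sector_code" and "roa" are illustrative names only.
df = pd.DataFrame({
    "sector_code": [311, 311, 206, 206, 206],
    "roa":         [0.08, 0.12, 0.05, 0.07, 0.09],
}, index=["AAPL", "MSFT", "JNJ", "PFE", "MRK"])

# Mean ROA per sector
sector_mean_roa = df.groupby("sector_code")["roa"].mean()

# Mean ROA for the sector a particular stock trades in
aapl_sector = df.loc["AAPL", "sector_code"]
print(sector_mean_roa[aapl_sector])  # mean ROA of AAPL's (made-up) sector
```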

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

For anyone else looking to do this, here is an example of a custom factor I wrote which calculates the return / gain percentage of a given industry:

def GainPctInd(offset=0, nbars=2):  
    class GainPctIndFact(CustomFactor):  
        window_length = nbars + offset  
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]  
        def compute(self, today, assets, out, close, industries):  
            num_bars, num_assets = close.shape  
            newest_bar_idx = (num_bars - 1) - offset  
            oldest_bar_idx = newest_bar_idx - (nbars - 1)

            # Compute the gain percents for all stocks  
            asset_gainpct = ((close[newest_bar_idx] - close[oldest_bar_idx]) / close[oldest_bar_idx]) * 100

            # For each industry, build a list of the per-stock gains over the given window  
            unique_ind = np.unique(industries[0,])  
            for industry in unique_ind:  
                ind_view = asset_gainpct[industries[0,] == industry]  
                ind_mean = np.nanmean(ind_view)  
                out[industries[0,] == industry] = ind_mean  
    return GainPctIndFact()  

I don't think there are any remaining bugs in it at this point, but you should test it yourself before you trust it with real money.

Anyway, how it works is rather slick / interesting (I'm new to NumPy, so it seems that way to me, at least).

The first line is a function definition:

def GainPctInd(offset=0, nbars=2):  

The offset parameter lets me pick how far back in time I want the return calculation to end. So, with daily bars of data, 5 would represent 1 week ago. The nbars parameter lets me pick how many bars (days) of data I want to include in the calculation. So, for example, if I want the industry return for the past month, I would use GainPctInd(0, 20), because there are 5 trading days per week and roughly 4 weeks in a month. If I wanted the industry return for the previous month, I would use GainPctInd(20, 20). The offset really controls where you want the calculation to stop, while offset + nbars tells you where it begins. This is because the returns are always calculated forward in time, not in reverse.
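The bar-index arithmetic can be sanity-checked on its own, outside the factor. The helper below is just a standalone restatement of the formulas used in compute (the function name is mine, not part of the factor):

```python
def window_indices(num_bars, offset, nbars):
    """Return (oldest_bar_idx, newest_bar_idx) for a window of
    `nbars` bars ending `offset` bars before the most recent bar."""
    newest = (num_bars - 1) - offset
    oldest = newest - (nbars - 1)
    return oldest, newest

# Current 4-week return: GainPctInd(0, 20) gives window_length = 20
print(window_indices(20, 0, 20))   # (0, 19)

# Previous 4-week return: GainPctInd(20, 20) gives window_length = 40,
# and the calculation occupies the oldest 20 bars of that buffer
print(window_indices(40, 20, 20))  # (0, 19)
```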

The next thing you see in the code is the class definition. You will notice window_length is dynamically adjusted based on offset and nbars:

window_length = nbars + offset  

Even though we are returning the class at the end of the function, this works because the class is "closed over" by the function. If you don't understand how this works, you probably want to go read up on closures. But, in short, because the function wraps around the class definition, any variables visible to the function are also visible inside the class, and they retain the values they held at the moment the class was defined. This is essentially a class-factory function: every time it is called, we generate a new class definition with the new values of offset and nbars baked in.
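A minimal, self-contained illustration of the same class-factory pattern, unrelated to the factor itself (the names here are invented purely to show the closure mechanics):

```python
def make_multiplier_class(factor):
    # `factor` is captured from the enclosing function's scope;
    # each call to make_multiplier_class bakes in a new value.
    class Multiplier:
        def apply(self, x):
            return x * factor
    return Multiplier

Double = make_multiplier_class(2)
Triple = make_multiplier_class(3)
print(Double().apply(10))  # 20
print(Triple().apply(10))  # 30
```

Each generated class remembers its own `factor`, just as each GainPctIndFact remembers its own offset and nbars.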

Inside the compute function, for readability's sake, I started out by deriving the index locations of the most recent bar and the oldest bar of close prices, based on my offset and nbars. I chose to calculate the values this way because I feel it is easier to glance at and determine correctness versus using negative indices. For reference, this is the code I just discussed:

            num_bars, num_assets = close.shape  
            newest_bar_idx = (num_bars - 1) - offset  
            oldest_bar_idx = newest_bar_idx - (nbars - 1)  

Next, we get to the first NumPy calculation:

asset_gainpct = ((close[newest_bar_idx] - close[oldest_bar_idx]) / close[oldest_bar_idx]) * 100  

Unless you are familiar with NumPy (I wasn't when I started writing this), it probably doesn't do what you think it does. To an uninitiated reader, it looks like it calculates the return of a single asset. However, the close variable is actually a multi-dimensional NumPy array. Its shape is (window_length, num_assets). That means it contains the close prices for every asset for every day requested by the window_length. The 0-axis is the window_length axis, which lets you reference the day for which you want a close price, while the 1-axis lets you reference the specific asset. At the time I tested, ticker symbol AAPL was asset index 2. So, if I want to grab the oldest available close price for Apple, I would reference close[0, 2]. If I wanted all close prices for Apple, I would reference close[0:, 2], which essentially says "give me a new array containing all the values along the 0-axis for the asset at asset-index 2." Now, I say essentially because it doesn't really give you a new array. It actually gives you a view into the existing array.

So, what do I mean by view? Well, if you are familiar with C / C++, or just pointers in general, it is basically like constructing an array of pointers, where the pointers point to the locations in memory where the data for the close array is held. When you perform a NumPy operation on the view, it dereferences those pointers (follows them to their locations in memory) and uses the data stored there. This is important, because it means you are not actually copying the data to a new location in memory when you construct a view. This brings us to how the line of code actually works...
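Before getting to that line, the pointer-like behavior of views can be observed directly on a small standalone array. Note the contrast with Boolean indexing (used later in the factor), which returns a copy rather than a view:

```python
import numpy as np

# Toy array: 3 bars x 2 assets (values are made up)
close = np.array([[10.0, 20.0],
                  [11.0, 22.0],
                  [12.0, 24.0]])

row = close[1]        # basic indexing -> a view into `close`
row[0] = 99.0         # writing through the view...
print(close[1, 0])    # ...changes the original array: 99.0

sub = close[close > 50]  # Boolean indexing -> a copy, not a view
sub[0] = 0.0             # writing into the copy...
print(close[1, 0])       # ...leaves the original untouched: 99.0
```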

The close[newest_bar_idx] construct creates a view of the close array which looks like a 1-dimensional array of the close prices for all assets on the day referred to by newest_bar_idx. The same thing happens for the close[oldest_bar_idx] piece of the equation. NumPy knows how to subtract one view from another, so it traverses the close array, subtracts the data pointed to by the close[oldest_bar_idx] view from the data pointed to by the close[newest_bar_idx] view, and stores the result in a new temporary NumPy array. It then applies the division and arrives at another temporary array, and finally applies the multiplication, which produces the array we assign to the name asset_gainpct. Now, this may be further optimized inside NumPy or Python to avoid making multiple temporary arrays; it might simply update an existing temporary array in place as it goes. I don't really know for sure. But the final result is that asset_gainpct is a new NumPy array holding its own data, and that data is the computed return for every asset over the selected time period.
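On a toy close array, the same one-liner looks like this (the prices, and the idea that there are exactly 3 bars and 4 assets, are made up for illustration):

```python
import numpy as np

# Toy close-price history: 3 bars x 4 assets
close = np.array([[100.0, 50.0, 10.0, 8.0],
                  [101.0, 49.0, 11.0, 8.4],
                  [110.0, 45.0, 12.0, 9.2]])

newest_bar_idx, oldest_bar_idx = 2, 0  # as if offset=0, nbars=3

# One expression computes the percent return for ALL assets at once
asset_gainpct = ((close[newest_bar_idx] - close[oldest_bar_idx])
                 / close[oldest_bar_idx]) * 100
print(asset_gainpct)  # approximately [10, -10, 20, 15]
```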

Finally, we get to the FOR loop:

# For each industry, build a list of the per-stock gains over the given window  
            unique_ind = np.unique(industries[0,])  
            for industry in unique_ind:  
                ind_view = asset_gainpct[industries[0,] == industry]  
                ind_mean = np.nanmean(ind_view)  
                out[industries[0,] == industry] = ind_mean  

This is what I consider to be the "slickest" part of the factor. We start by creating a new NumPy ndarray called unique_ind which contains the unique industry codes from the view industries[0,], which represents the oldest day of data for all assets. Then, we loop through those industry codes and do the following:

  1. Construct a new array from asset_gainpct containing only the data for the assets that have the industry code we are currently working with
  2. Compute the mean of all the returns for the currently selected industry code
  3. Update the out variable with that mean, but only at the locations which represent an asset with the currently selected industry code

One of the interesting things about NumPY arrays is that you can evaluate a Boolean condition across an array to return a new Boolean array which can be used as the index for another array. So, in the code:

ind_view = asset_gainpct[industries[0,] == industry]  

This first creates a view of industries[0,] which, as explained previously, is the oldest day of industry codes for all assets. Across that view, it evaluates the Boolean condition == industry and generates a new ndarray of Booleans describing how each index of the source view evaluated the condition. Thus, in indices where the industry matched, the value stored is True, while in all other indices the value stored is False. Using this Boolean array, asset_gainpct is filtered down to a new array (which we named ind_view, although Boolean indexing actually returns a copy rather than a true view), and that array contains only the entries from asset_gainpct where the Boolean array stored a True value. So, when we get to our next line:

ind_mean = np.nanmean(ind_view)  

We can use ind_view to calculate the mean of asset_gainpct for all assets in the selected industry (and only those assets). The nanmean() function simply ignores entries which are not a number. The last line uses a similar trick to update out:

out[industries[0,] == industry] = ind_mean  

In this line, we once again use a Boolean array, this time to index out directly, and assign a single np.float64 value to all of the positions where an asset is in the selected industry. Boolean-indexed assignment writes straight into the original array, so it results in a sparse update of the out array.

Each time we complete the loop, we fill in one more industry worth of values into the out array.
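The whole loop can be exercised on toy data to watch the sparse updates accumulate. The gains and industry codes below are made up, and include a NaN gain to show why nanmean is used:

```python
import numpy as np

# Made-up per-asset gains (one NaN) and industry codes, one per asset
asset_gainpct = np.array([10.0, -10.0, 20.0, 15.0, np.nan])
industry_row  = np.array([311, 311, 206, 206, 206])

out = np.empty_like(asset_gainpct)
for industry in np.unique(industry_row):
    mask = industry_row == industry              # Boolean flag per asset
    out[mask] = np.nanmean(asset_gainpct[mask])  # sparse fill per industry

print(out)  # [0.0, 0.0, 17.5, 17.5, 17.5]
```

Industry 311 averages to 0.0, and industry 206 averages 20 and 15 (ignoring the NaN) to 17.5; each loop iteration fills in one industry's slots.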

Hey Eliot

I took a look at your factor and condensed down the code a little bit, hopefully you can adapt this to your project.

class IndustryMeanGain(CustomFactor):  
    inputs = [USEquityPricing.close, morningstar.asset_classification.morningstar_industry_code]  
    def compute(self, today, assets, out, close, industry_codes):  
        df = pd.DataFrame(index=assets, data={"cum_returns": (close[-1] / close[0]) - 1,  
                                              "industry_codes": industry_codes[-1]})  
        out[:] = df.groupby("industry_codes").transform(np.mean).values.flatten()  
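To see what the groupby / transform combination does here, a toy frame works (the values and industry codes are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cum_returns":    [0.10, -0.10, 0.20, 0.15],
    "industry_codes": [311, 311, 206, 206],
})

# transform broadcasts each group's mean back onto every row of the
# group, so the result has one value per asset, aligned with the input
means = df.groupby("industry_codes").transform(np.mean).values.flatten()
print(means)  # approximately [0, 0, 0.175, 0.175]
```

This is why no explicit loop is needed: transform does the per-group aggregation and the per-asset broadcast in one step.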

Thanks! I was wondering how to eliminate that other FOR loop. :)

Though, I should mention that as it is written now, you can't do things like compute the previous 4-week return of an industry to compare against its current 4-week return.

I tried to adapt your changes, but it doesn't seem to work:

def GainPctInd2(offset=0, nbars=2):  
    class GainPctIndFact2(CustomFactor):  
        window_length = nbars + offset  
        inputs = [USEquityPricing.close, ms.asset_classification.morningstar_industry_code]  
        def compute(self, today, assets, out, close, industries):  
            df = pd.DataFrame(index=assets, data={  
                    "gain": ((close[-1 - offset] / close[(-1 - offset) - (nbars - 1)]) - 1) * 100,  
                    "industry_codes": industries[-1]  
                 })  
            out[:] = df.groupby("industry_codes").transform(np.nanmean).values.flatten()  
    return GainPctIndFact2()  

This returns:

ValueError: Length mismatch: Expected axis has 5379 elements, new
values have 7811 elements

On this line:

results = pipeline_output('my_pipeline')  

Did I miss something?

It looks like np.nanmean is excluding all securities with NaN as their industry code. The error is raised because the function passed to the transform doesn't apply to all securities (hence the length mismatch). You can solve this by using np.mean instead of np.nanmean. In general, pipeline errors always refer to the pipeline_output() line, which is something we need to improve. But if you want to see where an error truly occurred, try using the builtin debugger tool. In this case, I put a breakpoint on the line out[:] = df.groupby("industry_codes").transform(np.nanmean).values.flatten() and played around with the code at the command line using functions like len().

I hope this helps!


Thanks, that was indeed the issue. However, won't it screw up the calculations if NaN values are considered in the averages? By my reading, np.mean() propagates NaN: any group containing even one NaN return will average to NaN, because the NaN poisons the summation. On the other hand, np.nanmean() ignores the NaN values both in the summation and in the count of values it's averaging, so the result is an average of the non-NaN values as if the NaN values never existed in the first place. This seems like a more accurate way to compute the average of the values.
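The difference is easy to demonstrate on a tiny array:

```python
import numpy as np

returns = np.array([10.0, 20.0, np.nan])

print(np.mean(returns))     # nan  -- one NaN poisons the plain mean
print(np.nanmean(returns))  # 15.0 -- NaN entries are dropped entirely
```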

Eliot,

That was my reaction too but it looks like np.nanmean() is dropping the rows that have NaN industry codes. Using the debugger, I first tried:

len(df[df['industry_codes'].notnull()])  

Which resulted in 5379. I then tried:
len(df[df['industry_codes'].notnull()].dropna())  

Which yielded 5166.

So there seem to be 213 securities that have an industry code but have NaN values for cum_returns.

I did some experimenting and I think this code will get you what you want:

nans = np.isnan(df['industry_codes'])  
notnan = ~nans  
out[notnan] = df[df['industry_codes'].notnull()].groupby("industry_codes").transform(np.nanmean).values.flatten()  
out[nans] = np.nan  

The idea is that the NaN industry codes are first removed, and then nanmean is only excluding securities with NaN cum_returns. As a sanity check, I compared the above code with the np.mean equivalent (just replacing np.nanmean with np.mean), and indeed a few of the industry means are different.
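A standalone sketch of the same masking idea on made-up data (the gains and industry codes below are invented; one asset has a NaN gain and one has a NaN industry code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gain":           [10.0, np.nan, 20.0, 15.0, 5.0],
    "industry_codes": [311, 311, 206, 206, np.nan],
})

nans = np.isnan(df["industry_codes"].values)
out = np.empty(len(df))

# Drop NaN industry codes first, so nanmean only has to cope with
# NaN gains; then broadcast each industry's mean back to its rows
out[~nans] = (df[df["industry_codes"].notnull()]
              .groupby("industry_codes")["gain"]
              .transform(np.nanmean).values)
out[nans] = np.nan

print(out)  # approximately [10, 10, 17.5, 17.5, nan]
```

Industry 311 averages its one non-NaN gain to 10.0 (including for the row whose gain was NaN), industry 206 averages to 17.5, and the asset with no industry code stays NaN.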

Let me know what you think!

Hi,

I know it's a bit late; I just recently came across your code. It did not produce correct results for me, so I refactored it. Hope this helps!