Quantopian's community platform is shutting down. Please read this post for more information and download your code.
Back to Community
A question on statistical operation with the pandas data frame from get_fundamentals

I have the results of a get_fundamentals query

fundamental_df=get_fundamentals(Q)  

So now fundamental_df is a pandas frame with the symbols in rows and and the differnt fundamental fields in columns. Lets say I have "ebitda" and "total_revenue" fields in columns

I am trying to
1. calculate the avergage and standard deviation across all symbols for the "ebitda" and "total_revenue" columns
2. Eliminate from the pandas frame all rows/symbols that fall outside, say 3 standard deviations from the average

I am still learning how to manipulate/calculate with a pandas data frame.

Can some one show me how to do this?

Thanks,
Sarvi

5 responses

The attached backtest is an example with a lower cutoff (1 SD) and just 3 stocks for easier viewing. If you run it and check the logs, you'll see that XOM is eliminated because its total_revenue exceeds the positive cutoff, while GOOG_L is eliminated because its ebitda is less than the negative cutoff. AAPL survives the cutoff tests.

Thanks Michael, this was both useful and instructive.

It helped me understand the basics of the pandas data frame.

Except for how this piece of code.
Not sure I understand how this sets the values to NaN
Though I do see it happenning

# Replace fundamental values exceeding cutoff with NaN  
    marked_df = fundamental_df[np.abs(zs) < cutoff_z]  
    log.info('Marked with NaN: \n{mk}'.format(mk=marked_df))  

From what I can tell np.abs(zs) < cutoff_z produces True/False in the various fields,
But how does fundamental_df[np.abs(zs) < cutoff_z] produce NaN values?
What does indexing a pandas dataframe with another pandas dataframe do?
2014-03-03before_trading_start:46INFOnp.abs(zs) < cutoff_z: security Security(24 [AAPL]) Security(8347 [XOM]) Security(26578 [GOOG_L]) ebitda True True False total_revenue True False True

Sarvi

Also, when trying to apply this code segment to my fundamental_df I ran into a problem with the code

ValueError:  
operands could not be broadcast together with shapes (44928,) (16,)  

The only difference is that my Query has lot more stocks that match, 4000 or so

How can I apply to just some rows and not others.
For example, my fundamental_df may in the future also have "share_class_status" which is a value with strings "A"/"D" and does not lend well to arithmetic operations

When you index one pandas DataFrame with a second similarly shaped boolean DataFrame, locations that are False in the second are replaced with NaN in the first. Locations that are True in the second keep their original values in the first.

The ValueError was probably because your larger DataFrame included non-numeric fundamentals. As shown in the attached backtest, you can select numeric rows to work on separately:

    # Select numeric fundamentals for further filtering  
    numeric_df = fundamental_df.loc[['ebitda', 'total_revenue']]

After eliminating stocks that don't meet your numeric criteria, you can reunite the remaining stocks with their non-numeric fundamentals:

    # Get all numeric and non-numeric fundamentals for securities that  
    # passed tests, by selecting those securities from original df.  
    passing_df = fundamental_df[trimmed_df.columns]

Awesome. Works.

Thanks,
Sarvi