Quantopian's community platform is shutting down. Please read this post for more information and download your code.
Back to Community
Help me in creating quintiles from a Dataframe column

Hi!

I have a dataframe similar in the structure to the one in this notebook.

How can I output an analysis of just the elements in a column as Alphalens does, so with every quantile (in my case quintiles) lower and upper bound, the number of elements in each quantile, possibly a graph like Alphalens ?

6 responses

Something like this, to begin with:

min max mean    std count   count %  
factor_quantile  
1   -0.999794   0.227477    -0.253214   0.178138    922640  20.019381  
2   -0.407817   0.406284    -0.015098   0.113215    921293  19.990153  
3   -0.266352   0.611015    0.110669    0.112861    921294  19.990175  
4   -0.141869   0.946058    0.247604    0.131643    921292  19.990132  
5   -0.005262   33.430894   0.677265    0.652337    922215  20.010159  

So here's a little update, am I doing right?

The Number of elements column is not right, instead of

([2], [1.222, 2.6162460000000003]) I would like to get just
2

Good question! First, add a column to the dataframe which indicates which quintile each row belongs to

df['quintile'] = pd.qcut(df.factor, q=5, labels=[1, 2, 3, 4, 5])

Next, group on the quintile and use the aggregate or agg method to apply any calculations to each group. Put the names of the functions to apply in a list.

df_stats = df.groupby('quintile').factor.agg(['min', 'max', 'mean', 'std', 'count'])

The 'count %' function is a little special. It relies on knowing the sum of all the counts to get a percentage. The group functions only know about the factor values in their group and not the total. The simplest way to get the percent is so do this as a separate step. Note that one must use bracket, and not dot, notation because 'count' is also a method name.

df_stats['count %'] = df_stats['count']/df_stats['count'].sum()

See the attached notebook. Hope that's what you were looking for.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Dan, you always amaze me!
Thank you

sorry to disturb you again for a very noob question, can't we put the code in a function so that I can pass a dataframe and a column and it works?
Something like:

def quintile(df,column):  
    df['quintile'] = pd.qcut(df.column, q=5, labels=[1, 2, 3, 4, 5])  
    # First apply the functions which only rely on the groups  
    df_stats = df.groupby('quintile').column.agg(['min', 'max', 'mean', 'std', 'count'])  
    # The 'count %' stat relies on knowing the sum of all the counts so do this as a separate step  
    # Note that one must use bracket, and not dot, notation because 'count' is also a method name  
    df_stats['count %'] = df_stats['count']/df_stats['count'].sum()  
    df_stats  

if I have a dataframe called "df5" and it has a column with all numbers such as factor called "factor3" I would like to run
quintile(df5,factor3)

and have just this printed:
```

min max mean    std count   count %  

quintile
1 1.22200 2.44000 1.83100 0.861256 2 0.285714
2 3.32123 3.32123 3.32123 NaN 1 0.142857
3 6.88800 6.88800 6.88800 NaN 1 0.142857
4 7.84423 7.84423 7.84423 NaN 1 0.142857
5 13.00000 19.00000 16.00000 4.242641 2 0.285714
```

Always glad to help out @ Emiliano. It's good to know the pandas methods. Even if one doesn't remember exactly how they work, just knowing that things like groupby and agg exist is a start. I often can't recall the specifics so I google something like 'how to groupby with multiple functions in pandas 0.18'. My usual 'go to' links in the results are the pandas docs and stackoverflow. I add '0.18' in the search to get the docs for pandas version 0.18 which Quantopian currently uses. It's frustrating to find a solution only to realize it doesn't work because it relies on a more current version of pandas.

Also, I just noticed you asked about plotting. Not sure what you want to plot, but I often look at the distribution of factor values using the hist method. It really jumps out if the factor values are not evenly distributed or there are big outliers.