Calling MultiIndex Attributes after Reducing a DataFrame with iloc


posted Jan 7, 2019

Hi all,

Just a quick question I came up against when starting to look through Quantopian; I was wondering where I was going wrong.

I am trying to return the top 500 companies by market cap, having excluded all non-primary-share companies.

Now, I am aware there might be a better way to do this involving a more advanced screen filter than what I've used; however, I'd still like to know where the problem I'm facing arises from.

In essence, what I'm doing is running a very simple pipeline, screening to include only primary shares.
After that, I sort by my market cap column and then use iloc[:500,:] to slice my DataFrame to just the top 500 assets by market cap.

The odd thing is that from there, when I try to run df.index.levels[1] (to get a list of the sids, which I use to create a name column in my frame), the index returned is that of the original DataFrame, not the new sliced one. I've also tried creating a copy of the DataFrame, but to no avail.

For reference, here are my imports:

import pandas as pd  
import numpy as np  
import datetime as dt  
import matplotlib.pyplot as plt  
#Pipeline is the screening engine  
from quantopian.pipeline import Pipeline

#Importing Datasets  
from quantopian.pipeline.data import USEquityPricing as P_us  
from quantopian.pipeline.domain import US_EQUITIES  
from quantopian.pipeline.filters import fundamentals as Q_FF

# PARAMETERS TO  
from quantopian.pipeline.domain import BE_EQUITIES  
# from quantopian.pipeline.data import EquityPricing as P_int  
from quantopian.pipeline.filters import QTradableStocksUS

#Import Quantopian Functions  
import quantopian.pipeline.factors as QT_factors

# Run pipeline is the Screen Iterator function  
from quantopian.research import run_pipeline

# Import Alphalens , Quantopian's Factor analysis & plotting module  
import alphalens as al

Final= (run_pipeline(  
    Pipeline(  
        columns={"market_Cap":QT_factors.MarketCap(),  
                 "return":QT_factors.DailyReturns()},  
        screen=Q_FF.IsPrimaryShare()  
        ),  
        start_date='2019-01-04',  
        end_date='2019-01-04')  
         ).sort_values(by='market_Cap',ascending=False).iloc[:500,:]

From there, if you take the length of the index you get 500, which is correct, but try index.levels[1] and you get 4200+.

len(Final.index)  
len(Final.index.levels[1])

# try a copy, same problem

df=Final.copy()  
len(df.index.levels[1])

Any idea whether this is a bug or a mistake in how I am approaching this?

Also, if there is a version of Pipeline's screen parameter that allows me to do this, I'd be very grateful to be pointed to the docs. As I have not looked much into it myself, I'm asking for help on the specific issue above; anything further is a bonus!

Many thanks, let me know if you need any further information to replicate.

Best,

10 responses

Kyle M

Jan 8, 2019

I believe you're using iloc notation incorrectly, although I didn't actually test your code. iloc takes two arguments, the start index and the end index. I think what you're looking for is iloc[0, 499].

You can also accomplish what you're looking for using a filter like you thought! :)

I recommend something like the following:

market_cap = QT_factors.MarketCap()  
Final= (run_pipeline(  
    Pipeline(  
        columns={"market_Cap":  market_cap,  
                 "return":QT_factors.DailyReturns()},  
        screen=(Q_FF.IsPrimaryShare()  & market_cap.top(500))  
        ),  
        start_date='2019-01-04',  
        end_date='2019-01-04')  
         )

Clockworkss

Jan 8, 2019

Hi Kyle,

Many thanks for the answer. Unfortunately, I've tried both of these avenues. Using your code, I get fewer than 500 companies: 354 instead. The reason is that the filters are not applied in turn. The filter is saying "I want only the primary shares from the top 500 securities ranked by market cap" instead of "I want the top 500 companies, only counting primary shares".

I tried running something like (Q_FF.IsPrimaryShare() & market_cap.top).top(500), but this doesn't work, as the result of the filter is a boolean.
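The difference between the two orderings can be sketched in plain pandas, without Quantopian. This is only an analogue: the toy universe, its column names, and `n` are made up for illustration, and `rank`/`nlargest` stand in for what `top()` does with and without a mask.

```python
import pandas as pd

# Hypothetical toy universe: six securities, two of which are not primary shares.
universe = pd.DataFrame(
    {
        "market_cap": [600.0, 500.0, 400.0, 300.0, 200.0, 100.0],
        "is_primary": [True, False, True, False, True, True],
    },
    index=["A", "B", "C", "D", "E", "F"],
)
n = 3

# Analogue of `IsPrimaryShare() & market_cap.top(n)`:
# rank the whole universe first, then drop the non-primary names.
in_top_n = universe["market_cap"].rank(ascending=False) <= n
and_result = universe[universe["is_primary"] & in_top_n]

# Analogue of `market_cap.top(n, mask=IsPrimaryShare())`:
# restrict to primary shares first, then take the top n of what is left.
masked_result = universe[universe["is_primary"]].nlargest(n, "market_cap")

print(len(and_result), len(masked_result))  # 2 3
```

The first approach loses "B" after the ranking has already been done, so it returns fewer than `n` rows; the second ranks only the primary shares and always fills `n` slots when enough names are available.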

Regarding iloc, I believe my usage is not wrong, at least in terms of how the method functions. The iloc method can be used as:

iloc[start_row_n:end_row_n,start_col_n:end_col_n], with an omitted start defaulting to 0.
As my frame's size is 500, iloc is performing as it should; rather, it's the index attribute that is not.
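As a quick check of the iloc semantics described here, a minimal pandas-only sketch (the frame and its values are invented for illustration): slices select ranges of rows and columns, whereas a pair of plain integers selects a single cell.

```python
import pandas as pd

# Hypothetical five-row frame, already sorted by market cap.
frame = pd.DataFrame(
    {
        "market_cap": [50.0, 40.0, 30.0, 20.0, 10.0],
        "ret": [0.01, 0.02, 0.03, 0.04, 0.05],
    },
    index=list("abcde"),
)

# Slice form: rows 0..2 (end exclusive), all columns.
top3 = frame.iloc[:3, :]

# Integer pair: row 0, column 1, which is one scalar rather than a range.
cell = frame.iloc[0, 1]

print(top3.shape, cell)  # (3, 2) 0.01
```

On a two-column frame, iloc[0, 499] would therefore ask for column 499 of row 0 and raise an IndexError rather than return 500 rows.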

Hope this helps, and thanks for taking the time to help!

Vladimir

Jan 8, 2019

Top 500 companies by market_cap, only counting primary shares in IDE:

from quantopian.algorithm import attach_pipeline, pipeline_output  
from quantopian.pipeline import Pipeline, factors, filters  
from quantopian.pipeline.data import Fundamentals

def initialize(context):  
    schedule_function(trade, date_rules.every_day(), time_rules.market_open(minutes = 65))  
    ps = filters.fundamentals.IsPrimaryShare(mask = filters.QTradableStocksUS())  
    mc = Fundamentals.market_cap.latest  
    top_market_cap_primary = Fundamentals.market_cap.latest.top(500, mask = ps)  
    attach_pipeline(Pipeline(columns = {'market_Cap': mc}, screen = top_market_cap_primary), 'pipe') 

def trade(context, data):  
    output = pipeline_output('pipe')  
    top_mc = pipeline_output('pipe').sort_values(by = 'market_Cap', ascending = False).iloc[:500,:]  
    print (len(output), len(top_mc))

Clockworkss

Jan 8, 2019

Hi Vlad, thanks for the above!

Being new to Quantopian, I've not worked in the backtest IDE yet. I tried to run your script, but I don't see a console to execute in. I can execute and see the result, but not the DataFrame, as using the "build algo" button crashes from trying to import Fundamentals for some reason.

Anyway, as the print above works, I tried printing the index length at level 1 (the number of companies). The DataFrame has no level 1, so the trade function seems to iterate over the level 0 (date) index of your frame, meaning I can't validate the theory above, as it has no attribute .index.levels[1].
You can verify by trying print(len(output.index.levels[1])).

However, your answer gave me the best practice, as I didn't know about the mask parameter. Indeed, if I use it, I get the desired DataFrame.

I'm happy to proceed as above but would still like to know:

Why is the attribute behavior different? This looks like a bug; can anyone explain?
I'm attaching a proof of the difference in behavior below.

Final= (run_pipeline(  
    Pipeline(  
        columns={"market_Cap":QT_factors.MarketCap(),  
                 "return":QT_factors.DailyReturns()},  
        screen=Q_FF.IsPrimaryShare()  
        ),  
        start_date='2019-01-04',  
        end_date='2019-01-04')  
         ).sort_values(by='market_Cap',ascending=False).iloc[:500,:]

market_cap2= QT_factors.MarketCap()  

Final2= (run_pipeline(  
    Pipeline(  
        columns={"market_Cap":  market_cap2,  
                 "return":QT_factors.DailyReturns()},  
        screen=market_cap2.top(500, mask=Q_FF.IsPrimaryShare())  
        ),  
        start_date='2019-01-04',  
        end_date='2019-01-04')  
         )  
len(Final2)  
Final2.sort_values(by='market_Cap',ascending=False).head()

Final.head()  
assert(Final.equals(Final2.sort_values(by='market_Cap',ascending=False)))  
assert(Final.index.levels[1].equals(Final2.index.levels[1]))  
print(len(Final.index.levels[1]),len(Final2.index.levels[1]))

Jamie McCorriston

Jan 8, 2019

Clockworkss,

First off, welcome to Quantopian! It looks like you're off to a good start with your pipeline. Vladimir's suggestion to use masking is a good one. Masking allows you to achieve "I want the top 500 companies, only counting primary shares", as you stated earlier. Attached is a notebook with a pipeline that takes the top 500 assets in the QTradableStocksUS universe (a built-in filter) every day. I attached a plot at the end of the notebook to show that there are 500 securities per day. You can play around with the notebook and change the filter to IsPrimaryShare to see that the pipeline still outputs 500 assets per day.

Here are a couple of tips to help accelerate your learning curve on Q:
1. If you haven't seen it yet, the Getting Started Tutorial will give you a quick walkthrough of the various tools available on Quantopian. This is a great place to get started learning the platform.
2. Next, I'd highly recommend going through the Pipeline Tutorial. Lesson 7 in particular covers masking in more detail. Given the questions you are asking, I think your time will be well spent if you go through the whole tutorial.
3. When asking for help in the forums, attaching a notebook or backtest is the best way to share code. That way, others can simply click "Clone" to pick up where you left off and play around with things.

I hope this helps! Let me know if you have any further questions about the attached notebook.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Clockworkss

Jan 8, 2019

Thanks for the resources, I will definitely check them out! I am able to achieve my desired pipeline with the help I received from Vlad and yourself; I am now more trying to understand the behavior so that I know what can and can't be done.

Would you have any idea why calling .index.levels[1] returns the original index when using .iloc as described above?

Jamie McCorriston

Jan 8, 2019

The result of running a pipeline in research is a MultiIndex where level 0 is the date and level 1 is the asset. In the IDE, attached pipelines are run once per day, and the output (retrieved via the pipeline_output function each day) is just a regular Index (assets), since the date is implied to be the simulation date.
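To make the two output shapes concrete, here is a pandas-only sketch. The frames below merely imitate the shapes described above (they are not real pipeline output), and `asset_labels` is a hypothetical helper name.

```python
import pandas as pd


def asset_labels(frame):
    """Return the asset labels present in `frame` for either output shape."""
    if isinstance(frame.index, pd.MultiIndex):
        # Research-style output: level 0 is the date, level 1 is the asset.
        return frame.index.get_level_values(1).unique()
    # IDE-style output: the index already holds the assets.
    return frame.index


dates = pd.to_datetime(["2019-01-04", "2019-01-04"])

# Imitation of research output: (date, asset) MultiIndex.
research_like = pd.DataFrame(
    {"market_Cap": [2.0, 1.0]},
    index=pd.MultiIndex.from_arrays([dates, ["AAA", "BBB"]]),
)

# Imitation of IDE output: flat index of assets only.
ide_like = pd.DataFrame({"market_Cap": [2.0, 1.0]}, index=pd.Index(["AAA", "BBB"]))

print(list(asset_labels(research_like)), list(asset_labels(ide_like)))
print(hasattr(ide_like.index, "levels"))  # False: a flat Index has no levels
```

This is why index.levels[1] fails on the IDE-style frame: a flat Index has no `levels` attribute at all.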

Does this help?


Clockworkss

Jan 9, 2019

Hey all,

Thanks for all the help and advice so far. Apologies for not attaching a notebook; please find it herein.
Also, it seems my question was not clear. I understand how to achieve what I wanted; I have a functional notebook thanks to the mask filters suggested above.

My question is:
Why does the assertion below fail when the DataFrames are the same, as proven by the first assertion in the notebook?

assert(Final.index.levels[1].equals(Final2.index.levels[1]))  
print(len(Final.index.levels[1]),len(Final2.index.levels[1]))

Many thanks !

Jamie McCorriston

Jan 9, 2019

Hi Clockworkss,

Thanks for clarifying your question in the notebook - I now understand what you're asking.

<strikethrough> I'm far from an expert on pandas/numpy, but it looks like what's happening is the index of the DataFrame doesn't change when you take a subset with iloc. Once upon a time, one of our engineers was explaining another problem related to MultiIndex DataFrames to me and they pointed out that the Index object is an iterable of unique values that actually map the larger objects (in this case, Equity objects) to integers. The integers are used under the hood to represent the data and reduce the size of the structure. The Index kind of serves as a map from the user-facing index to the integer representation. My guess is that it's not worth it for the index in Final to be re-mapped to a new set of integers because all of the index data (the integers) would have to be re-written to reflect the new map. At a high level, the contents of the two DataFrames are the same, but the Index label mapping to integers are different.

Again, this is a bit of a guess, but I'll call it an educated guess! I'd be interested to hear from someone who knows more about the topic. </strikethrough>

EDIT - I asked one of our engineers who answered with this: "The index is actually different, but the levels are not. A MultiIndex is represented as an array of unique values along each level (levels) and an array of coordinates into those arrays (values I think). When taking a subset of the index, it doesn't alter the levels at all, only the coordinates. This is why you should use get_level_values to actually materialize a level."
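The behavior the engineer describes can be reproduced with pandas alone. The sketch below uses a made-up five-asset frame in place of the pipeline output. For reference, pandas exposes the "coordinates" as `MultiIndex.codes` (named `labels` before pandas 0.24), and `remove_unused_levels()` is the pandas method that rebuilds the levels from what is actually present.

```python
import pandas as pd

# Hypothetical stand-in for research pipeline output: one date, five assets.
dates = pd.to_datetime(["2019-01-04"] * 5)
index = pd.MultiIndex.from_arrays(
    [dates, ["A", "B", "C", "D", "E"]], names=["date", "asset"]
)
full = pd.DataFrame({"market_Cap": [10.0, 50.0, 30.0, 20.0, 40.0]}, index=index)

# Same pattern as in the question: sort, then slice with iloc.
top2 = full.sort_values(by="market_Cap", ascending=False).iloc[:2, :]

# `levels` still holds every unique value of the original index.
stale = top2.index.levels[1]

# `get_level_values` materializes only the labels actually present.
present = top2.index.get_level_values(1)

# `remove_unused_levels` returns a new MultiIndex with trimmed levels.
trimmed = top2.index.remove_unused_levels()

print(len(top2), len(stale), list(present), list(trimmed.levels[1]))
# 2 5 ['B', 'E'] ['B', 'E']
```

A plain `top2.copy()` keeps the stale levels too, which matches the observation earlier in the thread that copying the frame did not help.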

I hope this helps!


Clockworkss

Jan 10, 2019

Hi Jamie,

Thank you so much for the help and for following up with your engineering team on this! Really helpful!
Looking through, I can see that get_level_values does indeed work, and I can replicate this issue on a pandas-only frame. I'll use the answer they provided and do a bit of reading to understand the specifics, but that's it from me!

Thanks to all who helped, both with the best practice and with understanding why the method I employed did not work!

best,
