How to split context.stocks into two equal-sized arrays?
API

Hey,

I am choosing my stocks through a fundamentals query, which returns a pandas DataFrame containing, say, N stocks. I would like to split this into two equal-sized arrays or DataFrames, one for the long side and the other for the short side. How should I go about it for even as well as odd values of N?

20 responses

Yatharth,

If I'm understanding what you are trying to do (ranking and then going long/short on the ends), then there are really simple pandas methods to help. This is actually a pretty common and highly encouraged strategy, as it is often a low-beta strategy. The three big pandas methods you should look at are sort() or rank(), head(), and tail(). So here is a little sample:

# df will be a dataframe with columns of what we want to rank upon  
ranks = df.sort(["buy_score"])  
to_long = ranks.head(10)  
to_short = ranks.tail(10)

and now you have two equal sized groups of assets to long and to short.
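For readers on a current pandas release, here is a minimal sketch of the same idea. It assumes `sort_values`, which replaced the older `DataFrame.sort` used above; the tickers and scores are made up for illustration. Note that the sort is ascending, so `head` holds the lowest scores and `tail` the highest; which end goes long depends on what the score means.

```python
import pandas as pd

# Hypothetical tickers and scores, for illustration only.
df = pd.DataFrame(
    {"buy_score": [0.9, 0.1, 0.5, 0.7, 0.3, 0.2]},
    index=["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"],
)

k = 2  # size of each group (hypothetical)

ranks = df.sort_values("buy_score")  # ascending: lowest score first
lowest = ranks.head(k)               # the k rows with the lowest scores
highest = ranks.tail(k)              # the k rows with the highest scores
```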

In general, for any nonnegative integer n, you would write

to_long = ranks.head(n/2)  
to_short = ranks.tail(n-n/2)  

which uses the same integer division (discarding the remainder) that dates back to Fortran and C: if you take 7/2==3 for the head (top part), you need 7-7/2==4 for the tail (bottom part).
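The same arithmetic can be written with the explicit floor-division operator `//`, which discards the remainder in both Python 2 and Python 3. This is only a sketch; the helper name `split_sizes` is made up for illustration.

```python
def split_sizes(n):
    """Return (head_size, tail_size) for n items; the two always add up to n."""
    head_size = n // 2          # floor division: 7 // 2 == 3
    tail_size = n - head_size   # the remainder of the set: 7 - 3 == 4
    return head_size, tail_size
```

For example, `split_sizes(7)` gives `(3, 4)` and `split_sizes(8)` gives `(4, 4)`.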

@ Yatharth

@Andre makes a very important point about integer division, also called floor division. In Python there are two types of division: integer division and true division. Quantopian uses Python 2 for everything, and in Python 2 the / operator performs integer division when both operands are integers. If you find yourself using Python 3, the / operator always performs true division.

# Python 2.x  
>>> 1 / 3  
0

# Python 3.x  
>>> 1 / 3  
0.3333333333333333

@James - Good point: Python 3 sets the opposite pitfall, in that "regular" division of two integers produces a floating-point number. Is Quantopian considering moving to Python 3?

The "regular" pitfall (integer division in Python 2) leads to code like this:

context.stocks = symbols('SPY', 'TLT') # a list of Security objects  
number_of_stocks = len(context.stocks) # 2  
allocation_fraction = 1 / number_of_stocks # intended 1/2, but got 0 because of integer division  
print "allocation fraction: {}".format(allocation_fraction) # 0, not 0.5  
for stock in context.stocks:  
    order_target_percent(stock, allocation_fraction) # buys nothing  

Python 2 will be here for the foreseeable future. The fragmentation of Python into those two sects is sad, though I think it's better than ending up with a bloated language that is constantly trying too hard to be backwards compatible.

@ Yatharth
If you want true (float) division by default in Quantopian, put from __future__ import division at the top of your script and you are good to go.

The standard non-Pythonic workaround to the standard integer-division pitfall is to make sure one of the division operands is floating-point. In the example above, using 1.0 instead of 1 does it. Then the other operand is silently converted to floating-point, floating-point division is performed, and the right fraction of the portfolio value is allocated to each asset.

This wouldn't help Yatharth, because the head and tail methods of pandas.DataFrame require integer arguments.
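The two needs can be kept apart explicitly: a float for portfolio weights, an integer for row counts. The sketch below behaves the same under Python 2 and Python 3; the helper names `equal_weights` and `half_count` are made up for illustration and are not part of the Quantopian API.

```python
def equal_weights(tickers):
    """Map each ticker to the same float weight; an empty list gives {}."""
    n = len(tickers)
    if n == 0:
        return {}
    weight = 1.0 / n  # float operand forces floating-point division
    return {ticker: weight for ticker in tickers}


def half_count(tickers):
    """Return half the number of tickers as an integer, rounded down."""
    return len(tickers) // 2  # floor division always yields an integer here


weights = equal_weights(["SPY", "TLT"])
count = half_count(["SPY", "TLT", "GLD"])
```

The integer returned by `half_count` is the kind of value that can be passed to head and tail.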

Considering where this thread has gone, we probably stopped helping him a while ago ;)

If he doesn't understand it now, he will in time. I also hope some others will benefit.

@James
@Andre
Thanks a lot for the help. Yes, I do understand the difference between the two, and now it makes sense why people are using 0.99/len(context.stocks) to calculate weights rather than 1/len(context.stocks).

Also, the particular problem I am facing requires an integer rather than a floating-point number, so integer division works.

But I am facing yet another problem with the sort function. I get a KeyError on this line of code: ranks = context.stocks.sort(['market_cap'])
KeyError: 'market_cap'

You are getting a KeyError because, in the DataFrame returned by get_fundamentals, the securities are the columns. You can just transpose the DataFrame with context.stocks.T and then run the sort.
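A small stand-in DataFrame shows why the transpose is needed. This is a sketch with made-up data laid out as described (one column per security, one row per field); `pe_ratio` is a hypothetical second field, and `sort_values` is the current pandas replacement for the older `sort`.

```python
import pandas as pd

# Hypothetical data: columns are securities, rows are their properties.
fundamentals = pd.DataFrame(
    {"AAA": [300.0, 12.0], "BBB": [100.0, 8.0], "CCC": [200.0, 15.0]},
    index=["market_cap", "pe_ratio"],
)

# 'market_cap' is a row label here, not a column, so sorting by it
# as a column fails with a KeyError until the frame is transposed.
by_security = fundamentals.T  # one row per security, one column per field

ranks = by_security.sort_values("market_cap")  # smallest market cap first
```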

Yes, I was wondering if we needed to transpose. We do. The DataFrame returned by get_fundamentals has columns representing securities and rows representing their properties like market_cap, but the sort, head and tail methods rearrange and select rows. So we need

ranks = context.stocks.T.sort(['market_cap']) # transpose sorted  
n = len(context.stocks.columns) # number of stocks  
to_long = ranks.head(n/2)  # the smaller-cap half  
to_short = ranks.tail(n-n/2)  # the rest  

Correct, James?

So Andre, you don't need to do n - n/2 if you want them to be equal-sized, with each containing half of the securities. In that case, just call both head and tail with n/2.

Here is a valid way:

ranks = fundamental_df.T.sort(["market_cap"])  
num_to_get = len(ranks.index) / 2 # half the number of stocks, rounded down  
to_long = ranks.head(num_to_get)  # the smallest-cap stocks  
to_short = ranks.tail(num_to_get)  # the largest-cap stocks (for odd n, the middle one is left out)  

This will ensure that to_long and to_short will always be the same size.

@James - I think Yatharth meant splitting into two non-overlapping parts covering the given set of stocks and as close to equal size as possible. If he were OK with taking 3 to buy and 3 to short out of 7 (which is what your code would do), he wouldn't have asked about the code working for both even- and odd-sized sets, would he?

@Yatharth - What do you want to do if your fundamental_df has 7 stocks: buy the smallest 3, short the other 4? Or buy the smallest 3, short the largest 3, and ignore one?

Since I am hoping to work with a large data set, it won't matter much if I lose out on one stock. But I would definitely want to cover all the stocks in case of a smaller set. I guess that's easy now: I will just check if N is odd or even, and in case of odd I can take (N-1)/2 for long and ((N-1)/2)+1 for short, or vice versa.

@Yatharth - So you haven't decided which "half" of an odd n should be larger, the long one or the short one. (Myself, I would go more long than short, as reflected in the code below. Anybody else?). But at least we know that you don't want to ignore any, so your "halves" should add up to n. So:

ranks = context.stocks.T.sort(['market_cap']) # transpose sorted  
n = len(context.stocks.columns) # number of stocks  
n_half_or_less = n/2  
n_half_or_more = n-n_half_or_less  
to_long = ranks.head(n_half_or_more)  # the smaller-cap half  
to_short = ranks.tail(n_half_or_less)  # the rest  
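The same split can be sketched on a plain Python list that is already sorted by market cap, smallest first. The function name `split_long_short` and its `long_gets_extra` flag are made up for illustration. The sketch slices at a single cut point rather than counting from the end, because a list slice such as `items[-0:]` returns the whole list instead of an empty one.

```python
def split_long_short(sorted_items, long_gets_extra=True):
    """Split a sorted list into two non-overlapping parts that cover it.

    For odd lengths, the long side gets the extra item by default.
    """
    n = len(sorted_items)
    n_half_or_less = n // 2
    n_half_or_more = n - n_half_or_less
    n_long = n_half_or_more if long_gets_extra else n_half_or_less
    to_long = sorted_items[:n_long]    # the smaller-cap part
    to_short = sorted_items[n_long:]   # everything else
    return to_long, to_short
```

With seven items, `split_long_short(list("ABCDEFG"))` puts four on the long side and three on the short side.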

And how would you check whether a given integer is odd or even? Pascal has a built-in function named odd, which takes one integer argument and returns true if it's odd and false if it's even. In C and other languages, including Python, which allow an integer expression where a logical truth value is needed (with an implicit conversion of 0 to False and anything nonzero to True), we can use bitwise arithmetic and define

def odd(n):  
    """Returns 1 if n is odd, 0 if even; can be used in logical expressions as if it returned True and False, respectively."""  
    return n & 1  

With the above definition of odd, we can then write

ranks = context.stocks.T.sort(['market_cap']) # transpose sorted  
n = len(context.stocks.columns) # number of stocks  
if odd(n):  
    n_long = (n-1)/2  
    n_short = (n+1)/2   # or vice versa  
else:  # even n  
    n_long = n_short = n/2  
to_long = ranks.head(n_long)  # the smaller-cap half  
to_short = ranks.tail(n_short)  # the rest  

That's great, Andre; this makes the code a lot more efficient/robust.

I don't know if the code with the if odd(n) statement is more efficient or more robust - probably neither.

First, you now have two branches you have to worry about, and twice as many statements you can introduce bugs into. Also, if you decide to calculate n_long and n_short differently, you can then unintentionally write code that behaves differently for odd and even n.

Second, code with branches may run more slowly if it's too large and doesn't all fit in the cache.

I prefer to write code, as in my example above, that works in both cases. It costs more time to write it, and something like a proof of correctness, but I have what it takes, and once I have the code, I can use it and rely on it without further ado. Shorter code is also easier to copy and paste.
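Short of a proof, the agreement between the two approaches can be checked exhaustively over a range of n. The sketch below uses made-up function names and gives the short side the larger part for odd n, matching the branching example above; the range of 1000 is arbitrary.

```python
def sizes_branch_free(n):
    """(n_long, n_short) without branching; short side is larger for odd n."""
    n_long = n // 2
    n_short = n - n_long
    return n_long, n_short


def sizes_with_branch(n):
    """(n_long, n_short) using an explicit odd/even test."""
    if n & 1:  # odd n
        return (n - 1) // 2, (n + 1) // 2
    return n // 2, n // 2  # even n


# Collect every n on which the two versions disagree.
mismatches = [
    n for n in range(1000)
    if sizes_branch_free(n) != sizes_with_branch(n)
]
```

An empty `mismatches` list means the two versions agree on every n that was tried.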