Pipeline API vs get_fundamentals discrepancy

edited Oct 18, 2015

I tried to port my algorithm to the Pipeline API with no success, so I created this test-case algorithm that uses Pipeline to select, every day, the securities with valuation.market_cap > 1e8 and valuation.market_cap <= 1e10.

This test case shows that the Pipeline API returns different stocks compared to get_fundamentals. It's probably my mistake, but I cannot figure out the problem.

8 responses

Dan Dunn

Oct 7, 2015

I think the problem is in line 22. You used filter1 and filter2 instead of filter1 & filter2. Line 22 is now:

    mktcap_filter = (mktcap > 1e8) & (mktcap <= 1e10)

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Luca

Oct 7, 2015

Thanks for your help.
But why is there still a (very small) difference between the stocks returned by Pipeline and those returned by get_fundamentals?

Scott Sanderson

Oct 7, 2015

@Luca I think the issue here is that you're writing your filter as

mktcap_filter = (mktcap > 1e8 and mktcap <= 1e10)

The correct way to write this expression is:

((mktcap > 1e8) & (mktcap <= 1e10))

Note: The parentheses around the subexpressions are necessary because the & operator binds more tightly than comparison operators such as > and <=.
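To see how Python groups the expression with and without the parentheses, here is a small sketch using the standard-library ast module. The name mktcap is only a placeholder inside the parsed string; nothing from Pipeline is needed to run it.

```python
import ast

# Without parentheses, & binds tighter than the comparisons, so this parses
# as a chained comparison: mktcap > (1e8 & mktcap) <= 1e10
unparenthesized = ast.parse("mktcap > 1e8 & mktcap <= 1e10", mode="eval").body

# With parentheses, the top-level node is the & operation itself.
parenthesized = ast.parse("(mktcap > 1e8) & (mktcap <= 1e10)", mode="eval").body

print(type(unparenthesized).__name__)  # Compare
print(type(parenthesized).__name__)    # BinOp
```

In the unparenthesized form the & ends up buried inside the comparison, which is not the filter you meant to build.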

You almost never want to use the word and in Python when operating with values that aren't booleans (i.e. True and False). The and operator in Python means "use the thing on the left if it's 'falsey', otherwise use the thing on the right". For example:

This happens to do what you expect/want when the values are booleans:

In [26]: True and False  
Out[26]: False

In [27]: False and False  
Out[27]: False

In [28]: False and True  
Out[28]: False

In [29]: True and True  
Out[29]: True

but it has somewhat confusing behavior if you're working with objects other than booleans:

In [24]: [] and 3  
Out[24]: []

In [25]: [1] and 3  
Out[25]: 3
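The rule can also be written out as a plain function. This is only a sketch of the semantics (the name python_and is made up here), and unlike the real operator it cannot skip evaluating the right-hand side, because both arguments are already computed by the time the function is called.

```python
def python_and(left, right):
    """Mimic the result of `left and right` for already-computed values."""
    # A falsey left operand is returned unchanged; otherwise the right one wins.
    return right if left else left


print(python_and([], 3))    # []
print(python_and([1], 3))   # 3
```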

Filters, like most Python objects, are truthy, which means that the expression (mktcap > 1e8) and (mktcap <= 1e10) evaluates to just (mktcap <= 1e10), throwing away your first condition.

Some Examples:

In [43]: f = SomeFactor()

In [44]: f  
Out[44]: SomeFactor((USEquityPricing.close::float64,), window_length=1)

In [45]: f > 10  # The comparison operators create new "NumExprFilter" objects.  
Out[45]: NumExprFilter(expr='x_0 > (10)', bindings={'x_0': SomeFactor((USEquityPricing.close::float64,), window_length=1)})

In [46]: ((f > 10) and (f < 11))  # Only the right-hand filter (f < 11) survives; (f > 10) is thrown away.  
Out[46]: NumExprFilter(expr='x_0 < (11)', bindings={'x_0': SomeFactor((USEquityPricing.close::float64,), window_length=1)})

In [47]: ((f > 10) and (f < 11)) is (f < 11)  # In fact, it's actually the same object, because the result of `(f < 11)` is cached.  
Out[47]: True

In [48]: ((f > 10) & (f < 11))  # Using the & operator yields the expression we actually want  
Out[48]: NumExprFilter(expr='(x_0 > (10)) & (x_0 < (11))', bindings={'x_0': SomeFactor((USEquityPricing.close::float64,), window_length=1)})

In [49]: ((f > 10) & (f > 10)) is (f > 10)  
Out[49]: False
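If you don't have the Pipeline classes at hand, the same behavior can be reproduced with a toy stand-in. The ToyFactor and ToyFilter classes below are hypothetical and only imitate the operator overloading; they are not the real Pipeline implementation.

```python
class ToyFilter:
    """Hypothetical stand-in for a Pipeline filter (not the real class)."""

    def __init__(self, expr):
        self.expr = expr

    def __and__(self, other):
        # The & operator is dispatched here, so both conditions are kept.
        return ToyFilter("({}) & ({})".format(self.expr, other.expr))

    def __repr__(self):
        return "ToyFilter({!r})".format(self.expr)


class ToyFactor:
    """Hypothetical stand-in for a Pipeline factor."""

    def __gt__(self, value):
        return ToyFilter("x > {}".format(value))

    def __lt__(self, value):
        return ToyFilter("x < {}".format(value))


f = ToyFactor()
with_and = (f > 10) and (f < 11)  # objects are truthy, so the right operand is returned
with_amp = (f > 10) & (f < 11)    # calls ToyFilter.__and__

print(with_and)  # ToyFilter('x < 11')
print(with_amp)  # ToyFilter('(x > 10) & (x < 11)')
```

Python offers no method for a class to override the and keyword, which is why libraries that build expressions out of operators rely on & instead.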

Advanced Section:
Another interesting way to see the difference between & and and is to look at the bytecode that the Python interpreter generates for these expressions:

In [58]: from dis import dis  # dis is the disassembly module.  It's not available on Quantopian, but you can try this out locally in a shell.

In [59]: dis(compile("(f > 10) and (f < 11) ", "<str>", "eval"))  
  1           0 LOAD_NAME                0 (f)  
              3 LOAD_CONST               0 (10)  
              6 COMPARE_OP               4 (>)  
              9 JUMP_IF_FALSE_OR_POP    21  
             12 LOAD_NAME                0 (f)  
             15 LOAD_CONST               1 (11)  
             18 COMPARE_OP               0 (<)  
        >>   21 RETURN_VALUE

This representation of the bytecode is a little confusing at first, but it's actually pretty readable once you get used to it. The first three instructions say "Load the name 'f', then load the constant 10, then compare them with the > operator, and push the result onto a stack". The compiler then emits a conditional jump instruction, JUMP_IF_FALSE_OR_POP, which says "Jump to the return statement if the left-hand value is falsey, otherwise throw away the left-hand value and compute the right-hand value". We then either go straight to the return, or we go to the instructions for f < 11, which then slide straight into the return.

We can compare this to the instructions generated when using &:

In [60]: dis(compile("(f > 10) & (f < 11) ", "<str>", "eval"))  
  1           0 LOAD_NAME                0 (f)  
              3 LOAD_CONST               0 (10)  
              6 COMPARE_OP               4 (>)  
              9 LOAD_NAME                0 (f)  
             12 LOAD_CONST               1 (11)  
             15 COMPARE_OP               0 (<)  
             18 BINARY_AND  
             19 RETURN_VALUE

You'll immediately notice that there are no JUMP instructions here. We unconditionally evaluate both the left-hand and right-hand sides and return the value produced by the BINARY_AND instruction, which is the opcode corresponding to the & operator.
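The listings above come from the Python version in use at the time, and the exact opcode names and offsets differ in other versions. A version-tolerant way to check the same point, sketched here for Python 3 with the helper name opnames made up for illustration, is to look only for the presence of a jump instruction:

```python
import dis


def opnames(source):
    """Return the opcode names the interpreter emits for an expression."""
    code = compile(source, "<str>", "eval")
    return [instruction.opname for instruction in dis.get_instructions(code)]


and_ops = opnames("(f > 10) and (f < 11)")
amp_ops = opnames("(f > 10) & (f < 11)")

# Opcode names vary between Python versions, so only look for jumps.
print(any("JUMP" in name for name in and_ops))  # True
print(any("JUMP" in name for name in amp_ops))  # False
```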

End Advanced Section

Hope that helps!
-Scott

Luca

Oct 7, 2015

Thanks Scott for the explanation, I was aware of that but I guess I am too tired and my mind refuses to work properly ;)

But even after I changed the code with your (and Dan's) fix, I can see that the Pipeline API is returning different results from get_fundamentals, even though (in my understanding) they should return the same stocks, as the filter applied is the same.

Please have a look at the algorithm fixed by Dan: you can see a difference of about 8 to 10 stocks every day between Pipeline and get_fundamentals. Also, if you run the backtest a little further, for the whole month of March, you can see this difference reach ~60 stocks.

Why is that?

Scott Sanderson

Oct 7, 2015

@Luca my guess is that the discrepancy here is due to the fact that the Pipeline API code will only return assets on days after an asset has traded at least once in our database, whereas the SQLAlchemy-based fundamentals API is based solely on the data provided by Morningstar. So if Morningstar provides entries for an asset before it actually starts trading according to our pricing data, there will be a discrepancy like the one you're seeing.

Scott Sanderson

Oct 7, 2015

"Thanks Scott for the explanation, I was aware of that but I guess I am too tired and my mind refuses to work properly ;)"

No worries. You're not the first person to run into this issue, and the difference between & and and is fairly subtle, so I wanted to provide a complete answer to help others who might hit this problem as well.

Scott Sanderson

Oct 7, 2015

Actually, looking at this again, it looks like you're getting more values out of the Pipeline API than out of the fundamentals API, so I don't have a good answer yet for what's causing the discrepancy here.

Luca

Oct 14, 2015

@Scott

I was too curious about why we get this discrepancy, so I created another backtest that does the following:

Check that the first 5 sids returned from Pipeline are the same as the first 5 sids returned from get_fundamentals. If so, we are fine.
If they differ, I keep looking for the Pipeline sids among the following sids (after the first 5) returned from get_fundamentals. This is to take into account what Scott said: "Pipeline API code will only return assets on days after an asset has traded at least once in our database". If they are not found, then it's a discrepancy that shouldn't be there.
If the previous check is OK, I also check that the differing sids found in the first 5 results of get_fundamentals are not present at all in the Pipeline sids.

This new version shows that Pipeline still behaves differently than get_fundamentals.
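A simpler way to look at the same question is to compare the two results as sets instead of position by position. The following is a minimal sketch, assuming each API's output has already been reduced to a list of integer sids; the function name compare_sids and the sample sids are made up for illustration and are not taken from the backtest.

```python
def compare_sids(pipeline_sids, fundamentals_sids):
    """Split two collections of sids into shared and one-sided groups."""
    pipeline_set = set(pipeline_sids)
    fundamentals_set = set(fundamentals_sids)
    return {
        "both": sorted(pipeline_set & fundamentals_set),
        "pipeline_only": sorted(pipeline_set - fundamentals_set),
        "fundamentals_only": sorted(fundamentals_set - pipeline_set),
    }


# Made-up sids, purely for illustration.
report = compare_sids([2, 24, 62, 67], [2, 24, 53, 62])
print(report["pipeline_only"])      # [67]
print(report["fundamentals_only"])  # [53]
```

Logging the two one-sided lists each day would show directly which securities appear in only one of the two results.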
