Faster Fundamental Data

Today, we launched an improved version of Morningstar fundamental data in Pipeline. The new implementation is faster and corrects many data issues that existed in the old system.

Fundamental queries in Pipeline are 2-20x faster. The biggest improvements will be noticed in fields that update less often (monthly, quarterly, etc.) and in queries that use many different fields.

Queries will also be more memory efficient and more consistent in how long they take to complete. There are some queries that are not faster yet, but we are actively working on improving these. Most notably, the Q1500 and Q500 are slower in the new implementation; these should be much faster soon.

In addition to performance improvements, there are changes to the underlying data. The new implementation includes a large number of corrections across many assets and many fields.

How can you use the new implementation?

The new version is accessible in Pipeline via a new namespace. The following example demonstrates how to get the operating_income field in both the new and old formats:

# New way: Fundamentals.my_field:  
from quantopian.pipeline.data import Fundamentals  
operating_income = Fundamentals.operating_income.latest

# Old way: morningstar.category.my_field  
from quantopian.pipeline.data import morningstar  
operating_income = morningstar.income_statement.operating_income.latest  

Built-in pipeline terms that use fundamental data also have a new module:

# New way: classifiers.fundamentals  
from quantopian.pipeline.classifiers.fundamentals import Sector  
sector = Sector()

# Old way: classifiers.morningstar  
from quantopian.pipeline.classifiers.morningstar import Sector  
sector = Sector()  
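
For reference, here is a minimal sketch (not taken from the attached notebook) of how these pieces can fit together in a single pipeline with the new namespace; the column names are arbitrary:

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import Fundamentals
from quantopian.pipeline.classifiers.fundamentals import Sector

def make_pipeline():
    # Latest reported operating income and Morningstar sector code per asset.
    return Pipeline(columns={
        'operating_income': Fundamentals.operating_income.latest,
        'sector': Sector(),
    })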

The attached notebook has more examples that use the new namespace. The notebook also includes examples of the performance and data correctness improvements that come with the new system.

What does this mean for you?

Until the end of September, both the new and old implementations of Morningstar fundamental data will be available in Pipeline. Over the next month, you should compare the old and new data. Try running your research notebooks and algorithms on the new data to understand the impact that any changes in the data might have on your work.
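
For example, a minimal sketch of such a comparison in research (the column names and date range here are arbitrary, not from the attached notebook):

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import Fundamentals, morningstar
from quantopian.research import run_pipeline

pipe = Pipeline(columns={
    'oi_new': Fundamentals.operating_income.latest,
    'oi_old': morningstar.income_statement.operating_income.latest,
})
result = run_pipeline(pipe, '2016-01-04', '2016-06-30')

# Rows where the two implementations disagree (note: NaN vs. NaN also counts
# as a difference with this simple comparison).
print((result['oi_new'] != result['oi_old']).sum())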

At the end of September, the old namespace will be deprecated and redirected to point at the new data. If you don’t manually update your notebooks or algorithms to use the new namespace, they will automatically start to use the new data. Of course, it is advisable to test the impact ahead of time.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

125 responses

Hi Jamie,

With regard to "The new implementation includes a large number of corrections across many assets and many fields.", is there a change log that can be made public or included in this thread? Some Morningstar fundamental data (for instance, peg_ratio) wasn't available all the way back to 2003. Has that missing data been corrected?

I am very impressed with the speedup numbers that you reported.

Best Regards,
Leo

Thanks for the NB Jamie, very neat. 2 questions for you :)

1 - Is it safe to mix old Q1500US and new fundamentals to have the best performance?

2 - Out of curiosity, why did you change the hierarchy inside the fundamentals (i.e. Fundamentals.operating_income vs morningstar.income_statement.operating_income)? It's now nicer to type, but it's harder to switch between the old and new versions. If you had kept the same hierarchy, one could simply change the import statement to switch between the old and new ways.

Thanks a lot for the faster API, Jamie!
And thanks also for having already migrated my Piotroski score implementation.

I've some questions:

  1. Now that the API is faster, it's possible to use a longer window_length without hitting a timeout. Will it remain the only method to use fundamentals for past quarters or a different timeframe?

  2. I already posted the following two questions by e-mail, but I think they're relevant for everyone.
    Do the data corrections also apply to the old get_fundamentals() API?

  3. What about the future of this old API? Is it confirmed that it will be deprecated?

Thank you all for the positive feedback! I am going to try to answer your questions as they appear in this thread.

Leo M.

Some Morningstar fundamental data (for instance, peg_ratio) wasn't available all the way back to 2003. Has that missing data been corrected?

Unfortunately we still do not have a long history for peg_ratio. We have only changed how we process the data that Morningstar provides, and they do not provide the history for this field.

Luca

Is it safe to mix old Q1500US and new fundamentals to have the best performance?

It is safe to mix any of the terms from the old API with the new API. For now, using the old Q1500US will be faster but you may want to test that your algorithm behaves okay with the new data. We should have the cache ready for the new Q1500US soon which will make it just as fast as before.

why did you change the hierarchy inside the fundamentals?

We discussed this change a lot, and we decided that the flat namespace makes it easier for new users to find fields and for experienced users to find less commonly used fields. When you are in the notebook and type morningstar., you only see the sub-namespaces, not the names of the fields you are looking for. Now, you no longer need to search all 13 sub-namespaces to find the one field you are looking for. I hadn't considered that it makes it much more tedious to port your algorithms to the new API. If you have Python installed locally, you can automatically change all of your morningstar.subnamespace.field references into Fundamentals.field by copying your algorithm to a local file and running this on the command line:

python -c "import re, sys; print(re.sub(r'morningstar\.\w+\.(\w+)', r'Fundamentals.\1', open(sys.argv[1]).read()))" path/to/algorithm/file  

Be sure to replace path/to/algorithm/file with the actual path to your algorithm file. This command uses a regular expression to replace all occurrences of morningstar.subnamespace.field with Fundamentals.field and then prints the result. I realize this is not an ideal situation, but hopefully it makes it easier to switch over.

Costantino

Now that the API is faster, it's possible to use a longer window_length without hitting a timeout?

This is correct, you should be able to use much larger windows across many fields in your backtests and in research.

Will it remain the only method to use fundamentals for past quarters or a different timeframe?

For now we have not planned any API changes around longer lookbacks, though it is something we have talked about. We would need to spend some more time thinking about what that API would look like.

Do the data corrections also apply to the old get_fundamentals() API?

We have not updated the get_fundamentals() API. We will be removing this API in around a month. The main reason for keeping this after pipeline was introduced was that it was faster for some use cases. We believe that we have addressed that with this performance improvement to the pipeline API.

What about the future of this old API? Is it confirmed that it will be deprecated?

As I mentioned above, the get_fundamentals() function will be removed. The old morningstar.namespace.field API will switch over to the new system which will inherit the new performance improvements and data corrections. The reason we have made this "opt-in" for now was to give the community time to evaluate the new data and help us find any issues before we made the switch for everyone.

Thanks again, I am happy to answer any more questions regarding this change!


@Joe Jevnik, thank you for the detailed reply. One more question, if you don't mind: do default_us_equity_universe_mask or make_us_equity_universe have internal caching implemented the same way as Q500US/Q1500US do? I usually like to set a variable universe size via make_us_equity_universe, but if Q500US/Q1500US are much faster, then I will try to stick to them.

The three currently cached terms are: Q500US(), Q1500US(), and default_us_equity_mask(). If any of these terms get used in a pipeline, the results will be read from a pre-computed file instead of computed on the fly. The reason we can't cache make_us_equity_universe is that each invocation returns a new term. We would need to pre-compute the results for all possible inputs which is not possible. I should also note that the caching does not work if you pass a custom minimum market cap to any of the cached terms because we have only pre-computed the results for the default values.

This isn't to say that make_us_equity_universe cannot be used; in fact, it is now much faster than before. The same is true of Q(1)500US and default_us_equity_mask with non-default minimum market cap values. Hopefully this helped explain how the cache works!
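
As an illustration (a sketch, not taken from Joe's post), a cached universe term used as a pipeline screen looks like any other filter; only the default-argument call is served from the pre-computed file:

from quantopian.pipeline import Pipeline
from quantopian.pipeline.filters.morningstar import Q1500US

# Default arguments, so the results come from the pre-computed cache.
pipe = Pipeline(screen=Q1500US())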

Thanks a lot, it makes sense.

I guess this is the easiest way to switch between old and new implementation then:

# Right at the top of your algo/NB  
new_way = True

if new_way:  
    # New way: every old sub-namespace alias points at the same flat
    # Fundamentals dataset, so the rest of the code can stay unchanged.
    from quantopian.pipeline.data import Fundamentals as income_statement  
    from quantopian.pipeline.data import Fundamentals as balance_sheet  
    from quantopian.pipeline.data import Fundamentals as valuation_ratios  
    from quantopian.pipeline.data import Fundamentals as operation_ratios  
    from quantopian.pipeline.data import Fundamentals as valuation  
    from quantopian.pipeline.classifiers.fundamentals import Sector  
    # from quantopian.pipeline.filters.fundamentals import Q1500US, Q500US  # not cached yet  
    from quantopian.pipeline.filters.morningstar import Q1500US, Q500US  
else:  
    # Old way: morningstar.category.my_field  
    from quantopian.pipeline.data.morningstar import income_statement  
    from quantopian.pipeline.data.morningstar import balance_sheet  
    from quantopian.pipeline.data.morningstar import valuation_ratios  
    from quantopian.pipeline.data.morningstar import operation_ratios  
    from quantopian.pipeline.data.morningstar import valuation  
    from quantopian.pipeline.classifiers.morningstar import Sector  
    from quantopian.pipeline.filters.morningstar import Q1500US, Q500US

Hi Joe and Jamie,

I'm unable to import some of Morningstar's grade data when using the new query syntax -- check out the examples below.

Please let me know how I should be adjusting my code or if there's something wrong on the backend.

Thanks!

James

This is wonderful! I found that many algorithm ideas I had attempted could not run due to memory and execution time timeouts.

Question though: is there anything in the works to make it a bit less awkward to work with quarterly and other infrequently updating data? Right now the best method I know of is creating a huge window length and then looking back 252/4 + 1 days to get the last quarterly update. A massive data structure is created when I only need 1/64th of those values. I had an idea to do the lookbacks in chunks upfront that wouldn't violate limits, then populate them down into my own much smaller structure stored in the context, but I never got around to it. It seems like a problem that anyone looking to implement a fundamentals-based algorithm with lookbacks is going to encounter (I have commented on a few posts to help when users first try to figure out how to do this).

First of all, my compliments to Jamie, Joe, and the other Q developers for this big improvement. I no longer get any timeout errors, even in algos that load 2 or more years of data, like the one described here:
https://www.quantopian.com/posts/error-in-fundamental-data#56fbfe939426be39440000a4

Nevertheless I completely agree with Kevin's comment. It's exactly the reason why I posted the question above ("Will it remain the only method to use fundamentals for past quarters or a different timeframe?"), and at least for the moment, the answer is yes.

Some time ago I proposed adding the possibility to pass only the needed data indexes instead of the complete window length:
https://www.quantopian.com/posts/pipeline-api-feature-request-retrive-single-data-points-instead-of-a-full-array-of-data-window-length

Fundamental data is now much faster, but if you have to load two or more years of data in a backtest, it still takes a while. Maybe a similar change could make fundamentals and other quarterly or infrequently updating data even faster!

basic_eps and dividend_rate just for starters are missing.

Can we please get access to sklearn.neural_network and/or update sklearn?
I noticed you downgraded it from 18 to 16 for whatever reason.

Jamie, Joe, big thank you for the new API!

The improvement in speed is huge! Something that took 170 seconds in research now takes only 6 seconds!

But I noticed something weird:

To get quarterly data, I used lookbacks with lengths of 1, 65, 130 and 195 days.
With the new API these lengths gave me some duplicate values for the quarters.
I had to change them to something like 1, 90, 160 and 230 to get the same data as with the old API. Why is that?

Is anyone else experiencing the same? Be warned, I should say!

So far, I have noticed some small changes in Enterprise Value. Those changes decreased my best return by a third :-(.

@Jamie, I have to concur with Donny that backtest results are not identical when doing a pure text switch from (old format) morningstar.* to (new format) Fundamentals.*, likely for the same reason Donny pointed out: the offsets may be different now. Could you please run some data consistency checks on past values when pulling with the old and new formats? I can provide logs if necessary; please let me know.

Regards,
Leo

@Leo M: The backtest results are different because the new system has many data corrections. In my case the returns decreased by roughly a half.
It would be interesting to know what exactly the nature of the corrections was, and whether other users experienced an increase in performance.

@Donny: I also use the indexes [-1, -65, -130, -195] for TTM data. I tested some symbols for operating_income but got no duplicates (see the attached notebook). Could you please provide some examples?

To simplify the use of TTM (Trailing Twelve Months) data, I created the following utility classes:

# Imports needed for this snippet (not shown in the original post).
import numpy as np
from quantopian.pipeline import Pipeline, CustomFactor
from quantopian.pipeline.data import Fundamentals
from quantopian.pipeline.filters.morningstar import Q1500US


def make_pipeline(context):  
    quarter_length = 65  # approximate number of trading days per quarter  


    class TTM(CustomFactor):  
        """  
        Trailing Twelve Months (TTM) current year  
        """  
        window_length = 3 * quarter_length + 1  
        def compute(self, today, assets, out, data):  
            ttm = np.array([-1, -quarter_length, -2 * quarter_length, -3 * quarter_length])  
            out[:] = np.nansum(data[ttm], axis=0)


    class TTM_PY(CustomFactor):  
        """  
        Trailing Twelve Months (TTM) previous_year (PY)  
        """  
        window_length = 7 * quarter_length + 1  
        def compute(self, today, assets, out, data):  
            ttm_py = np.array([-4 * quarter_length, -5 * quarter_length, -6 * quarter_length, -7 * quarter_length])  
            out[:] = np.nansum(data[ttm_py], axis=0)  


    return Pipeline(  
        columns = {  
            'ps_ratio_ttm': Fundamentals.market_cap.latest / TTM(inputs=[Fundamentals.total_revenue]),  
            'ps_ratio_ttm_py': Fundamentals.market_cap.latest / TTM_PY(inputs=[Fundamentals.total_revenue])  
        },  
        screen = Q1500US()  
    )  

I'd recommend we follow an exhaustive testing procedure of pulling each fundamental at various intervals using the old and new formats and making sure they are the same. If we correct data and change the data retrieval in the same release, it will be difficult to debug why performance is different, and it could cause large-scale disruptions to algos across the board. Hence I recommend releasing it in phases. Phase 1: change the method. Phase 2: correct the data.

If the performance of algos changes, it will be a difficult proposition, as we will have to change the algos and lose the X months of out-of-sample record built with the old format. Hence I recommend we don't switch morningstar.* to Fundamentals.*.

Instead, if users want to make use of the new Fundamentals.* format, it would be better to start from scratch and write a new algo, thus preserving the performance and out-of-sample record of the old algo with the existing data and morningstar.* format.

Redirecting morningstar.* to Fundamentals.* when it changes the performance of existing algos will be difficult to deal with, as opposed to a change that only increases speed but returns the same data and preserves algo performance.

I think there is a larger issue with the evaluation process: capital may have been allocated using performance and tearsheets based on morningstar.*, assuming performance X. If morningstar.* is then redirected to Fundamentals.*, the actual performance may be Y (and maybe not worthy of allocation), or the algo may produce unexpected results after allocation compared to what was predicted during evaluation (X from morningstar.*).

@Costantino

I have added an example of calculations with EBIT.

Am I overlooking or missing something here?

Hi Donny,

I think the problem is the date '2003-01-01'. There is no data before that date, and instead of returning NaNs, the missing values are filled with the only available one.
Try with '2016-01-01' and there are no duplicates.

Anyway, there is a problem when the window_length is long.
I added the quarters from the previous year:

EBIT_Q1_PY = Previous(inputs = [Fundamentals.total_revenue], window_length = 260)  
EBIT_Q2_PY = Previous(inputs = [Fundamentals.total_revenue], window_length = 325)  
EBIT_Q3_PY = Previous(inputs = [Fundamentals.total_revenue], window_length = 390)  
EBIT_Q4_PY = Previous(inputs = [Fundamentals.total_revenue], window_length = 455)  

and compared the values for AAPL with Gurufocus (which also uses Morningstar as its source).
All values are the same, except for window_length = 455 (EBIT_Q4_PY):

Total Revenue

            Sep15   Dec15   Mar16   Jun16   Sep16   Dec16   Mar17   Jun17
Gurufocus:  51501   75872   50557   42358   46852   78351   52896   45408
Quantopian:  6789   75872   50557   42358   46852   78351   52896   45408

6789 vs 51501 must be an error. The same problem occurs for the operating revenue.

Notebook with the computation described above attached.
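
(For readers without the notebook: the Previous factor used above is defined there. A minimal sketch of such a factor, which simply returns the oldest value in its lookback window, i.e. the value from window_length - 1 trading days ago, might look like this:)

from quantopian.pipeline import CustomFactor

class Previous(CustomFactor):
    def compute(self, today, assets, out, data):
        # data has shape (window_length, num_assets); row 0 is the oldest day.
        out[:] = data[0]

# Example usage:
# EBIT_Q1_PY = Previous(inputs=[Fundamentals.total_revenue], window_length=260)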

I added window_length = 520, and the result is surprising: window_length = 455 is now correctly 51501, but window_length = 520 is 6789!
There is something weird with the new data retrieval mechanism... at least in research; I don't know about algo backtesting.

@Jamie, @Joe, could you please take a look at this? It seems to be a major issue.

Hi @Costantino, thanks for posting an example highlighting the discrepancy. Hoping the difference in backtest performance can be explained by the offset issue that you and Donny have encountered, and that fixing it will bring the backtest performance back to what it was before. Then we can isolate the effect of the data corrections on our algorithms.

-Leo

I investigated the issue further, and the problem is once again the date interval!
If you run the pipeline with result = run_pipeline(pipe, '2014-08-29', '2017-08-29') instead of result = run_pipeline(pipe, '2017-08-29', '2017-08-29'), everything is fine! See the new attached notebook.
I think algorithm backtesting is not affected by this behaviour.

Conclusion: There are no discrepancies in the data, and the indexes [-1, -65, -130, -195] or [-260, -325, -390, -455] are okay for TTM or TTM previous-year data.

Anyway this confusion demonstrates again that we need a better method to access past quarterly data as pointed out by Kevin 2 days ago (https://www.quantopian.com/posts/faster-fundamental-data#59a429c2675043000da06ac3)

@James O'Brien: Thanks for finding this! There is currently a problem with a set of categorical fields, including profitability grade. The full list of affected fields is:

  • financial_health_grade
  • financial_growth_grade
  • profitability_grade
  • company_status
  • industry_template_code
  • share_class_status

For now, please continue fetching these fields through the old API. I will post an update when the fix has been deployed.

@Costantino: I agree that the current API makes it cumbersome to work with quarterly or lower frequency data. While it may seem like the current API is inefficient, due to the way pipeline is implemented this is not the case.

In algorithms, pipelines are computed in 6 month chunks. This means that every 6 months we fetch all data and perform all of the computations needed to produce the next 6 months of output values. One of the biggest costs in Pipeline is retrieving the input data so we try to read data in large batches. Querying for data in large batches reduces the number of times we need to go to the disk or database, which has a high constant cost regardless of the amount of data being read.

Imagine that we had an API that presented the user with the current value and the trailing quarter's value for some field. On the first day of the computation, when today = 2016-03-01, we would need to have read exactly two rows: the current day's value (2016-03-01) and one quarter ago's value (2016-01-01). The table on the left shows the raw source data, and the table on the right shows the slice of data that is presented in a custom factor.

On the second day of computation, when today = 2016-03-02, we would need to read two more rows. This means we have read four rows in total: 2016-01-01 and 2016-03-01 from the previous computation and 2016-01-02 and 2016-03-02 from the current day's computation. Again, the table on the left shows the raw source data and the table on the right shows the slice of data that is presented in a custom factor.

If you repeat this process for at least one quarter, we will have read every row from 2016-01-01 to 2016-03-01. Because we know that all of these values will be used eventually, it is more efficient to query for them as one dense block. In theory, we could hold only the two rows in memory at a time; however, the time cost of reading the data a few rows at a time would make this infeasible.

In research, users may choose to run smaller windows which would not require every value from 2016-01-01 to 2016-03-01. The optimization of querying in a dense block still holds because it is more efficient to read a contiguous block of data than to do random access. We would spend more time determining which rows to filter than we would just reading all of the rows. Even if we could build an indexing scheme to more efficiently read these non-contiguous regions, the absolute time saved when querying for less than one quarter of data would be fractions of a second, and the ram cost of a few rows is negligible.

Hopefully this helps explain why using a long lookback window may be as efficient as querying for trailing quarters.

@Ian Worthington: I apologize for the inconvenience; a few of the fields have been renamed slightly:

manual_renames = {  
    'dividend_yield': 'trailing_dividend_yield',  
    'dividend_rate': 'forward_dividend',  
    'diluted_eps': 'diluted_eps_earnings_reports',  
    'basic_eps': 'basic_eps_earnings_reports',  
    'basic_eps_other_gains_losses': (  
        'basic_eps_other_gains_losses_earnings_reports'  
    ),  
    'diluted_eps_other_gains_losses': (  
        'diluted_eps_other_gains_losses_earnings_reports'  
    ),  
    'basic_average_shares': 'basic_average_shares_earnings_reports',  
    'diluted_average_shares': 'diluted_average_shares_earnings_reports',  
    'average_dilution_earn': 'average_dilution_earnings',  
    'basic_extraordinary': 'basic_extraordinary_earnings_reports',  
    'normalized_basic_eps': 'normalized_basic_eps_earnings_reports',  
    'diluted_discontinuous_operations': (  
        'diluted_discontinuous_operations_earnings_reports'  
    ),  
    'diluted_continuous_operations': (  
        'diluted_continuous_operations_earnings_reports'  
    ),  
    'basic_accounting_change': (  
        'basic_accounting_change_earnings_reports'  
    ),  
    'continuing_and_discontinued_basic_eps': (  
        'continuing_and_discontinued_basic_eps_earnings_reports'  
    ),  
    'diluted_extraordinary': 'diluted_extraordinary_earnings_reports',  
    'tax_loss_carryforward_diluted_eps': (  
        'tax_loss_carryforward_diluted_eps_earnings_reports'  
    ),  
    'tax_loss_carryforward_basic_eps': (  
        'tax_loss_carryforward_basic_eps_earnings_reports'  
    ),  
    'basic_discontinuous_operations': (  
        'basic_discontinuous_operations_earnings_reports'  
    ),  
    'continuing_and_discontinued_diluted_eps': (  
        'continuing_and_discontinued_diluted_eps_earnings_reports'  
    ),  
    'basic_continuous_operations': (  
        'basic_continuous_operations_earnings_reports'  
    ),  
    'normalized_diluted_eps': 'normalized_diluted_eps_earnings_reports',  
    'dividend_per_share': 'dividend_per_share_earnings_reports',  
    'diluted_accounting_change': 'diluted_accounting_change_earnings_reports',  
}

The _earnings_reports suffix denotes that the attribute is about the entire company, and we may have an attribute of the same name for each share class.

The other renames are to clarify which direction a field is looking, for example: 'dividend_yield': 'trailing_dividend_yield'. We also have forward_dividend_yield so we wanted to clarify what these fields meant.
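
As a concrete example of applying one of these renames when porting (a sketch; the old path for dividend_yield is the one mentioned elsewhere in this thread):

# Old namespace:
from quantopian.pipeline.data import morningstar
dividend_yield = morningstar.valuation_ratios.dividend_yield.latest

# New namespace (field renamed to make the direction explicit):
from quantopian.pipeline.data import Fundamentals
dividend_yield = Fundamentals.trailing_dividend_yield.latest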

I am currently looking into the other issues posted in this thread. Thank you all for the great feedback and testing!

@Costantino

2016-01-01 (the actual date is 2016-01-04; 2016-01-01 is a Friday) gives me duplicates...

I tried with dates 2003-01-01, 2004-01-01, 2005-01-01 and each time I get duplicate values.
2006, 2007 are fine. 2008-2011 again duplicates.
2012-2013 is fine. 2014-2016 again duplicates (all tested on 01-01)
2017-01-01 is fine again.
What is going on here??

EDIT: changing windows to 1, 70, 130 and 195 seems to solve it for now.

@Joe, there could be more Morningstar data that has offset or data changes. I will do some debugging on my end to narrow down why one of my algos' performance changed when switching to the new format. The returns have actually gone up, but my concern is with a drawdown in one period that spiked in the new format. I will spend some cycles this weekend and email you some data points. I was hoping it was a generic offset issue, but apparently not.

To everyone on this thread, thank you for all of the bug reports. These are extremely helpful to us.

@Costantino: The issue that you reported with the pipeline results changing based on the lookback window length is indeed a bug. Joe has identified the problem and is working on a fix. Thanks for digging it up! We can post more detail on the issue once the fix is up.

@Leo: As you suggested, the changes to the data might not be systematic. It's more likely that there are changes that don't span an entire field or date. If you find a pattern, or simply report a collection of changes that you noticed between algos that you think might be incorrect, we can certainly take a look.

@Jamie, I have been able to narrow down the problem (why a simple text substitution of morningstar.* to Fundamentals.* was giving me different performance), and I have also been able to figure out how to change my algo using Fundamentals.* to get back my original performance (from morningstar.*). All I had to do to fix the problem was expand the window_length and negate the effect of the expanded window with a matching start offset, so that I access the same offsets I was using in the original algo. I don't want to give out my algo here, but I will produce a different algo that illustrates the issue quite convincingly and send it over the weekend.

@Jamie, I sent you two algos whose performances are different after switching from morningstar.* to Fundamentals.* in the custom factor. Hope that helps.

@Joe, I sent you two algorithms as well: Morningstar-operation_margin and Fundamentals-operation_margin. They are identical algorithms except for morningstar.* being changed to Fundamentals.*.
I ran each algo with the settings below. The performances are different, as listed. I have tried different Morningstar variables in this algo (about 3 or 4), and every time the performance is different even though the only change I have made is morningstar.* to the corresponding Fundamentals.*. Please let me know if the algos I sent (along with the ones I sent to Jamie) can be used for debugging, or if you need something in research that is easier to debug, and I can try to get that as well.
Settings:
From 2003-10-01 to 2017-07-01 with $10,000,000 initial capital

                    Morningstar-operation_margin   Fundamentals-operation_margin
Total Returns       -16.61%                        -7.75%
Benchmark Returns   217.1%                         217.1%
Alpha               -0.01                          -0.00
Beta                -0.01                          -0.02
Sharpe              -0.25                          -0.09
Sortino             -0.35                          -0.13
Volatility          0.05                           0.05
Max Drawdown        -26.12%                        -22.19%

@Jamie and Joe,

The attached notebook shows the difference in quarterly data between the old and new API with window lengths of 1, 65, 130 and 195.

You should also try it with dates 2014-01-05 and 2014-01-15!
Three different discrepancies between the APIs...

@Joe, Thanks for your full explanation, I'm quite impressed!
I understand there will be no performance gain, but the thing I don't like is using offsets like 1, 65, 130 and 195. What if the company filed later? The risk of duplicates is high... A way to avoid this problem could be to pass a timeframe along with the window_length. For example, window_length=4 and timeframe=quarterly to get the last four quarters, instead of window_length=196 and then the indexes above.

I have another question that could maybe further speed up performance, at least in backtesting.
Pipeline data is always updated every day, isn't it? At least it was so with the previous mechanism.
What if an algorithm trades less frequently, for example quarterly? Wouldn't it be better for performance if the data were updated only when pipeline_output() is invoked?

Thanks
Costantino

Leo,

Thanks for sending over those strategies. Once we have finished fixing the bug related to the lookback window length, we can run them again and see if the differences persist.

Just so I'm clear about this: (modulo the outstanding bugs)

General conversion rule from morningstar fundamentals to Q Fundamentals:
- Replace morningstar by Fundamentals.
- Remove all intermediate hierarchy in the name.
e.g.

operating_income = morningstar.income_statement.operating_income.latest

transforms to

operating_income = Fundamentals.operating_income.latest  

I used this conversion rule with my algos and it works, yet I can't find this stated anywhere except by example.

@Jamie says above "At the end of September, the old namespace will be deprecated and redirected to point at the new data. If you don’t manually update your notebooks or algorithms to use the new namespace, they will automatically start to use the new data."

How does this impact algos currently running in a contest? E.g. if they were entered months ago and are mid-way through the six month contest ... what happens? I am not able to update them since they are locked for the contest. Or will existing contest entries continue to function in the contextual namespace they were started within?

@Alan: I'm glad that you got the new version to work - you got the general conversion rule correct. The purpose of this post was to announce the upcoming change and demonstrate how to convert to the faster namespace. We're in the process of updating the documentation to the new namespace, but we're also working out a few bugs that have been reported further up in this thread.

@Marc: You shouldn't need to update your contest algos. At the end of September, the quantopian.pipeline.data.morningstar module will start pointing to the quantopian.pipeline.data.Fundamentals dataset. This will be treated like other data corrections in the contest. Your paper trading track record up to the end of September will not be affected, but going forward, the algorithm will warm up and act on the new version of the dataset. No code action should be required, but we advise you to backtest a version of your contest algorithm with the new namespace. If you uncover any issues with the new version, let us know and we will work to fix them before the cutover date. Does this help to clarify things?

Hi Joe and Jamie,

I'm having trouble migrating my algorithms over to use the new fundamental data. Some of the new fields' values don't look correct:

  • new trailing_dividend_yield is always NaN (does not match the old valuation_ratios.dividend_yield field)
  • new file_date does not always have up to date earnings filing date e.g. Autodesk on 2017-09-01 new filing date is 2017-05-31 but old field (financial_statement_filing.file_date) is (correctly) 2017-08-24
  • BusinessDaysSincePreviousEvent() factor returns a large negative number if passed the new file_date, but works ok if passed the old field

The attached notebook shows the problems. I'm a Python and Quantopian newbie so it's possible that my code is incorrect.

Forgot to attach the notebook, so here it is!

@Jamie/@Joe, I plan to test this feature/enhancements (with various fundamentals and window lengths quite a bit) after you announce the fixes.

I realized that the time spent now in QA/testing of this feature will potentially save 10-100x the time that could be wasted later in dealing with bugs in fundamental data retrieval, if those issues are not found and addressed now.

I have a long/short algo that I run live (until the end of the month that is). I updated the algo with the new Fundamental code and as of live trading yesterday the portfolio symbol set that was traded is significantly different from the one produced by the backtest code. Is there a reason for this?

Can someone at Q help?

Thanks

@Christopher: Thanks for reporting that issue. We're investigating the cause of the NaNs.

@Leo: We'll post back here as soon as the fixes are made. We're hoping to have an update on Monday with the majority of the bugs fixed.

@Kamran: Did you compare a backtest of your strategy with the two versions? I'm curious if you only saw a problem in live trading or if you saw the same change in backtesting. There were several data corrections so it's possible that if you're using dynamic universe selection that the names coming out of the pipeline have changed. If you'd prefer to respond privately, feel free to email in to [email protected].

Jamie, I did compare with the backtest. I printed out the symbols and the difference was, for some reason, significant (a delta of 60 for a 100-position portfolio). The live portfolio traded yesterday, so I ran the backtest today to compare the symbols and found the delta. This makes no sense, as the code running live is identical. I am confused!

BTW, one more test. If I start the backtest from only two weeks ago, then the backtest symbols match the live ones in the updated Fundamentals version. If I start the algo from earlier times, the portfolio symbols no longer match. I usually run the backtest from the beginning of the year. This should not matter, as the list of symbols is generated from the fundamentals factors at the beginning of the day, so the start of the backtest should not impact the ending portfolio symbol set.

This is strange ...

Hi Kamran,

It sounds like you're running into a bug that was reported earlier in this thread. We're working on the fix and hope to have it updated some time next week. Sorry for the confusion.

Thanks Jamie, Yeah this was a bit confusing and created some unnecessary trades.

Can we assume then that we really don't need to change anything as far as the code is concerned, since these namespaces will be merged? That way I can leave the old code as is and wait for the back end to switch over.

Yes, you can leave the old version and it will switch over at the end of the month. However, I recommend that you try out the new version again next week after we've posted the update to make sure that the odd behavior that you experienced goes away.

Ran into the same issue when using Q3000US() from quantopian.pipeline.filters.fundamentals instead of from quantopian.pipeline.filters.
Both in live trading and in backtests, the returned universe is way too small after mid-July 2017 (showing only some 100 equities instead of 3000).

Question: For affected live trading contest entries, do we have a hope of rerunning those for the affected dates after the bugfixes are applied? That change created quite some weird effects in the live trading.

Thank you for the work on this optimization. Could you also post a list which includes all the available fundamental data in the new implementation? A few metrics I would like to use do not seem to appear in the new implementation.
Something like https://www.quantopian.com/help/fundamentals would be great.

When I am trying the new fundamentals API out today (Sept 15), the Sector values are -1 for most companies.

Furthermore, the country codes don't always match between morningstar data and new Fundamental data.

Furthermore, the shares outstanding numbers are different and often missing altogether in the new Fundamentals data.

Whoa! The free cash flow figures are even more off, sometimes on the opposite side of the positive - negative scale. Which numbers can we trust!?

Hi Otto,

Thanks for your help! We strongly suspect these problems you've identified and those previously identified in this thread have a common root cause. We're in the process of testing out our solution and hope to have it pushed out next week for the community.

Josh


Thanks for the great investigative work Otto! Great stuff.

Josh - any thoughts on if/how you're planning to handle affected contest entries yet?

Andy: Did you submit a new contest entry with the new Fundamentals API? Contest entries that were submitted with the old API won't start reading from the new data for a couple more weeks. We're expecting to have the various bugs and issues, including the one impacting the Q3000/Q1500 resolved before then.

Yes, that's exactly what I did, Jamie. Backtests looked fine and so I converted them directly, but then more recent data seems to have issues.

Thanks for clarifying, Andy. Your best bet is probably to withdraw your submission (stop the algorithm) and resubmit with the old morningstar Pipeline API before the 9:30AM ET deadline on Oct. 2. Alternatively, you can wait until we publish the bug fixes and resubmit with the new API, but the old API will be redirected to the new system at the end of the month anyway, so there shouldn't be much difference between the two options. Unfortunately, we won't be able to re-run the affected dates of entries that have already been submitted with the new namespace. Sorry for the inconvenience.

Bummer. Thanks for clarifying Jamie. Appreciate your help.

Hey everyone, we released fixes in both research and algorithms for several of the problems that were reported in this thread:
- Data should no longer be changing with different window lengths. This issue was actually the root cause of many of the problems described in this thread. We believe the problem is fixed with all fields except for the file_date field, which we are still working to solve.
- The Q Universe should now return the appropriate number of securities for any window.
- Classifiers such as the profitability grade are now working.

We are still working on a couple of issues:
- As mentioned above, the file_date still changes in a couple of cases with the window length, so we are working to fix this.
- The trailing_dividend_yield still returns NaN. This is also on our list to fix.

If you are still having issues with fields other than file_date or trailing_dividend_yield, please let us know. And thank you for all of your continued hard work uncovering these problems!

Here is a notebook demonstrating the fixes.

@Jamie, still seeing a difference in total returns and max drawdown in one QA algorithm that I created for this. I have emailed it to you. Will QA some more with different window_lengths and fundamentals.

Jamie, I ran my old test and the symbol list issue that I had reported seems to have been fixed in this version.
Thanks

BTW, on a separate note, the SPLS symbol gets selected for trading even though data.can_trade is checking for it. This symbol has not been tradable as of last month. Is there a reason for it still being selected as tradable?

Kamran, I'm glad that the issue with the symbol list has been resolved. Regarding the data.can_trade question, it's tough to say what the issue is without seeing the code. Would you be able to open a ticket with support and grant the permission to look?

Thanks,
Jamie

I’m still having problems with the Morningstar sector code (formerly [morningstar.asset_classification.morningstar_sector_code], now [Fundamentals.morningstar_sector_code]).

While most stocks have the appropriate sector code, many financials that were formerly classified as 103 are now -1. Let me know if you need any logs. Good luck with the fix!

@Jamie,

When changing the date, it seems the window length still has some bugs..

Thanks for all the hard work!

Donny, do you think you could point me to the particular numbers that I should be looking at in the notebook? Is there a difference between the old and new fundamentals that you were looking at, or should I be changing the window length of one of the terms?

Hi Everyone,

We are pushing back the date on which the old fundamentals namespace will start pointing to the new data. There are still some open issues with the new system that we would like to solve before switching over. We still expect the redirection to occur within a couple of weeks, and I'll post back here when we decide on a target date.

Thanks again for all of your continued help improving the new system.

Jamie,

I'm not completely sure I understand your question.

I have made some changes in the notebook that I posted on Aug 31, 2017.
Now I use pipe for the new fundamentals and pipe2 for the old morningstar to avoid any mistakes while loading the data.

I still use 1, 65, 130 and 195 as window length (same as in the notebook)

For example (as you should see in the notebook?) the first Equity in the list 'ARNC' (and many others):

NEW Fundamentals (2004-01-12 00:00:00+00:00, Equity(2 [ARNC])):
EBIT_Q1 NEW    EBIT_Q2 NEW    EBIT_Q3 NEW    EBIT_Q4 NEW
506000000.0    506000000.0    473000000.0    4.500000e+08

OLD Morningstar (2004-01-12 00:00:00+00:00, Equity(2 [ARNC])):
EBIT_Q1 OLD    EBIT_Q2 OLD    EBIT_Q3 OLD    EBIT_Q4 OLD
506000000.0    473000000.0    4.500000e+08   -8.900000e+07

Q1 and Q2 with NEW fundamentals are the same (=Q1 OLD)
Q3 NEW = Q2 OLD
Q4 NEW = Q3 OLD

I hope this answers your questions.

I found that many valuation ratios are different between the new and old fundamental data!
For example, the new and old systems give different PB ratios!
One of my value strategies picks different stocks for its portfolio, resulting in a very different return in backtesting.
So my question is: which one is correct?
Thanks!

Stanley,

As posted by Jamie in his first post:

In addition to performance improvements, there are changes to the underlying data. The new implementation includes a large number of corrections across many assets and many fields.

Thanks for the reply; I know they changed some underlying data. I just want confirmation from the Q team that the new fundamental data is more accurate, given the recent problems with it. Besides, my backtesting results using the old data consistently beat the results with the new data, especially for concentrated portfolio strategies (maybe it's just a coincidence!). I hope the changes in the new data are accurate and reliable for Q's users. In my opinion, a really great research environment should have abundant and clean data; important data that many users backtest with, like valuation ratios and other basic ratios, should stay the same over time no matter what system improvements are made.

Stanley,

Thanks for reporting the differences. Donny is right about the fact that there are several data points that are being corrected in the new system, but it's still helpful to hear about major differences in backtest results. We are working on updating new fundamentals with another round of bug fixes in the next few days. These fixes are already available in research and will soon be available in algorithms and paper trading. I'll post back here when we push the update.

@Stanley,

I noticed corrections in Enterprise value.
Those corrections also reduced my returns. Costantino posted a while ago that he also had a drop in returns with the new Fundamentals.
Seems like it's not a coincidence. Let's hope the bug fixes improve our returns ;-).

@Jamie,

Maybe it's a bit off topic, but from the introduction of the new Fundamentals up until a week ago, I was only working in research.
I made the stupid mistake of not running a full backtest on my best-performing algo with morningstar.
I think I remember the settings I used, but I'm unable to replicate those returns now. I've been wondering for almost a week whether there have also been changes in the old morningstar.
My algos with a yearly rebalancing frequency perform the same. Those with a higher rebalancing frequency do not.
Am I losing my mind here, or has something changed with morningstar or something else?

Donny, nothing was intentionally changed with the old fundamentals API, and we haven't had any other reports of changes in algos that use the old API. Unfortunately without the full backtest, I don't have a good way to compare the results. If you happen to find a full backtest that achieves the same results that you are remembering, please let me know so that we can compare it to the new result.

We pushed out fixes for all the remaining known issues with the new fundamentals API. We still expect differences to exist between the old and new systems. When looking at differences between the old and new API, it can be helpful to look at the 'asof dates' of the field in question. For example, I've attached a modified version of Stanley's notebook from earlier in this thread which displays the 'asof dates' for PB ratio. You can see that the differences in values are due to the fact that the new system is picking up more recent data points that were missed in the old system.

Another source of differences for fields coming from earnings reports is that preliminary reports were sometimes being picked up in the old system. This is no longer the case in the new system, so the values you see in the new system will be as reported (picks up restatements when they occur).

If anyone still sees differences that they don't believe are explained by either of the changes I mentioned above, please let me know. Barring any other major bugs being discovered, we plan to redirect the old API to start pointing at the new system on Wednesday, Nov. 1.

Jamie,

Bad news!
Those bugs with the window length still exist.

I've assembled a notebook that shows the problem.

The new API changes data on 2003-11-14.
The old API changes data on 2003-11-17. (Nothing weird so far)

But:
If I use window length = 34 on 2003-11-17 with the new API, I get the same data as with window length = 1. (This is wrong)
Changing window length to 35 gives me the correct data.

Using window length 1 and 2 with the old API on 2003-11-17 gives the correct data.

How can I get 'asof_date' through a custom factor? It returns numbers instead of dates.
Adding .latest gives an error..

Donny,

Thanks for sharing that notebook. I took a look, and the results from the pipeline are what I expect them to be with window length 35 (as well as lengths 1 and 34). The asof date for AAPL's EBIT suggests that it was updated on 2003-09-30. When the window_length is changed from 34 to 35 on 2003-11-17, the custom factor is now reaching back past the last update and instead returns the previously known value, published on 2003-06-30. I've attached a modified version of your notebook that demonstrates this a little more clearly.

Edit - The below statement is incorrect, see later post for correct version.
~When using the 'asof_date' in a custom factor, the date is converted to a floating point value (epoch time), since Factors in pipeline have numerical outputs by definition. If you actually print out the values inside the CustomFactor's compute function, you can see that it's a datetime. I think you can create a CustomClassifierto get the latest asof_date(see the code in Zipline), but it's not an officially supported function on Quantopian so my preference is to convert it to a timestamp outside of Pipeline for now. You can see how I did this in the attached notebook.~

Let me know if this example answers your questions or if there's anything that I missed.

@Karl,

Can you share the algorithm or notebook that you ran which raised that Runtime Error? Attaching the code in this thread or sending it in via our support channel if you want to keep it private would be very helpful. Without more context, like the particular datasets/fields that are being used, it's difficult to help debug the issue.

Thanks,
Jamie

Karl, sending the code in def myPipeline() might be enough. If you post it here, or send it to [email protected], I can take a look. Thanks!

@Donny, one of our engineers pointed out to me that there's a mistake in the last notebook I shared with regards to using datetimes in Pipeline. Factors in pipeline should output numerical values or datetimes. That being said, the default output type of a factor is float, so you need to specify the dtype of the factor if you want it to output a datetime. Attached is a corrected version of the notebook that I posted earlier.
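
For readers without the notebook, a minimal sketch of a datetime-dtype custom factor along those lines (the factor in Jamie's notebook may differ; the input must be a datetime column such as an asof_date or file_date column):

import numpy as np
from quantopian.pipeline import CustomFactor

class PreviousAsOf(CustomFactor):
    # Output datetimes instead of the default float dtype.
    dtype = np.dtype('datetime64[ns]')
    window_length = 1

    def compute(self, today, assets, out, dates):
        # Most recent date in the window for each asset.
        out[:] = dates[-1]

# Example usage (pass any datetime column as the input):
# last_update = PreviousAsOf(inputs=[Fundamentals.file_date])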

Thanks for the replies and for explaining the differences in the data.
I have a question about how to get the latest updated PB ratio data. As in the notebook I posted, I'd like to remove the securities whose newpb_date didn't update before the day I run the pipeline. So in this case I used pbtimerank > 100 to filter them out, but I still see some securities like Equity(3735 [HPQ]) that updated on 2016-03-03 and not on 2016-12-01.
I am a beginner in Python, so I can't think of a good way to solve this.
Is there a way to remove them more precisely in the pipeline, other than this strange method of adjusting the number in pbtimerank?
Thanks!

Jamie,

About the asof dates, thanks for the PreviousAsOf CustomFactor!

But about the data for Q1, Q2 and so on: I don't understand your example.
The asof date for EBIT suggests that it was updated on 2003-09-30, when in fact it was only updated on 2003-11-14!

So up until 2003-11-14 using window length 1, 34 and 35 gives 9000000. Nothing fancy here.
On 2003-11-14 window length 1 gives the new value 132000000. Window lengths 34 and 35 the old value 9000000.

Now it's getting weird for me.
On 2003-11-17 window length 1 gives 132000000 as it should.
But now all of a sudden window length 34 also gives 132000000! I just don't get that.

As clearly seen in 'EBIT Q1 NEW', the update happened on 2003-11-14
So 34 trading days ago starting from 2003-11-17 is way past 2003-11-14.
I think that window length 34 should still give a value of 9000000.

If not, then I would like to know how we can get the 4 most recent, different values for the quarterly data.
At this moment, window lengths of 1, 65, 130 and 195 don't do the job anymore...

Ah, I think I understand where the confusion is coming from now, Donny. Quarterly earnings data is generally only available some time after the end of the quarter.

What's happening here is that we learned about the 2003-09-30 data point (EBIT = 132000000) on 11-14. On 11-14, the term with window_length=1 in the pipeline asks for the 'most recently known EBIT for AAPL', which is now 132000000. Also on 11-14, the term with window_length=34 is asking for the value from 34 trading days ago (34 trading days ago was 9-29). Even though a more recent data point exists, the pipeline explicitly asks for the value from 34 trading days ago, which is 9000000.

On 11-17, the next trading day, the pipeline term with window_length=34 is now asking for the data from 9-30, which is 132000000. The window_length=35 term is still asking for data from 9-29, which is 9000000. The next day, the window_length=35 term picks up the 132000000 data point from 9-30.

The important thing to take away from this is that the asof date does not represent when we learned about a data point. Internally, we keep track of what we call a 'timestamp' or knowledge datetime of when we learn about a particular data point from Morningstar. The 'timestamp' determines when a data point is known by pipeline, while the asof date is the date for which the data point applies. I made a post earlier this year that discussed the concept of asof dates and timestamps and how we use them with our partner datasets. The exact implementation is a little bit different for fundamental data (we don't store publicly available base tables and delta tables), but the concepts are the same. One of our data engineers also ran a webinar last week that discusses similar concepts. The webinar was recorded and should be made available some time this week. I'll post a link to it here once it's up. I'd recommend watching it if you're interested in this topic.
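To make the window mechanics concrete, here's a small sketch of the three terms being discussed (ValueNDaysAgo is a hypothetical helper, not something from the attached notebook, and EBIT is just the example field):

from quantopian.pipeline import CustomFactor, Pipeline
from quantopian.pipeline.data import Fundamentals

class ValueNDaysAgo(CustomFactor):
    # Returns the value at the oldest row of the lookback window, i.e. the
    # value from the start of a window of `window_length` trading days.
    inputs = [Fundamentals.ebit]

    def compute(self, today, assets, out, ebit):
        out[:] = ebit[0]

pipe = Pipeline(columns={
    'ebit_wl_1': ValueNDaysAgo(window_length=1),
    'ebit_wl_34': ValueNDaysAgo(window_length=34),
    'ebit_wl_35': ValueNDaysAgo(window_length=35),
})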

Let me know if this explanation helps.

Actually, I should add a bit to my explanation. As you know, Quantopian didn't exist yet in 2003, so we didn't actually learn about the 2003-09-30 data point on 2003-11-14. However, because we don't always learn about quarterly earnings report data the day after the end of the quarter, there's an artificial lag placed on quarterly earnings fundamental data prior to when we actually started collecting the data live in 2013. Essentially, 'timestamps'/knowledge datetimes are simulated for the period before we started actually collecting them in 2013.

I hope this helps.

@Stanley, the best way to filter out securities with an asof_date older than the previous trading day in a pipeline is with the BusinessDaysSincePreviousEvent factor. I've attached a modified version of your notebook that demonstrates how to use it.
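Roughly speaking, the usage looks like this. This is a sketch rather than the exact code from the attached notebook; the import path for the factor and the pb_ratio_asof_date column name are assumptions:

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import Fundamentals
from quantopian.pipeline.factors import BusinessDaysSincePreviousEvent

# Number of business days since the pb_ratio asof_date last changed.
days_since_pb_update = BusinessDaysSincePreviousEvent(
    inputs=[Fundamentals.pb_ratio_asof_date],
)

pipe = Pipeline(
    columns={'pb_ratio': Fundamentals.pb_ratio.latest},
    # Keep only securities whose pb_ratio was updated within the last trading day.
    screen=(days_since_pb_update <= 1),
)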

Let me know if this is what you were looking for.

@Donny: Here is the recording of the webinar discussing knowledge dates and asof_dates by Kathryn, one of Quantopian's data engineers.

@Jamie: This is exactly what I want, thanks!
I tested different dates in the notebook you posted. It looks like the PB ratio and some other ratios were updated at the end of each month before 2015 and updated more often after 2015?

Hi Stanley,

Thanks for pointing that out. Our Morningstar fundamental data integration began in June 2014. In order to backfill data prior to 2014, Morningstar sent us historical data going as far back as 2002. The data that we got for daily-updating fields like pb_ratio is only updated monthly in these historical files. If you run a pipeline that crosses over the end of a month (see attached notebook), you can see that you get a new value at the end of each month for pb_ratio. Starting mid-2014, we started collecting the data live and building our own collection of historical data where daily fields are updated daily. If you are using a filter that requires these fields to have been updated the previous day (good practice!), you should consider the difference before and after June 2014. In fact, you might want to restrict your backtesting to start after June 2014 so that the update pattern reflects what you can expect in live trading.
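If you want to see the update pattern for yourself, one quick way (a sketch for Research; the pb_ratio_asof_date column name is assumed) is to include the asof_date column in the pipeline output and watch when it changes:

from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import Fundamentals
from quantopian.research import run_pipeline

pipe = Pipeline(columns={
    'pb_ratio': Fundamentals.pb_ratio.latest,
    'pb_ratio_asof': Fundamentals.pb_ratio_asof_date.latest,
})

# Before mid-2014, pb_ratio_asof should only change near month ends;
# after mid-2014, it should change daily for most assets.
result = run_pipeline(pipe, '2013-11-01', '2013-12-31')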

Let me know if this explanation helps and sorry for the confusion.

@Jamie:
Thanks! It's very helpful.

Jamie,

Thank you for the link to the webinar. I've watched it and I think I understand most of it.

But I'm still stuck on how to get the 4 most recent, distinct values for the quarterly data.
With the old morningstar you could use 1, 65, 130 and 195 as window length.

With the new Fundamentals I guess you have to calculate the trading days between 'today' (= Q1) and the most recent 'asof date' (-1 = Q2).
Then you have to add 65 trading days (with some margin) to get Q3 and so on?

Is this the correct way to get Q1, Q2, Q3 and Q4?
And how do you implement that in research and backtesting?

Hi Donny,

The system that you used with the old fundamentals should work with the same degree of success for determining the last 4 quarters' values in the new system. In general, this should get you the last 4 quarterly data points. I can imagine cases where this heuristic fails, like if the most recent quarterly report is late, which could cause the most recent two data points to be the same. As you mentioned, a better way of doing this might be to use the 'asof date' of the field in question. I haven't implemented this before, so I don't have an implementation to share, but if you want to try the solution that you suggested, where you pick the most recent asof date and then add 65 trading days to the lookback to get each previous quarter, you might want to use the BusinessDaysSincePreviousEvent factor that I used in my response to Stanley just above.

The 1/65/130/195 lookbacks should be a good heuristic. A more accurate solution will certainly be more involved and take some work.
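For concreteness, the 1/65/130/195 heuristic can be written as a single multi-output factor along these lines (a sketch, not taken from the attached notebooks; EBIT is just an example field):

from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data import Fundamentals

class QuarterlyLookbacks(CustomFactor):
    # Row -1 is the most recent value; rows -65, -130, and -195 match what
    # separate terms with window_length 65, 130, and 195 would return.
    inputs = [Fundamentals.ebit]
    outputs = ['q1', 'q2', 'q3', 'q4']
    window_length = 195

    def compute(self, today, assets, out, ebit):
        out.q1[:] = ebit[-1]
        out.q2[:] = ebit[-65]
        out.q3[:] = ebit[-130]
        out.q4[:] = ebit[-195]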

Why is it that, when using BusinessDaysSincePreviousEvent with a window_length different from zero, for example:

'sales_t_0065_update': BusinessDaysSincePreviousEvent(inputs = [Fundamentals.total_revenue_asof_date], window_length = 65),

the following error occurs:

/build/src/qexec_repo/zipline_repo/zipline/pipeline/factors/events.pyc in _compute(self, arrays, dates, assets, mask)
     35  
     36         # Coerce from [ns] to [D] for numpy busday_count.  
---> 37         announce_dates = arrays[0].astype(datetime64D_dtype)  
     38  
     39         # Set masked values to NaT.

AttributeError: 'zipline.lib._int64window.AdjustedArrayWindow' object has no attribute 'astype'  

Is it a bug or a misuse of BusinessDaysSincePreviousEvent?

I computed the median of the BusinessDaysSincePreviousEvent results in the Q3000 universe and it was 68.
Therefore, 1/70/140/210 lookbacks are better than 1/65/130/195 for reducing the risk of duplicate data points.

Costa,

The BusinessDaysSincePreviousEvent factor does not support setting a window_length. It always returns the number of business days from today since the previous event in the inputs term. Ideally, the factor would raise a clear error when you specify a window_length; sorry about that.
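In other words, the construction from your example should just drop the window_length argument. A minimal sketch (import path assumed):

from quantopian.pipeline.data import Fundamentals
from quantopian.pipeline.factors import BusinessDaysSincePreviousEvent

# No window_length: the factor always measures business days since the
# previous change of the asof_date, relative to the current simulation day.
sales_update_age = BusinessDaysSincePreviousEvent(
    inputs=[Fundamentals.total_revenue_asof_date],
)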

Jamie,

The 1/65/130/195 lookbacks don't perform with the same degree of success in the new system at all. They perform far worse.
I've demonstrated this in the attached notebook. I've tried to show you this before.

The reason, as I understand it, is that the old system picked up new data at each new data point or timestamp (not the asof date), no matter what lookback you used.
So the gap between two consecutive data points (or quarterly values) was always more or less 65 (or 68, as Costantino calculated).

With the new system, window length 1 picks up new data at a new data point (like in the old system) but all the other lookbacks start to change data on the asof dates. You demonstrated this here.

That new behaviour increases the gap between Q1 (lookback 1) and Q2 (lookback 65 + ('current' date - asof date)),
frequently resulting in wrong values for Q2, Q3, and Q4 with lookbacks 65/130/195.

Jamie,
Has the trailing_dividend_yield NaN problem been fixed? After reading through the replies on this thread I'm not sure whether it's something that you think has been fixed or not. My notebook, attached, still gives NaNs. (The file_date and business_days events now look correct - thanks.)

@Everyone: The old API should be redirected to point at the new data by Monday morning. get_fundamentals will also no longer be available as of Monday.

@Christopher: The NaN problem with trailing_dividend_yield has been fixed and tested on our staging environment, but has not yet made it out to our production environment. The fix should be up on Monday.

@Donny: Thanks for staying patient with me. Your most recent comments helped me understand the root of what's giving you a hard time. In the old system, fundamentals would just start appearing in a pipeline lookback window at the timestamp. In the new system, the timestamp determines when the data becomes available, but the asof_date determines where it appears along the lookback window. The new version is consistent with how other datasets are loaded in lookback windows in Pipeline, which is why we changed the behavior. I've attached a modified version of your notebook with a CustomFactor called LastFourQuarters which should get the last 4 values of a fundamentals field in the new system more reliably. I re-ran the test periods to confirm that it works and it seems to retrieve the correct values. Let me know if this helps.
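For readers without the attachment, the idea is roughly the following. This is only a sketch of the approach, not the notebook's exact LastFourQuarters code; the field names and the window length are illustrative:

import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data import Fundamentals

class LastFourQuartersSketch(CustomFactor):
    # Most recent quarter in q1; assumes roughly a year of trading days
    # (plus margin) covers four quarterly data points.
    inputs = [Fundamentals.ebit, Fundamentals.ebit_asof_date]
    outputs = ['q1', 'q2', 'q3', 'q4']
    window_length = 260

    def compute(self, today, assets, out, values, asof_dates):
        for col in range(values.shape[1]):
            # A change in asof_date marks the start of a new quarterly data point.
            _, first_rows = np.unique(asof_dates[:, col], return_index=True)
            quarterly = values[np.sort(first_rows), col][-4:]
            # Pad with NaN if fewer than 4 quarters are available in the window.
            padded = np.full(4, np.nan)
            padded[4 - len(quarterly):] = quarterly
            out.q1[col] = padded[-1]
            out.q2[col] = padded[-2]
            out.q3[col] = padded[-3]
            out.q4[col] = padded[-4]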

I guess the new API has replaced the old one; the Research environment is incredibly memory efficient now. Well done, thank you!

What is the new name for income_statement.depreciation_amortization_depletion?

With Fundamentals.depreciation_amortization_depletion the following error occurs:

AttributeError: type object 'Fundamentals' has no attribute 'depreciation_amortization_depletion'  

@Luca: That's correct, the old API is now using the new system. I'm glad to hear that this helps reduce memory usage in Research for you!

@Costantino: This weekend, we shipped a change that disambiguates some field names which, we learned, are used by our fundamental data provider to represent multiple data points. depreciation_amortization_depletion is one such field. In this case, we get a depreciation_amortization_depletion value from two different reports: the income_statement and the cash_flow_statement. Sometimes, the data points differ depending on which report they come from. The two versions of depreciation_amortization_depletion can be referenced with depreciation_amortization_depletion_income_statement and depreciation_amortization_depletion_cash_flow_statement, respectively. I apologize for the confusion.
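For example (using the field names from the paragraph above):

from quantopian.pipeline.data import Fundamentals

# The same quantity as reported on each of the two statements.
dad_income_stmt = Fundamentals.depreciation_amortization_depletion_income_statement.latest
dad_cash_flow = Fundamentals.depreciation_amortization_depletion_cash_flow_statement.latest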

@Jamie Thanks a lot for the fast reply!
I assume you're going to update the Fundamental Data Reference at https://www.quantopian.com/help/fundamentals.
Let us know when you've finished.

In my opinion, depreciation_amortization_depletion should always be the same; maybe you should report the inconsistency to Morningstar.

Hi Costantino,

Yes, we'll have to update the Fundamental Data Reference. In the meantime, I recommend using the autocomplete tool. If you import the new Fundamentals namespace and then type Fundamentals. in research or the IDE, you should be able to see all the fields that exist. This isn't a replacement for the reference page, but should help you until it is updated.

The same goes for Fundamentals.net_income (*_income_statement or *_cash_flow_statement).
In this case, I'm sure they have to be the same. If not, there is a problem in the data.

@Jamie, was the change intended to improve memory usage as well as speed? I can't believe Pipeline in Research is now using a quarter of the memory it was using before the change.

@Luca The primary goal of this change was to improve the time it takes to query fundamentals data in Pipeline. We were not specifically optimizing memory usage, but allocation is often a very slow operation. One of the ways to speed up a task is to avoid allocating objects, or to batch allocations into a dense block; both of these also lower the total memory used. The old fundamentals pipeline loader used a lot of memory and didn't scale very well as the query windows grew, so the new system is a big improvement.

What is the data timeliness policy for the improved Fundamentals data? I apologize in advance if this question is off topic.

Morningstar posts 10-K/Q at midnight of the SEC filing date, but Fundamentals seems to lag by many weeks.

Also, Morningstar posts preliminary Rev, OI, NI, EPS, etc. at midnight of the press release date, which may occur weeks before the SEC filing date. Can these preliminary numbers be provided by Fundamentals?

I found timeliness gaps compared to Fundamentals by sampling http://finra-markets.morningstar.com/MarketData/EquityOptions/default.jsp at its Financials -> Quote tab for tickers that have issued a quarterly press release, and at the Financials -> Balance Sheet tab for recent SEC 10-K/Q filings.

Thank you.

Very fast indeed using fundamentals now. Thanks a lot for making these enhancements!

@Jamie,

Thank you very much for your CustomFactors! This helps a lot. I've no idea what the code does, but I'll try to examine it.

One last thing, when using your code in research or backtesting I get the following warning:

WARN numpy/lib/arraysetops.py:200: FutureWarning: In the future, NAT != NAT will be True rather than False.

Something I should ignore?

Well, the rename to trailing_dividend_yield should be seamless, right?

This morning the Dividend Yield call I had been using started to fail.

I'm so happy now that the memory usage is so low and I can run many NBs in parallel :)

There is only one little issue, the one reported by Lionel: the fields that were renamed are not working anymore.
e.g.
'dividend_yield': 'trailing_dividend_yield',
'dividend_rate': 'forward_dividend',

If you use the new name 'morningstar.valuation_ratios.trailing_dividend_yield', that is not found:

AttributeError: type object 'valuation_ratios' has no attribute 'trailing_dividend_yield'

If you use the old name 'morningstar.valuation_ratios.dividend_yield', this error happens:

ValueError: Names ['asof_date', 'sid', 'timestamp', 'trailing_dividend_yield'] not consistent with known names [u'asof_date', u'sid', u'timestamp', u'dividend_yield']  

@Lionel & Luca: We have a fix for the trailing_dividend_yield problem on our staging environment. I will post back here when it makes it out to production. Sorry for the trouble.

@Lionel & Luca: trailing_dividend_yield should now be working properly in backtesting and research.

@Costantino: The fundamentals docs have been updated with the new field names on our staging environment and should make it out to production soon.

Every time I use Pipeline in Research now, I am amazed at how fast it is and how low its memory impact is. We can now explore ideas that were impossible before, due to memory and speed limits. Great job, really.

I am noticing in using the financial_health_grade field that I get no data before July 20, 2008. Is that supposed to be the start date of that data series?

I believe that is by design -- Morningstar does not provide data further back for that particular field.

Why does a 24-factor Pipeline queried for 4 years take more than 20 minutes? Is it normal?

@Jamie
and @AnyOtherPersonWhoIsSmarterThanMe :-)

import numpy as np
from quantopian.pipeline import CustomFactor

class LastFourQuartersAsOfDates(CustomFactor):
    # Get the last 4 unique values of a given asof_date.
    # (The inputs -- an *_asof_date column -- are passed at construction time.)
    outputs = ['q1_asof', 'q2_asof', 'q3_asof', 'q4_asof']  
    window_length = 195 + 65  
    dtype = np.dtype('datetime64[ns]')

    def compute(self, today, assets, out, asof_date):  
        for column_ix in range(asof_date.shape[1]):  
            unique_dates, _ = np.unique(asof_date[:, column_ix], return_index=True)  
            if len(unique_dates) < 4:  
                unique_dates = np.hstack([  
                    np.repeat([np.datetime64('NaT')], 4 - len(unique_dates)),  
                    unique_dates,  
                ])  
            unique_dates = unique_dates[-4:]  
            out[column_ix] = unique_dates  

I love Jamie's custom factor for calculating trailing 4-quarter fundamentals.
Now I'm trying to build something that should theoretically be easier, but I just can't get it right.

Instead of an array indexed by Q1_asof, Q2_asof, etc., with the corresponding filing dates for a fundamental like Net_Income,
I just want to return an integer: the number of unique filing dates for that fundamental within a given time window
(for example, 5 years).

Basically I want to find out if a company had for example 19 unique filing dates in the last 5 years or just 3.

Any idea how it can be done?
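Something along these lines is what I'm after (an untested sketch; I'm using total_revenue_asof_date from earlier in the thread as a stand-in for the Net Income asof_date column):

import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data import Fundamentals

class UniqueFilingDates(CustomFactor):
    # Count the distinct asof_dates a field has had within the window
    # (roughly 5 years of trading days here).
    inputs = [Fundamentals.total_revenue_asof_date]
    window_length = 252 * 5

    def compute(self, today, assets, out, asof_date):
        # View the datetime64 data as int64 so NaT (the minimum int64 value)
        # compares reliably across numpy versions.
        as_ints = asof_date.view('int64')
        nat_value = np.iinfo('int64').min
        for col in range(as_ints.shape[1]):
            dates = as_ints[:, col]
            out[col] = len(np.unique(dates[dates != nat_value]))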

Best regards,
Sebastian

@Sebastian,

Do you get a pipeline timeout using the LastFourQuartersAsOfDates or LastFourQuarters class?

Thanks

I am getting the following error when running the LastFourQuartersAsOfDates factor in a pipeline: ValueError: cannot include dtype 'M' in a buffer. I think it has something to do with outputting in datetime64[ns] format. Any idea what could be causing this issue?

Yes, I get the same error. Something must have changed. ValueError: cannot include dtype 'M' in a buffer.