[BUG] Pipeline VIX

There seems to be a data discrepancy between pipeline and actual data. Pipeline is delayed by a day, which is fine, but it's not pulling the prior day's value correctly either. Column 1 is the Pipeline value, compared against the prior day's actual opening value for VIX.

Date        Pipeline   Actual open (prior day)
09/07/16    12.42      12.42 (09/06/16)
09/08/16    11.89      11.89 (09/07/16)
09/09/16    11.89      11.74 (09/08/16) **
09/12/16    11.74      12.52 (09/09/16) **
09/13/16    20.13      20.13 (09/12/16)

```
from quantopian.algorithm import attach_pipeline, pipeline_output
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.quandl import cboe_vix
# GetVIX is a CustomFactor defined elsewhere in the algorithm (not shown)

def initialize(context):
    my_pipe = Pipeline()
    attach_pipeline(my_pipe, 'my_pipeline')
    my_pipe.add(GetVIX(inputs=[cboe_vix.vix_open]), 'VixOpen')

def before_trading_start(context, data):
    """
    Called every day before market open.
    """
    context.output = pipeline_output('my_pipeline')
    context.vix = context.output["VixOpen"].iloc[-1]
```

Please move this to wherever you guys track bugs; I didn't know where else to post it.

Hello Elsid,

Part of this behavior is expected, but there is an underlying bug in the data for 09/12/16.

I'll explain what I mean by part of it being expected. When your algorithm makes a data request, the platform serves the latest known value for that data point. If we don't receive a timely update for a value (before trading starts), the previous day's value is used (forward-filled). This is what happened on 2016-09-09: we didn't receive the data before trading started, so the platform used the value from 2016-09-08 (11.89). That part of the behavior you see is expected.

What's not expected is the discrepancy in the value for 09/12/16. In the attached notebook you will see how to find the timestamp of each data point, which corresponds to the actual time at which we received the update. You will notice that the values corresponding to 2016-09-08 and 2016-09-09 were received on 2016-09-10. For 09/12/16's data request, the platform should serve the latest known value, which is 2016-09-09's value (12.52), but instead it serves an earlier value. This is an error on our side, and I've made our data team aware of the issue. Thank you for reporting it!
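For anyone who wants to check this themselves, here is a minimal sketch of that timestamp inspection in the research environment. It assumes the interactive version of the dataset (quantopian.interactive.data.quandl), which carries the asof_date and timestamp bookkeeping columns:

```
# Research-notebook sketch; assumes the interactive data API of the era.
import pandas as pd
from odo import odo
from quantopian.interactive.data.quandl import cboe_vix

# Keep only the bookkeeping columns plus the open, for early September.
expr = cboe_vix[['asof_date', 'timestamp', 'vix_open']]
expr = expr[(expr.asof_date >= pd.Timestamp('2016-09-06')) &
            (expr.asof_date <= pd.Timestamp('2016-09-13'))]

vix = odo(expr, pd.DataFrame).sort_values('asof_date')
# asof_date is the date a value is for; timestamp is when Quantopian
# received it. A timestamp well after the asof_date means the update
# arrived late and backtests forward-filled the previous value.
print(vix)
```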

Hey Ernesto,

Also, you guys should look at your definition of expected behavior. I understand that when you don't have the data for the current day, you pre-populate it, but you should be flagging this in the database and running some sort of job to repopulate those dates with the correct value once you obtain it.

Because as it currently stands, if you are using pipeline data, the backtests are honestly useless, given that just for 2016 there were a bunch of miscalculated dates. For example, 09/09/16 shouldn't have stale data anymore; it's been almost 6 months...

I think this is a serious issue, again because it makes backtesting fairly misleading.

One vote from me for NOT repopulating or messing with any past data. Any such change could inject 'lookahead' bias. While the data may seem incorrect, it IS what would have been known as of a specific date. Even though it may be 'stale', that's presumably what came through the data feed as of that date.

This is a great example of how one needs to understand the 'messiness' of big data and account for it in algorithms. Don't expect the data provider (or Quantopian in this case) to do the work for you. It's best to assume that data is 'raw' and then consciously make any adjustments to that data in code. In this case, one could take the pipe output and manually (in the notebook environment) backfill data if the timestamp is more than a day after the as-of date, as in the sketch below. NOTE however, this WON'T work in a backtest or real trading, since your algorithm cannot know future data, which is exactly the behavior one would expect in a backtest.
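To make that notebook-only suggestion concrete, here is a rough sketch. The helper name and column names are hypothetical; it assumes you have the pipeline output and the raw dataset (with its asof_date column) as pandas DataFrames:

```
import pandas as pd

def clean_pipe_output(pipe_out, raw):
    """Hypothetical notebook-only cleanup. Do NOT use this in a
    backtest or live trading: it substitutes values the algorithm
    could not have known at the time.

    pipe_out: pipeline output indexed by simulation date, holding the
              'VixOpen' values the backtest actually saw.
    raw:      the raw dataset as a DataFrame with 'asof_date' and
              'vix_open' columns (the values as eventually known).
    """
    # Re-key the true opens by their as-of date...
    true_open = raw.set_index('asof_date')['vix_open'].sort_index()
    # ...align them to the simulation calendar, then shift one day so
    # each simulation day sees the PRIOR day's true open (the normal
    # one-day pipeline lag), forward-filling over calendar gaps.
    idealized = true_open.reindex(pipe_out.index).shift(1).ffill()
    cleaned = pipe_out.copy()
    cleaned['VixOpen'] = idealized.fillna(cleaned['VixOpen'])
    return cleaned
```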

I'd suggest that the backtest would be MORE misleading if data were repopulated to make it neat and tidy. If you are doing research and not expecting 'real time / real world' data, then simply use the notebook environment and manipulate the data with future-known data as one sees appropriate.

What?? Yes, let's leave data broken and inaccurate because "it is what would have been known" with a shitty data feed. 99.9% of the time you would know the closing values for the day at the close, or shortly thereafter, with a proper data provider.

Yeah, nice suggestion: fix Pipeline's broken data manually instead of fixing the accuracy of pipeline. That's like telling me to fix my exploding Samsung Galaxy phone personally instead of Samsung actually fixing it.

You know, when you trade live you should just get a shitty data feed so you don't give yourself 'lookahead' bias; after all, that's why companies spend millions of dollars on fast, reliable, and ACCURATE data feeds.

I don't even understand the 'lookahead bias' statement, as you would always know yesterday's data:

"Bias created by the use of information or data in a study or simulation that would not have been known or available during the period being analyzed. This will usually lead to inaccurate results in the study or simulation."

Also, people assume broken data will always lead to worse results, when the opposite can also be true: it can lead to overly optimistic backtest results.

@Dan: Great explanation, thank you.

@Elsid: If you want a detailed explanation of how partner datasets are loaded and surfaced on Quantopian, I would recommend reading this post. Essentially, data is only surfaced in a backtest when it is known. In this case, the VIX data for 09/08 was only learned on 09/10. This means that in an algorithm, if you asked for yesterday's VIX value on 09/09, you would get the stale data in your algorithm (11.89). However, if you asked for the VIX value from two days ago on 09/12 (a Monday), you would get the value from 09/08 (11.74), since it was known at that point.

The rationale behind this is exactly what Dan explained: we don't want to introduce lookahead bias. Instead, you can use the asof_date of a dataset to determine whether the data is fresh or stale. For a good example of this, I would recommend watching this webinar and going through the accompanying notebook.
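As an illustration of that asof_date check, here is a minimal sketch. It assumes the dataset exposes asof_date as a pipeline column (as partner datasets did) and runs in the Quantopian IDE, where get_datetime() and log are builtins:

```
import pandas as pd
from quantopian.algorithm import attach_pipeline, pipeline_output
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data.quandl import cboe_vix

def initialize(context):
    pipe = Pipeline(columns={
        'VixOpen': cboe_vix.vix_open.latest,
        'VixAsOf': cboe_vix.asof_date.latest,  # date the value is for
    })
    attach_pipeline(pipe, 'vix_staleness')

def before_trading_start(context, data):
    out = pipeline_output('vix_staleness')
    context.vix = out['VixOpen'].iloc[-1]
    asof = out['VixAsOf'].iloc[-1]
    # Strip the timezone so the subtraction is well-defined, then flag
    # anything older than a few days (the cushion allows for weekends).
    today = pd.Timestamp(get_datetime().date())
    if (today - asof).days > 3:
        log.warn('VIX open is stale; last as-of date was %s' % asof)
```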

Jamie, I understand,

But I'm asking for, say, the 09/09 data six months later, so how is the data not available six months later?

It is available every simulation day in the backtester after the timestamp in the database. If you ask for the last 5 daily values of VIX on 09/13, you will get the values for 09/07, 09/08, 09/09, 09/12, and 09/13.
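For reference, a trailing-window request like that maps onto a CustomFactor with window_length=5. A sketch (MeanVIX5 is a hypothetical name):

```
import numpy as np
from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data.quandl import cboe_vix

class MeanVIX5(CustomFactor):
    """Mean of the last 5 known VIX opens (illustrative only)."""
    inputs = [cboe_vix.vix_open]
    window_length = 5

    def compute(self, today, assets, out, vix_open):
        # vix_open is a (5, n_assets) window of the last known values;
        # per the explanation above, on 09/13 it would hold the values
        # for 09/07, 09/08, 09/09, 09/12, and 09/13.
        out[:] = np.nanmean(vix_open, axis=0)
```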

I don't know if my reply didn't post or it just got deleted. I am asking for those values over 100 days later, and some are still wrong.

This is all the mismatched data for 2016, if anybody even cares...

Backtest date   Pipeline    Prior date   Actual open   Diff
3/29/2016       15.65       3/28/2016    15.65          0
3/30/2016       15.74       3/29/2016    15.74          0
3/31/2016       15.74       3/30/2016    13.69          2.05
4/1/2016        13.69       3/31/2016    13.73         -0.04

7/7/2016        16.05       7/6/2016     15.87          0.18
7/8/2016        14.8        7/7/2016     14.8           0

8/16/2016       11.81       8/15/2016    11.81          0
8/17/2016       12.57       8/16/2016    12.04          0.53
8/18/2016       12.57       8/17/2016    12.57          0
8/19/2016       11.67       8/18/2016    12.2          -0.53
8/22/2016       11.67       8/19/2016    11.67          0
8/23/2016       12.15       8/22/2016    12.53         -0.38
8/24/2016       12.7        8/23/2016    12.15          0.55
8/25/2016       13.62       8/24/2016    12.7           0.92
8/26/2016       13.54       8/25/2016    13.62         -0.08
8/29/2016       13.54       8/26/2016    13.54          0
8/30/2016       14.09       8/29/2016    14.09          0
8/31/2016       12.94       8/30/2016    12.94          0
9/1/2016        13.07       8/31/2016    13.14         -0.07
9/2/2016        13.47       9/1/2016     13.07          0.4
9/6/2016        13.47       9/2/2016     13.47          0
9/7/2016        11.89       9/6/2016     12.42         -0.53
9/8/2016        11.89       9/7/2016     11.89          0
9/9/2016        11.89       9/8/2016     11.74          0.15
9/12/2016       20.13       9/9/2016     12.52          7.61
9/13/2016       20.13       9/12/2016    20.129999      1E-06
9/14/2016       17.63       9/13/2016    15.98          1.65
9/15/2016       17.63       9/14/2016    17.629999      1E-06
9/16/2016       17.97       9/15/2016    17.969999      1E-06
9/19/2016       16.41       9/16/2016    16.41          0

Hi Elsid,

We care immensely about the quality of Quantopian's data. What is happening above is a conflation of two use cases, only one of which we support. Let me start by explaining the case we support, and then I'll comment on the one we do not support.

Case 1: High Fidelity Replay -- Our system is built to replay the data of any day in history exactly as it occurred. As a platform running tens of thousands of algorithms against live data for both simulation and real trading, we encounter numerous data issues. Some of the data issues are vendor problems, and others are processing problems at Quantopian. When we built our data infrastructure, we designed the storage to ensure that we could reproduce the exact behavior of an algorithm on Quantopian's infrastructure from a day in history. Perversely, that meant doing quite a bit of extra work to ensure we could repeat our mistakes as well as our vendor's mistakes.

Preserving these data errors ensures we can reproduce problems, diagnose unexpected behavior, and answer questions like "how often are we late with XYZ feed". Knowing the frequency of failures means we can also run simulations that estimate the drag on performance from data issues.

This treatment is similar in spirit to our treatment of prices. At any point in a simulation, you can see the price as traded on that minute in history.

Case 2: Convenience for Research and Analysis -- While it is very powerful to have a high fidelity data history and this wonderful ability to repeat our mistakes, it makes the first stages of strategy research and development significantly more difficult. The first question in researching a strategy is often: would this work in idealized circumstances? While our data history faithfully represents the shortfalls we had in processing data in the past, these shortfalls often amount to frustrating distractions in the research/development phase. We don't give you an option to ignore past mistakes and thereby focus on your idea, and that means we are missing an important feature.

In Closing
Both of these cases are quite important and valid, and there isn't much benefit to debating whether one is more "correct".

I'd also like to share a comment on the tone of this thread. I replied in spite of your vitriol, only because I thought the underlying use cases were so important. Normally, I don't want to reward such rudeness with attention from our team. If you can think critically enough to recognize this issue, you can think well enough to be polite when discussing it.

Thanks,
fawce

I wasn't trying to be rude. I just find it insulting to my intelligence to have my concerns downplayed or completely ignored with irrelevant information, such as talk of forward-looking bias over a year later about past data, when these are data issues that invalidate your backtests. A single bad data point can swing a strategy +/- 20% in a single day, especially when people are trading such strategies with VIX involved.

A better answer, such as yours, would have been something like "we know about it."

Secondly, I don't quite understand why you can't preserve these data issues in a separate, off-production database, so you could still run your diagnostics etc. while providing properly working functionality for case 2.

Thirdly, until you make whatever decision, there should be a banner or message at the top of every thread stating that there are issues with historical data, and that you should verify the data or import your own. Most people are probably under the impression that backtesting on historical data is good enough, as I was until I was made aware of these data discrepancies. I even traded some strategies live based on backtests, and I see people every day trading strategies live based on conclusions from inaccurate backtests.

When you create a product or service, people usually put faith in you that things work the way they should. Given that you are in the financial field, you are dealing with people and real money; this is not a case of "my Twitter doesn't work properly". Wrong orders, wrong backtests, wrong data: negligence in any area can have significant monetary consequences for people.

Anyway, thanks for a proper response. I hope you at least consider the advice of putting up warnings about the limitations of the current data structures.

When I was trading stuff based on VIX, I brought in the data myself using fetch_csv directly from CBOE.
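For anyone wanting to do the same, here is a sketch of that fetch_csv approach, run in the Quantopian IDE (where fetch_csv is a builtin). The URL, column names, and skiprows are assumptions based on the daily VIX file CBOE published at the time:

```
import pandas as pd

def rename_vix_cols(df):
    # Normalize CBOE's column headers for use in the algorithm.
    return df.rename(columns={'VIX Open': 'vix_open',
                              'VIX Close': 'vix_close'})

def initialize(context):
    fetch_csv(
        'http://www.cboe.com/publish/scheduledtask/mktdata/datahouse/vixcurrent.csv',
        symbol='VIX',              # lookup key for data.current(...)
        skiprows=1,                # the file starts with a disclaimer row
        date_column='Date',
        date_format='%m/%d/%Y',
        post_func=rename_vix_cols,
    )

def handle_data(context, data):
    # The fetched series is addressed by the symbol string given above.
    vix_open = data.current('VIX', 'vix_open')
```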

@Simon,

It is using fetch for the VIX futures contract info; it was using pipeline for the VIX data, but yeah, I'm going to use fetch for that too now. I just wasn't aware that pipeline had data issues.