Quality of Quantopian's data?

Hi,

I have two questions regarding the data that Quantopian uses:

  • Where does Quantopian get its data from?
  • Is the daily data reconstructed from the minute data, or is it from a
    separate database?

Thanks for answering

4 responses

Hello,

We purchased the minute-bar data. We haven't disclosed the vendor's name simply because we don't have their permission to do so! We're working on getting that in a future negotiation.

The daily bars are constructed from the minute bars.
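To illustrate what building daily bars from minute bars involves, here is a minimal Python sketch. It is not Quantopian's actual pipeline; the function name `minute_to_daily` and the tuple layout are assumptions made for the example, and it assumes the minute bars arrive sorted by time and ignores details such as trading calendars, time zones, and corporate actions.

```python
from datetime import datetime


def minute_to_daily(minute_bars):
    """Aggregate minute bars into daily OHLCV bars.

    minute_bars: iterable of (timestamp, open, high, low, close, volume)
    tuples, assumed to be sorted by timestamp (hypothetical layout).
    Returns a dict mapping each calendar date to its daily bar.
    """
    daily = {}
    for ts, o, h, l, c, v in minute_bars:
        day = ts.date()
        bar = daily.get(day)
        if bar is None:
            # First minute of the day sets the open and seeds the other fields.
            daily[day] = {"open": o, "high": h, "low": l, "close": c, "volume": v}
        else:
            bar["high"] = max(bar["high"], h)
            bar["low"] = min(bar["low"], l)
            bar["close"] = c          # last minute seen so far sets the close
            bar["volume"] += v
    return daily
```

The daily open is the first minute's open, the close is the last minute's close, the high and low are the extremes over the day, and the volume is the sum.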

Dan

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Hello William & Dan,

Good question. It is a bit mysterious how OHLCV data for thousands of securities trading on multiple markets could be generated at a minute-level rate. It seems like no small feat. Is the Quantopian vendor actually capturing every trade (even the high-frequency events)? I'd be interested in the details. It seems like way too much data to capture in non-volatile memory, and then post-process, right? Is it a real-time system that captures OHLCV data on-the-fly and streams to disk?
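For what it's worth, on-the-fly aggregation does not require holding every trade in memory. The sketch below is only a guess at the general idea, not a description of how any vendor actually works: the class `MinuteBarBuilder` and its method names are hypothetical, and it assumes trades for each symbol arrive in time order. It keeps one open bar per symbol and emits the bar once a trade from a later minute shows up.

```python
from datetime import datetime


class MinuteBarBuilder:
    """Build minute OHLCV bars from a stream of trades (illustrative only)."""

    def __init__(self):
        self._open_bars = {}   # symbol -> bar currently being built
        self.completed = []    # finished bars, in completion order

    def on_trade(self, symbol, ts, price, size):
        minute = ts.replace(second=0, microsecond=0)
        bar = self._open_bars.get(symbol)
        if bar is not None and bar["minute"] != minute:
            # A trade in a new minute closes the previous bar for this symbol.
            self.completed.append(bar)
            bar = None
        if bar is None:
            bar = {"symbol": symbol, "minute": minute, "open": price,
                   "high": price, "low": price, "close": price, "volume": 0}
            self._open_bars[symbol] = bar
        bar["high"] = max(bar["high"], price)
        bar["low"] = min(bar["low"], price)
        bar["close"] = price
        bar["volume"] += size

    def flush(self):
        """Emit all bars still open, e.g. at the end of the session."""
        self.completed.extend(self._open_bars.values())
        self._open_bars.clear()
```

The state is a handful of numbers per symbol, so memory grows with the number of symbols rather than with the number of trades.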

It is also a mystery why the data are expensive. I've heard various explanations, but none have been convincing. Is it a monopoly situation?

Also, I'd think that your vendor would have no problem with Quantopian advertising their service. What's there to negotiate? Or do you consider the name of the vendor to be a Quantopian competitive secret (which would be a reasonable stance)?

Grant

Data being expensive is nothing new. It's because of licensing fees (paid as an ongoing service rather than a one-off cost, since you want the data updated daily). The top of the chain is the exchange, which can charge very large sums for complete data access to the relatively small number of firms that can pay. That's the top of the pyramid, and the cost filters down.

I assume the data is held "in the cloud". Servers with 128 GB of RAM are not out of the question, but the load is of course spread across multiple machines. It's all about virtual servers these days, where memory is pooled across any number of servers or clusters, so the amount of memory used is not limited to one physical machine and can vary dynamically.

This topic raises a question: how does the data provider's "brand name" relate to the perceived quality of the data?

Would people have a different opinion of the data if it were linked to a particular data provider? This could bias someone either positively or negatively. A high-profile "trusted" provider might cause traders to be overconfident. Similarly, one might be negatively biased by a "low quality" provider.

Generally speaking, data is not to be trusted. If your strategy depends on clean data you will be in trouble. In the real world data is very messy; there are misquotes, errors, etc. that can cause problems. It would be a worthwhile experiment to see what happens to your strategies when some small random noise is added to a real price series.
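One rough way to try the noise experiment is sketched below in plain Python. Everything here is an assumption made for illustration: the helper names, the toy moving-average strategy, and the choice of multiplicative Gaussian noise with a relative standard deviation `rel_sigma`. Both the signals and the P&L are computed on the noisy series, which is a simplification.

```python
import random


def add_noise(prices, rel_sigma, rng):
    """Return a copy of prices with small multiplicative Gaussian noise."""
    return [p * (1.0 + rng.gauss(0.0, rel_sigma)) for p in prices]


def sma_crossover_pnl(prices, fast=3, slow=5):
    """Toy strategy: hold one unit over the next step whenever the fast
    simple moving average is above the slow one; otherwise stay flat."""
    pnl = 0.0
    for t in range(slow - 1, len(prices) - 1):
        fast_ma = sum(prices[t - fast + 1:t + 1]) / fast
        slow_ma = sum(prices[t - slow + 1:t + 1]) / slow
        if fast_ma > slow_ma:
            pnl += prices[t + 1] - prices[t]
    return pnl


def noise_sensitivity(prices, trials=100, rel_sigma=0.001, seed=0):
    """Compare the P&L on clean data with the range seen on noisy copies."""
    rng = random.Random(seed)
    baseline = sma_crossover_pnl(prices)
    results = [sma_crossover_pnl(add_noise(prices, rel_sigma, rng))
               for _ in range(trials)]
    return baseline, min(results), max(results)
```

If the spread between the minimum and maximum P&L is large compared with the baseline, the strategy is probably leaning on details of the data that a misquote or bad tick could easily change.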