Quantopian data with Zipline?

It looks like Zipline's Yahoo data fetcher uses the adjusted data column by default. These prices are adjusted for both splits and dividends. The Quantopian backtester data is adjusted for splits but not dividends.

This can make a difference in algorithms that are making trades based on price data.
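To make the difference concrete, here is a minimal sketch in plain Python. The prices, the split, and the dividend are made-up numbers, and the `adjust` helper is hypothetical (it is not Zipline's or Yahoo's actual code). The dividend factor used here, 1 - dividend / previous close, is one common convention and is an assumption; vendors may do it differently.

```python
# Toy daily closes (made-up numbers, oldest first).
raw_close = [100.0, 102.0, 51.0, 50.0, 52.0]
# A 2-for-1 split takes effect on day index 2.
splits = {2: 2.0}
# A $1.00 cash dividend goes ex on day index 3.
dividends = {3: 1.0}


def adjust(closes, splits, dividends=None):
    """Back-adjust closes so earlier prices are comparable to the latest one.

    Walks backward from the newest bar, accumulating a factor that is applied
    to every bar *before* each event. The dividend factor,
    1 - dividend / previous close, is an assumed convention.
    """
    dividends = dividends or {}
    factor = 1.0
    adjusted = [0.0] * len(closes)
    for i in range(len(closes) - 1, -1, -1):
        adjusted[i] = closes[i] * factor
        if i in splits:
            factor /= splits[i]
        if i in dividends and i > 0:
            factor *= 1.0 - dividends[i] / closes[i - 1]
    return adjusted


split_only = adjust(raw_close, splits)                 # splits only
split_and_div = adjust(raw_close, splits, dividends)   # splits and dividends
```

The two series agree on the latest bar but drift apart on earlier bars, so anything that compares a current price with a past price (a moving average, a breakout level) can produce different signals depending on which series it is fed.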

Dave, the Quantopian dataset isn't something we can share. We have the right to use the data, but we do not have the right to share it.

Yahoo's data uses different adjustment methods for dividends, which is a common reason the two data sets differ.

Yahoo's data also has survivorship issues - they don't provide data on companies that have gone bankrupt.

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Dan, can you recommend a decent data provider for 1-minute bars?

Sorry, I've never looked into it. I'm always in Quantopian, so I've never done the search for alternate data sources.

In his new book, Ernest Chan mentions CSI, Kibot, and TickData.

Also, you can get a month of 5 minute bars for free from Stooq.

Thanks Dennis and Dan. It would be great if your data provider was willing to give discounts to users of Quantopian.
CSI doesn't have anything finer than daily bars for any decent time frame. Kibot doesn't seem to have delistings or viable corporate actions.
TickData is good, but it costs $50k and you need to use their TickWrite software unless you are super important.

We definitely perceive the data we provide on the website as one of the reasons that people will use Quantopian rather than Zipline.

I should have asked earlier. I'm curious, what is it that makes you prefer Zipline? There are a lot of good answers to the question, and I'd like to understand your particular case.

I have historically traded a much bigger universe than Quantopian allows. I was looking to run something on the Russell 1000 plus liquid ETFs plus the most liquid ADRs.

I will echo Michael's statement, though possibly for a different reason.

I don't have a separate 'discovery' pipeline setup yet. So ideally I'd like to be able to process thousands of stocks in my algo as a kind of real-time screening algorithm.

OK, thanks for that feedback. I like the answers particularly because they are things we plan on providing in Quantopian in the long run.

Fabian Braennstroem

Jun 21, 2013

Hello Dan,

For me the reason is that I am using an Emacs environment with Vim keybindings, which lets me handle my code and load different modules more easily.
An easier transfer of code between Quantopian and Zipline would be helpful.
Maybe I am a bit special with this Emacs environment, but I expect that others are used to complex programming environments as well.

Best Regards
Fabian

Dave Gilbert

Jun 21, 2013

Hi Dan,

I find debugging much easier using a Python IDE with a debugger. Hence Zipline and PyScripter or Eclipse PyDev or IPython for me. Print/log.info helps in Quantopian, but for more complex algorithms a debugger wins hands-down. Any plans to spruce up Quantopian's debugging capabilities in the future?

Cheers,
Dave

Jun 21, 2013

Fabian and Dave -

Thanks for the feedback. I've heard both of those answers before but it's good to get reinforcement.

We've been kicking around a few solutions to the Emacs problem. Perhaps something like connecting to GitHub: you can use whatever IDE you want, save to GitHub, and GitHub writes to Quantopian. It gets pretty powerful in that way.

The debugger is harder, but is also on the list. We have to keep the debugger on our server in order to keep the data on our servers. Sometimes we swear at our IDE like a sailor because of the debugging. We see the need.

I concur on the IDE and the debugging. I also need to incorporate futures, currencies and equ which I either bought or recorded. Also, I usually start investigating ideas in IPython before implementing them.

My other immediate need is to run many identical instances to parameterize a model. I use several parameters in my main equity model and like to generate a smoothed surface of how performance differs across the parameter values.
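A minimal sketch of that kind of sweep, in plain Python. Everything here is an illustrative assumption: `run_backtest` is a toy stand-in (a moving-average rule on a synthetic price series), not a Zipline call, and the window values are arbitrary. In practice the body of `run_backtest` would be replaced by one full backtest run per parameter set.

```python
import itertools
import math


def make_prices(n=250):
    """Deterministic synthetic price series (a trend plus a sine wave)."""
    return [100.0 + 0.05 * t + 5.0 * math.sin(t / 10.0) for t in range(n)]


def sma(values, window, end):
    """Simple moving average of the `window` values ending at index `end`."""
    return sum(values[end - window + 1:end + 1]) / window


def run_backtest(prices, short_window, long_window):
    """Toy stand-in for a real backtest: long when short SMA > long SMA.

    Returns the total return of the strategy as a fraction.
    """
    equity = 1.0
    for t in range(long_window - 1, len(prices) - 1):
        # Decide with data up to bar t, earn the return from t to t + 1.
        if sma(prices, short_window, t) > sma(prices, long_window, t):
            equity *= prices[t + 1] / prices[t]
    return equity - 1.0


prices = make_prices()
short_windows = [5, 10, 20]
long_windows = [30, 50, 100]

# One backtest per parameter combination; the dict is the raw "surface".
surface = {
    (short_w, long_w): run_backtest(prices, short_w, long_w)
    for short_w, long_w in itertools.product(short_windows, long_windows)
}

best_params = max(surface, key=surface.get)
```

The `surface` dict can then be smoothed (for example by averaging neighbouring cells) or plotted. Because each run is independent, the loop is also easy to spread across processes or machines.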

Dave Gilbert

Dan,

Just to reinforce what Michael said, strategy optimization by looping through different share universes and portfolios, as well as strategy parameters is easily achieved with zipline. If you haven't already, you should possibly consider allowing something like this with Quantopian, though I concede that this could severely load your servers, if not strictly controlled.

The truth is that it's not too difficult to move a strategy from zipline to Quantopian, so you can have the best of both worlds. Perhaps a simpler (i.e. more automated) way of doing this would be a good solution to both the debugger and optimization issues.

Dave

Grant Kiehne

Hi Dan & all,

I've posed the question before, but does anyone understand, fundamentally, why access to a clean and complete data set is expensive in the first place? In other words, we wouldn't be having this discussion if the data were free or dirt cheap.

A few thoughts and questions:

1. The Quantopian data set is updated daily, which suggests that the process for collecting and cleaning the data is fully automated. And in this age of "big data" infrastructure, the price should be dropping, right? So what's the story here?
2. My sense is that it would be in the best interest of the industry to subsidize the distribution of clean and complete data to retail traders. In other words, take a cue from Quantopian and expand the market by removing a barrier to entry. Is there any movement in this direction?
3. Has Quantopian's data vendor published, in detail, how they collect and reduce the data? Perhaps it's not a big deal, and Quantopian could reverse-engineer their system to build up an independent data set.
4. Who is the Quantopian data vendor? Last I heard, Quantopian would need permission from the vendor to use their name. Have you requested permission? If so, what was the response?

Grant

@Grant, you can get 5 min bars from Stooq that go back a month or hourly bars that go back 3 months.

All you have to do is remember to visit the site every month or 3. Oh and get a separate list of dividends/splits. Plus make sure the data is correct by validating it against other data sources.

Kidding aside, I think data providers have every right to charge for their service. And having a solid paid data service is part of the Quantopian value-add. It's unreasonable to ask them to give away their edge. They are already improving zipline which is open source. Don't ask them to do everything for free.
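For the "validating it against other data sources" step, a rough sketch in plain Python might look like the following. The `compare_closes` helper, the 0.1% tolerance, and the sample prices are all made up for illustration; no particular vendor's file format is assumed.

```python
def compare_closes(source_a, source_b, rel_tol=0.001):
    """Compare two {date: close} mappings.

    Returns (mismatches, only_in_a, only_in_b), where mismatches lists the
    dates whose closes differ by more than `rel_tol` (relative difference).
    """
    common = sorted(set(source_a) & set(source_b))
    mismatches = [
        d for d in common
        if abs(source_a[d] - source_b[d]) > rel_tol * abs(source_b[d])
    ]
    only_in_a = sorted(set(source_a) - set(source_b))
    only_in_b = sorted(set(source_b) - set(source_a))
    return mismatches, only_in_a, only_in_b


# Made-up sample data keyed by ISO date strings.
vendor_a = {"2013-06-17": 50.00, "2013-06-18": 50.50, "2013-06-19": 51.00}
vendor_b = {"2013-06-17": 50.00, "2013-06-18": 49.10, "2013-06-20": 51.20}

mismatches, only_in_a, only_in_b = compare_closes(vendor_a, vendor_b)
```

A date that shows up in `mismatches` is often a sign that the two sources apply splits or dividends differently, and dates that appear in only one source point to missing bars.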

Grant Kiehne

Hello Dennis,

My basic point is that the data should be dirt cheap for Quantopian, since my assumption is that the whole collection and "cleaning" process is automated and highly efficient (but this assumption could be wrong). The fact that Stooq can provide free data not that different from Quantopian's suggests that I'm on the right track. I don't think that in the long run, the Quantopian value-add will end up being their data set (particularly since it ends up highly limiting the tools that can be applied to the problem...hence the use of Zipline offline by some members).

Grant

Jun 23, 2013

Grant, there are a lot of exchange fees around market data. I used to parse recorded Level 2 data, and I promise you it's very easy (less than a week to build a parser).

I am not asking for free market data, but the price just keeps going up. I just wanted a trustworthy source. Historical market data is not Quantopian's value add.

Quantopian's value add will be building a platform for trading: including other inputs, getting other data sources at reduced rates, etc. Actually supporting live trading is huge, as live trading is a constant task and things will go wrong.

Grant Kiehne

Jun 23, 2013

Michael,

Interesting to hear that "the price just keeps going up." Intuitively, not what I'd expect, but perhaps it is a kind of monopoly situation with high demand.

Grant

Jun 23, 2013

That is exactly the problem. It's exchange fees, and there is no way to get around them.

For stuff like this, ECNs will end up publishing or giving out intraday data in exchange for trying to produce more flow.

Jun 24, 2013

Michael and Dave, thanks for the thoughts. It matches some ideas I've got in my head.

Grant, it helps to look at the data problem in more than one part. The first part is historical data. There are a limited number of people who were recording stock data in, say, 2005 - you'd be amazed at how many companies just threw it away because they thought it had no value. There are relatively few players out there who have all the data. They have no interest in giving it away when they can charge for it instead.

The next part is live data. That's controlled by the exchanges, and you can only get access to it if you pay for it. (I know there are a few ways to scrape that data, but they're not reliable enough to trade on - at least not that I've seen).

Another part is data granularity. There are lots of uses for data with daily granularity: understanding the value of portfolios and uses like that. But there aren't a lot of uses for minute-level or tick-level data (at least, not until Quantopian came along!). There's not a lot of incentive for people to invest in creating that kind of data source because there isn't a lot of use in knowing what your IRA was worth at 2:52PM on March 22 - who cares?

Maybe in the future we'll come up with some hybrid model of 1) old data we paid for 2) medium-term data collected and distributed in an open source and 3) live data that we pay for. For the moment, the amount of data in (2) isn't big enough to be useful for us, especially when you can't do live trading without (3).

Jun 27, 2013

Dan, a lot of people lost their historical recordings due to data agreements. Companies change, teams move, etc. Everyone I know has had a live collector for years.

It may also be worth building a CACS and fundamental data product that people can buy/enable in Quantopian. Bloomberg backoffice is a really expensive competitor product.

Saravanan Shanmugham

Jul 11, 2013

Let me add my thoughts to this thread.
Over the last week I have spent quite a bit of time working on both Quantopian and Zipline.
I have been an avid Python developer, machine learning hobbyist and quant system developer for the past 10 years.
Trading Solutions, a product by Neurodimensions Inc., was my application of choice so far.
But if everything goes well, Quantopian will be my platform of the future, and you guys have made that choice easy for me.

Let me tell you what I like about Quantopian.
1. I like everything about running my already developed algorithm in Quantopian.
2. The fact that it's in the cloud and I don't have to worry about maintaining my servers, downtime, etc.
3. I very much like the fact that it's Python-based, and the toolset, like pandas, numpy, etc.
4. I like that the data source is part of the service. I would certainly like to see more types of data as part of the service.
5. I like the web as a way to monitor, start and stop my algorithms, etc.
6. I very much like the idea of the community and sharing of algorithms.
7. I like the idea of limiting the algorithm to see only past data for trading, which removes the possibility of me making an error and getting misled by dubious signals that look ahead accidentally.

Here is a list of things I don't like or would love to see happen.
1. I will never develop new algorithms using the Quantopian web interface, at least not for anything other than toy problems. That would be my absolute last choice. Right now I develop using Zipline (thank you very much for open sourcing it) and then manually port it to Quantopian.
2. I would really love for Zipline to be more fully functional, as in at least having an IPython/pylab/matplotlib-based equivalent of the backtester results, statistics, plot diagrams, etc. It's sort of what I am doing right now, but visualizing the backtesting results is a bit of a pain for now, for lack of a standard GUI results window. I am trying to hack one right now. I will try to push it back when I have something that I like.
3. When done developing algorithms, I would like a simple script/button with which I can push/pull/update the algorithm between Zipline and Quantopian.
4. I would love to have access to the same data from Zipline, to allow me to develop the models on my desktop and not be forced to go to the web just for the data. I don't even need very current data; equivalent data that is, say, a month out of date would be fine, just for development/ML-training/backtesting purposes.
5. Developing the model can take quite a bit of iteration and computation power if you are running machine learning algorithms. I would prefer to do it on the desktop or be able to buy EC2-style compute power through Quantopian. At the least, I would like to develop and trial new algorithms on my desktop with access to historical data that may be up to a month out of date, and then push and run them in the cloud with more current data.
6. Once models are developed they may need to be retrained periodically, and I would love to see this done in the cloud itself. The way it is currently, I don't see how it can be done. batch_transform works for producing signals from trained ML models, but is not the best way to train the ML models themselves. Ideally there would be an API that allows adaptive/learning code to run after market close or during weekends, at specific intervals or on specific conditions (say, when the model might be losing touch with reality, or when training win/loss ratios or E(profit/loss) hit critical limits). It would be helpful to have access to future data for training/optimization purposes.
7. I would really like to see a way of building/displaying/manipulating training/ideal signals that have visibility into the future to identify ideal buy/sell/hold points, and to be able to "DISPLAY" them on the graphs. This is also what is used for training ML models. Trading Solutions had it and it was awesome.
8. Though I ask for 7, I would like it to be clearly quarantined through a special decorator, so that it is only applied to a quarantined learning/adaptive stage of the algorithm and not to active trading.
9. The ability to share code between my algorithms in the form of a shared library that I can import from within my algorithm. But I could live with having that ability just on my desktop, where I can organize code as libraries, and the Quantopian push/pull functionality could combine them before pushing to the cloud.

My 2 cents.
Sarvi

Jul 11, 2013

Great list Sarvi. A couple things ...

You can get 5 minute data from Stooq for your offline prototyping work. Zipline can already get daily historical prices from Yahoo.

There was another thread where Dan suggested they would like to have github integration. That would allow you to develop code offline and 'update' your Quantopian algorithm very easily.

Saravanan Shanmugham

Jul 11, 2013

Yeah, I have the Stooq data.
There is not enough history in the 5-minute/1-hour bars for actual training, but it is nevertheless useful for development.

I missed one above. Updated with 9.