Upload Your Custom Datasets and Signals with Self-Serve Data

Starting today, you can include custom datasets or signals in your algorithm for use in the contest and by extension, the funding process.

Self-Serve Data provides you the ability to upload your own time-series data to Quantopian and access it in research and the IDE directly via Pipeline. You can upload historical data as well as live-update your data on a regular basis via FTP, Google Sheets, or Dropbox.

Important Concepts

Your custom data will be processed similarly to Quantopian Partner Data. To represent your data accurately in Pipeline and avoid lookahead bias, it will be collected, stored, and surfaced point-in-time. You can learn more about our process for working with point-in-time data in our forum post How is it Collected, Processed, and Surfaced? A more detailed description, Three Dimensional Time Working with Alternative Data, is also available.
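
To make the point-in-time idea concrete, here is a minimal sketch (not Quantopian's actual implementation). It assumes each record carries the date the value describes and the time the value became known; the field names `asof_date` and `timestamp`, the sample values, and the helper `known_as_of` are illustrative assumptions.

```python
from datetime import datetime

# Hypothetical records: the date a value describes (asof_date), the time it
# became known (timestamp), and the value itself.
records = [
    {"asof_date": "2018-01-02", "timestamp": "2018-01-03 08:00", "value": 1.0},
    {"asof_date": "2018-01-02", "timestamp": "2018-01-10 08:00", "value": 1.5},  # later revision
    {"asof_date": "2018-01-03", "timestamp": "2018-01-04 08:00", "value": 2.0},
]

TIME_FORMAT = "%Y-%m-%d %H:%M"


def known_as_of(records, sim_time):
    """Return, per asof_date, the latest value learned at or before sim_time."""
    cutoff = datetime.strptime(sim_time, TIME_FORMAT)
    view = {}
    # Process records in the order they were learned so revisions overwrite
    # earlier values, but only if they were known by the cutoff.
    for row in sorted(records, key=lambda r: datetime.strptime(r["timestamp"], TIME_FORMAT)):
        if datetime.strptime(row["timestamp"], TIME_FORMAT) <= cutoff:
            view[row["asof_date"]] = row["value"]
    return view
```

A simulation running on 2018-01-05 sees the original value for 2018-01-02; only a simulation running after the revision arrived on 2018-01-10 sees the revised value. That is how lookahead bias is avoided.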

Once on the platform, your datasets can be imported and viewed only by you. If you share an algorithm or notebook that uses one of your private datasets, other community members won't be able to import the dataset. Your dataset will be downloaded and stored on Quantopian-maintained servers, where it is encrypted at rest. Self-Serve Data is considered "Private Content" in Quantopian's Terms of Use.

Example Notebooks

The Introduction to Self-Serve Data notebook (attached below) will show you how to format your data, upload it via Self-Serve Data and then import and use it in Pipeline.

The Self-Serve Data - How does it work? notebook (in a comment below) will explain how your data is processed, cover some considerations when creating your dataset, and show how to check and monitor your dataset loads.

Additional Details

Our Self-Serve Data Help documents how to prepare, upload, access and monitor the loads of your data.

Initially, Self-Serve Data will support reduced custom datasets in CSV format that have a single row of data per asset per day, which fits naturally into the expected Pipeline format. Records that cannot be symbol-mapped to assets in the Quantopian US equities database will be skipped. Optional live update files will be downloaded each trading day between 07:00 and 10:00 UTC and compared against existing dataset records. Brand-new records will be added to the base tables, and updates will be added to the deltas table, following the Partner Data Processing logic.
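
As a rough illustration of the format and update logic described above, the sketch below checks that a CSV has one row per asset per day and splits a live update file into brand-new records and changed values. It is not Quantopian's processing code. The column names `date`, `symbol`, and `signal`, the sample rows, and the helpers `load_rows` and `split_update` are assumptions made for the example.

```python
import csv
import io


def load_rows(csv_text):
    """Parse CSV text into a dict keyed by (date, symbol); reject duplicate keys."""
    rows = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["date"], row["symbol"])
        if key in rows:
            # The expected format allows only one row per asset per day.
            raise ValueError("more than one row for %s on %s" % (key[1], key[0]))
        rows[key] = row["signal"]
    return rows


def split_update(existing, update):
    """Separate an update into brand-new records and changes to known records."""
    new_records = {k: v for k, v in update.items() if k not in existing}
    deltas = {k: v for k, v in update.items() if k in existing and existing[k] != v}
    return new_records, deltas


# Hypothetical historical file and live update file.
historical = "date,symbol,signal\n2018-01-02,AAPL,0.5\n2018-01-02,MSFT,-0.2\n"
live = "date,symbol,signal\n2018-01-02,AAPL,0.7\n2018-01-03,AAPL,0.1\n"

new_records, deltas = split_update(load_rows(historical), load_rows(live))
```

Here the 2018-01-03 row is a new record, while the changed 2018-01-02 value for AAPL is a delta. Running a check like this locally before uploading can catch duplicate rows early.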

We are always interested in learning more about your use cases and potential datasets. You can help us by filling out the Self-Serve Data survey.

Update: Learn more in our new post, Analyzing a Signal and Creating a Contest Algorithm with Self-Serve Data, which includes a Self-Serve Data pipeline example and a template algorithm for incorporating your data, and shows you how to analyze your data using Alphalens.

If you have any questions or issues please contact us at [email protected]

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

17 responses

Chris Myles

Here is the Self-Serve Data - How does it work? notebook, which explains how your data is processed, covers some considerations when creating your dataset, and shows how to check and monitor your dataset loads.

Karl Mun

Excellent! This opens up a whole new dimension to designing algorithms and IP regime management.

James Villa

This is a step in the right direction. I tried Self-Serve Data and it worked great. I uploaded some deep neural network predictive signals created offline, added them as an alpha factor to the factors of my best ML algo, and it did improve performance overall. Here's the notebook:

Joakim Arvidsson (Cream Mongoose)

@James,

Very impressive profitability hit rate of 54%! Is this because of the ML factor, do you reckon?

James Villa

@JA,

Thanks. Yes, I think so. I uploaded my predictive signals on SPY, which were generated offline using a deep neural network, combined them with other alpha factors, and processed the result with another ML algo within the Q framework. More details are in this thread, if you're interested in the process: DNN and Beyond

Grant Kiehne

Hi Chris -

Some questions/comments:

1. What is your recommendation in reconciling point-in-time the off-line universe of symbols with the mapping to Quantopian SID? It would seem that some sort of daily price OHLCV correlation analysis would be required, to ensure that the off-line symbol is a true match.
2. Do you have a recommended (and hopefully vetted) source of free off-line daily OHLCV data? Is there any database that Quantopian has checked and reconciled? Or are we on our own on this one?
3. One use case would be to basically run an entire algo off-line and just upload the resulting alpha vector, for input to the Optimize API. However, for certain types of algos, knowing the relevant point-in-time universe is important. Any thoughts on this? Presumably, you don't plan to offer the QTradableStocksUS as a download, but this would seem to limit the usefulness of Self-Serve Data.
4. The CSV file format is fine for starters, but I'd think down the road, you'd want to support a compressed file format, too.
5. I'd recommend supporting upload file encryption (I don't mean the https stuff, but encryption of the file itself).
6. There should be support for creating files for "upload" from within the Quantopian research platform, including automated running of a daily script to update the file.

Karl Mun

Point 6 is an interesting one, Grant. If one could generate alphas using the QTU and upload the custom dataset from the research notebook, it would go some way toward resolving points 3 and 5 as well.

Chris Myles

@James ... awesome, glad it worked out for you. Extra thanks for updating your previous post to show how to leverage self-serve data.

@Grant Will you create a separate post specific to off-platform ML (via self-serve)? We expect that the Q community will come up with 100s of potential use cases for the Self-Serve Data feature, and I want to make sure that this post isn't dominated by discussion of a single (ML) use case.

We will be rolling out more examples and details, including considerations for symbology. Note: we've created a self-serve data tag to make it easier to tie the posts together.

1 and 2) To clarify, we don't expect you to map off-line symbols to SIDs. That will be handled by self-serve processing, which uses the ticker as of the primary date for each record. Self-Serve leverages the same symbology mapping we use for our data vendors. In the future we will be rolling out more options for symbol mapping specific to self-serve data.

3, 4, and 5) Noted.

6) We are working on a new API that will allow you to reduce existing data and combine datasets within the Quantopian platform that should help cover some of the current pain-points. As plans stabilize, we'll share them with the community and see if there are additional pain-points that could leverage self-serve too.
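
To illustrate the "ticker as of the primary date" mapping mentioned in the answer to points 1 and 2, here is a minimal sketch. It is not Quantopian's symbology code: the table `SYMBOLOGY`, its date ranges and asset ids, and the helper `asset_for` are all hypothetical.

```python
from datetime import date

# Hypothetical symbology table: (ticker, first_date, last_date, asset_id).
SYMBOLOGY = [
    ("ABC", date(2010, 1, 1), date(2015, 6, 30), 101),
    ("ABC", date(2015, 7, 1), date(2018, 12, 31), 202),  # ticker reused by another company
]


def asset_for(ticker, on_date, table=SYMBOLOGY):
    """Return the asset id that held `ticker` on `on_date`, or None if unmapped."""
    for sym, start, end, asset_id in table:
        if sym == ticker and start <= on_date <= end:
            return asset_id
    return None
```

The same ticker resolves to different assets depending on the record's date, and a ticker with no match returns `None`, which corresponds to a record that would be skipped.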

Grant Kiehne

@ Chris -

Will you create a separate post specific to off-platform ML (via self-serve)? We expect that the Q community will come up with 100s of potential use cases for the Self-Serve Data feature, and I want to make sure that this post isn't dominated by discussion of a single (ML) use case.

I removed my reference to ML. This would seem to be an important use case, so if you or someone else at Q (Thomas W.?) were to start a separate post, with your plans to support it via the Self-Serve Data feature, I'm sure it would be of interest to the community.

Chris Myles

@ Grant - There is no need to remove your reference to ML. As you mentioned, it will be an important use case that can be supported via Self-Serve Data. It just seemed like half of your questions were specific to the off-platform data generation details (vs ingestion), which seems like a great topic for another post. BTW I think some of the existing community members may have more up-to-date experience with the off-platform ML specifics than we do, but I'll make sure it is in the queue of Self-Serve Data topics to cover.

Bala Vignesh

Excellent stuff..

Lucy Wu

To learn more about the Self-Serve workflow, check out the Data Processing and Alphalens Study example notebooks. The Data Processing notebook demonstrates the steps necessary to transform a raw dataset into a dataset suitable for upload to Self-Serve; the Alphalens Study notebook demonstrates how you can upload your Self-Serve data and analyze it in the research environment using pipeline/Alphalens.

Grant Kiehne

Starting today, you can include custom datasets or signals in your algorithm for use in the contest and by extension, the allocation process.

Hi Chris,

Thinking this through, in the end, you could have ~$10M of capital running on a Raspberry Pi in a mud hut in the middle of nowhere, powered off of a battery charged by a solar panel (probably not likely, but I'm illustrating with a potential extreme case). For algos that use the Self-Serve Data API, and receive an allocation, will there be any kind of infrastructure compliance requirements? Obviously, it would be in the best interest of the quant to have a solid system, but I'm curious if there will be any specific requirements, and how might they be audited?

Chris Myles

@Grant Since our allocation process includes a 6-month out-of-sample evaluation period, we will be able to monitor the dataset's historical load_metrics for errors, unexpected deltas, and gaps in data delivery during that period.
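
One of the checks mentioned above, gaps in data delivery, can be sketched as follows. This is an illustrative example only and does not read the real load_metrics dataset: it assumes you already have a list of dates on which a load succeeded, uses weekdays as a stand-in for trading days (market holidays are ignored), and the helper `missing_weekdays` is hypothetical.

```python
from datetime import date, timedelta


def missing_weekdays(load_dates):
    """Return weekdays between the first and last load date that have no load."""
    seen = set(load_dates)
    day, last = min(seen), max(seen)
    gaps = []
    while day <= last:
        # weekday() < 5 means Monday through Friday; holidays are not handled.
        if day.weekday() < 5 and day not in seen:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Hypothetical successful load dates: Thursday 2018-01-04 is missing.
loads = [date(2018, 1, 2), date(2018, 1, 3), date(2018, 1, 5), date(2018, 1, 8)]
gaps = missing_weekdays(loads)
```

Running a check like this on your own delivery history is one way to spot connectivity or scheduling problems before they show up during an evaluation period.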

Grant Kiehne

Hi Chris -

I'd recommend continuing to noodle on the issue of infrastructure a bit more. It is not obvious from a variety of angles how this will work for live, real-money trading. For example, say a significant fraction of your 195,000 users receive an allocation (say 1,000), and 20% of them use the Self-Serve Data API. You'll have 200 potentially unique offline implementations to manage (e.g. connectivity glitches, power outages, PC crashes, viruses, hackers, etc.). It would seem that Quantopian would want to support some level of infrastructure, in the end. Personally, for offline factor computation, I'd much prefer starting with a Quantopian-approved/hosted instance (and would be willing to pay a small, competitive monthly fee for it...and if it supported real-money trading, even better). Maybe Unhedged or Alpaca would fill this gap? What are your thoughts?

James Villa

@Grant,

These are good and important points you raised regarding live implementation. It would be ideal if we could all do these things within the Q infrastructure and framework. The onus is on Q to provide these capabilities to their "worker bees" if they want to advance by leaps and bounds, as competition in this space is only going to get tougher. And "worker bees" will flock to the best environment out there.

Robert Petteruti

Learn more about analyzing signals and preparing an algorithm for the daily Quantopian Contest with Self-Serve Data in this forum post.
The post contains the following:
1. A notebook that loads an example dataset into Pipeline and analyzes it using Alphalens.
2. A template algorithm to clone and build off of with your uploaded data. Follow the TO-DOs throughout the template as you develop the algorithm for the contest. A backtest of the template using an example dataset is also attached.

Also, watch this recently-released video, Upload Your Custom Datasets with Self-Serve Data, for another high-level walkthrough of the tool.
