Alpha Testers Wanted - Upload Your Own Data to Pipeline

With the My Data feature you can upload your own data to Quantopian and access it in research and the IDE via Pipeline. You’ll be able to add historical data as well as live-updating data on a nightly basis.

Your custom data will be processed similarly to Quantopian Partner Data. In order to accurately represent your data in Pipeline and avoid lookahead bias, your data will be collected, stored, and surfaced in a point-in-time fashion. You can learn more about our process for working with point-in-time data here: How is it Collected, Processed, and Surfaced? A more detailed description, Three Dimensional Time: Working with Alternative Data, is also available.
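As a toy illustration of the idea (the column names below are only illustrative, not a spec), a value that is revised after the fact only becomes visible once the timestamp at which it was captured has passed:

import pandas as pd

# Two records for the same as-of date: the original capture and a later revision.
records = pd.DataFrame({
    'asof_date': pd.to_datetime(['2014-01-02', '2014-01-02']),
    'timestamp': pd.to_datetime(['2014-01-03', '2014-02-15']),  # when each value was captured
    'signal1':   [1225.0, 1230.0],
})

# On a given simulation date, only values captured on or before that date are visible.
sim_date = pd.Timestamp('2014-01-10')
print(records[records['timestamp'] <= sim_date])  # the February revision is not yet known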

Once on the platform, your datasets are only importable and visible to you. If you share an algorithm or notebook that uses one of your private datasets, other community members won't be able to import the dataset.


Dataset Upload Format

Initially, My Data will support “reduced” custom datasets. A reduced dataset has no more than one row of data per asset per day. This format fits naturally into the expected pipeline format. Data should be in .csv (comma-separated values) format with a minimum of three columns: one primary date column, one primary asset column, and one or more value columns. The value column(s) contain the data that you want to use in your research or algorithm.

During the historical upload process, you will need to map your data columns into one of the data formats below:

  • Numeric - Numeric values. Note that these will be converted to floats to maximize precision in future calculations.
  • String - Textual values that do not need further conversion; often titles or categories.
  • Datetime - A date or a date with a timestamp; an additional date column that can be examined within the algorithm or notebook.
  • Bool - True or False values.

Two Data Column CSV Example:

Below is a very simple example of a .csv file that can be uploaded to Quantopian. The first row of the file must be a header, and columns are separated by the comma character (,).

date,symbol,signal1,signal2  
1/1/14,TIF,1204.5,0  
1/2/14,TIF,1225,0.5  
1/3/14,TIF,1234.5,0  
1/6/14,TIF,1246.3,0.5  
1/7/14,TIF,1227.5,0  
…..
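Before uploading, a quick local check of the “reduced” constraint can save a failed import. This sketch assumes the file and column names from the sample above:

import pandas as pd

# Load the candidate upload file and look for duplicate (date, symbol) pairs.
df = pd.read_csv('my_dataset.csv', parse_dates=['date'])
dupes = df.duplicated(subset=['date', 'symbol'], keep=False)

if dupes.any():
    print(df[dupes])  # rows that violate the one-row-per-asset-per-day rule
else:
    print('Looks reduced: %d rows, %d assets' % (len(df), df['symbol'].nunique()))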

We will be adding members to our closed alpha shortly. Please tell us more about your data and your use case here.

If you have any questions, feel free to contact us at [email protected]

Disclaimer

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

14 responses

Thanks for this update! This will be super useful for in-house projects.

Hi Chris -

Is there any way you could provide the historical and current point-in-time symbols for the QTradableStocksUS as a download (e.g. from an FTP site)? This way, any off-line data gathering/analysis could be limited to the relevant stocks for your current needs. Or would this run afoul of licensing agreements you have with data providers (even though you'd just be providing the symbols)?

Another question: I don't understand the idea of using symbols in the uploaded file when the Q system uses SIDs. Could you explain? For a given symbol, point-in-time, how would one know what SID it maps to? Or maybe I don't understand the symbol versus SID thing?

Hi Grant,

We maintain internal point-in-time references for our symbology (symbol to SID) so that we can easily map external historical data. In this case, we use the primary date column to map the symbol name to its proper SID. This is the same methodology we use for other data vendors who do not have access to our internal SIDs.
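As a toy illustration of why the primary date matters (the SIDs and date ranges below are made up, not our real mappings), a ticker can refer to different companies at different points in time:

import pandas as pd

# Hypothetical point-in-time symbol map: which SID a ticker referred to over a date range.
symbol_map = pd.DataFrame({
    'symbol':     ['TIF', 'ABC', 'ABC'],
    'sid':        [1001, 2002, 3003],
    'start_date': pd.to_datetime(['1987-05-15', '2001-01-02', '2010-06-01']),
    'end_date':   pd.to_datetime(['2099-01-01', '2010-05-31', '2099-01-01']),
})

def map_sid(symbol, as_of):
    """Return the SID that `symbol` referred to on `as_of`, or None if unknown."""
    as_of = pd.Timestamp(as_of)
    hits = symbol_map[(symbol_map['symbol'] == symbol) &
                      (symbol_map['start_date'] <= as_of) &
                      (symbol_map['end_date'] >= as_of)]
    return int(hits['sid'].iloc[0]) if len(hits) else None

print(map_sid('ABC', '2005-03-01'))  # 2002
print(map_sid('ABC', '2015-03-01'))  # 3003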

The historical point-in-time QTradableStocksUS download idea is interesting and something we can review for the future roadmap. For now, we suggest gathering and importing a larger dataset and filtering it down to the QTradableStocksUS via pipeline if necessary.
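As a rough sketch of that filtering (the commented-out dataset import is a placeholder, since the exact module path for a private dataset comes from the upload flow, and the signal1 column name simply mirrors the sample CSV above):

from quantopian.pipeline import Pipeline
from quantopian.pipeline.filters import QTradableStocksUS
# from quantopian.pipeline.data.<your_namespace> import my_dataset  # placeholder import

def make_pipeline():
    universe = QTradableStocksUS()
    return Pipeline(
        columns={
            # 'signal1': my_dataset.signal1.latest,  # a column from your uploaded dataset
        },
        screen=universe,  # assets outside the QTradableStocksUS are dropped
    )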

As we get more adoption and usage, we'll work with the alpha community to identify which features and ideas would be most valuable to focus upon next.

Thanks Chris,

A few questions:

  1. What happens if you can't map the symbol name to its proper point-in-time SID? Will the code crash, or will that symbol be ignored? If the latter, how would one know that there was a name-SID mismatch? Or is there notification (e.g. via automated e-mail)?
  2. I'm not clear on why you would not want to provide the point-in-time QTradableStocksUS symbols (and perhaps SIDs). Essentially, with access to an external source of data, I can envision the many users who have been interested in ML/deep learning being able to write algos for you to evaluate for the contest and fund. Without the base universe, it seems like offline algos would be less effective. It's your call, but it would seem very straightforward to put the point-in-time QTradableStocksUS symbols out on a server for download. But I guess if there is limited interest, then why bother. I'd be curious what other users think of this idea...
  3. I'm confused about what you are trying to test here. Your questionnaire seems to be geared toward users uploading valid data sets for trading, versus just cooking something up to test the limits of the system. It seems like thorough testing of the system would be best done with simulated data sets, checking various corner cases. Do you want software testing, or are you doing a survey to determine how much interest there might be? If the former, are there cases your engineers would like to see tested?

Another comment:

I'd think that at a basic level, the hardware/software functionality is similar to that used for your data vendors. Or is this a different system? It would seem that you'd just have users/quants and vendors use the same API, right? You've had the API running for vendors for a long time, so supposing you are opening that API up to users, what remains to be tested that you haven't already checked with your vendors? Just trying to follow...

Hi Grant,

Let me start with your last questions: the new functionality is similar to the process we use with data vendors, with one very big difference: we've introduced a new user interface to upload historical data, define the data types for each column, and set up live uploads. With previous data vendors, we chose established vendors and datasets, analyzed the data, defined the data types, and managed both the historical and ongoing live update process.

Note: the goal of the alpha is not to stress test the software; we've done lots of internal testing with all the possible variations we can think of. We want to make sure a Quantopian community member can accomplish all of the above steps in a self-serve fashion. We may need to add more training material, example notebooks, or additional features to support all the use cases and datasets that the community is interested in uploading.

To answer your specific questions:
1. The symbol will be ignored. We provide additional load metrics for each dataset, so that you can determine if any input rows have been ignored.
2. We weigh our roadmap priorities based on overall community impact, not how easy or straightforward features are to implement. For ML/deep learning, we have seen some success with various signals generated off platform that are integrated into a "basic" optimize algorithm that manages the universe and risk constraints.
3. See my comments above re: what we are trying to test. The questionnaire also includes some general questions that may help guide future priorities. For example, if the main blocker is that the community doesn't have access to an FTP server (like a standard data vendor would), we need to address that sooner rather than later. If the common use case is that people want to collaborate on or share a dataset, we can prioritize that. We're also interested in understanding the types of non-standard/uncommon data sets people are interested in exploring.

I encourage anyone who is interested in data on Quantopian to fill out the questionnaire, even if you don't have a target dataset or use case in mind (yet).

If possible, I'd like to be able to import pandas_datareader to funnel FRED datasets directly into pipelines instead of having to use Quandl as a middleman or having to upload my data manually as a CSV.

For example, let's say I read the risk factors section of the 10K filed by Starbucks and identified data about arabica prices and fluid milk prices as potential signals relevant to that stock and want to import that data from FRED instead of commodity prices for some reason.

Increases in the cost of high-quality arabica coffee beans or other commodities or decreases in the availability of high quality arabica coffee beans or other commodities could have an adverse impact on our business and financial results

We purchase, roast and sell high-quality whole bean arabica coffee beans and related coffee products. The price of coffee is subject to significant volatility and has and may again increase significantly due to one or more of the factors described below. The high-quality arabica coffee of the quality we seek tends to trade on a negotiated basis at a premium above the “C” price. This premium depends upon the supply and demand at the time of purchase and the amount of the premium can vary significantly. Increases in the “C” coffee commodity price do increase the price of high-quality arabica coffee and also impact our ability to enter into fixed-price purchase commitments. We frequently enter into supply contracts whereby the quality, quantity, delivery period, and other negotiated terms are agreed upon, but the date, and therefore price, at which the base “C” coffee commodity price component will be fixed has not yet been established. These are known as price-to-be-fixed contracts. The supply and price of coffee we purchase can also be affected by multiple factors in the producing countries, such as weather (including the potential effects of climate change), natural disasters, crop disease, general increase in farm inputs and costs of production, inventory levels and political and economic conditions, as well as the actions of certain organizations and associations that have historically attempted to influence prices of green coffee through agreements establishing export quotas or by restricting coffee supplies. Speculative trading in coffee commodities can also influence coffee prices. Because of the significance of coffee beans to our operations, combined with our ability to only partially mitigate future price risk through purchasing practices and hedging activities, increases in the cost of high-quality arabica coffee beans could have an adverse impact on our profitability. In addition, if we are not able to purchase sufficient quantities of green coffee due to any of the above factors or to a worldwide or regional shortage, we may not be able to fulfill the demand for our coffee, which could have an adverse impact on our profitability.

We also purchase significant amounts of dairy products, particularly fluid milk, to support the needs of our company-operated retail stores. Additionally, and although less significant to our operations than coffee or dairy, other commodities, including but not limited to tea and those related to food and beverage inputs, such as cocoa, produce, baking ingredients, meats, eggs and energy, as well as the processing of these inputs, are important to our operations. Increases in the cost of dairy products and other commodities, or lack of availability, whether due to supply shortages, delays or interruptions in processing, or otherwise, especially in international markets, could have an adverse impact on our profitability.

import pandas_datareader.data as web  
import datetime

start = datetime.datetime(2010, 1, 1)  
end = datetime.datetime(2013, 1, 27)

arabica = 'PCOFFOTMUSDM' # Global price of Coffee, Other Mild Arabica  
fluid_milk = 'PCU3115113115111' # Producer Price Index by Industry: Fluid Milk Manufacturing: Fluid Milk and Cream, Bulk Sales

my_data = web.DataReader([arabica, fluid_milk], 'fred', start, end)  
# TODO: Put my custom data into a pipeline  

Thanks Chris...I'll put some input into the form when I get the chance.

Hi Adam, thanks for the pointer to pandas_datareader; we are always looking for new sources of data for the community. In order to avoid lookahead bias, we currently require point-in-time data capture for datasets that we surface in pipeline.

I took a quick look at the resulting my_data dataframe and it doesn't look like it supports point-in-time changes in the FRED datasets. If you take a look at the attached notebook you'll see the updates that we've captured since we started processing similar data from Quandl. The timestamp represents the datetime we captured the updated values.

The My Data roadmap does include future support for API downloads. This will help automate the live update process while capturing the point-in-time nature of the data. Since your source data is not changing frequently, I suggest you start with a historical upload to test your theories. It looks like the web.DataReader results surface NaN for missing values in the joined results (i.e. PCOFFOTMUSDM after 2017-06-01); my_data.fillna(method='ffill') will forward-fill previous values to match our expected format.
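Something like this sketch would get the frame into the expected upload shape. Tagging every row with SBUX and the output file/column names are just assumptions for illustration:

# Continuing from the my_data frame above.
upload = my_data.fillna(method='ffill').reset_index()
upload = upload.rename(columns={'DATE': 'date',
                                'PCOFFOTMUSDM': 'arabica_price',
                                'PCU3115113115111': 'milk_ppi'})
upload['symbol'] = 'SBUX'  # macro series aren't tied to one asset; pick the symbol(s) you trade

# Write out date, symbol, and value columns with a header row for the historical upload.
upload.to_csv('fred_signals.csv', index=False,
              columns=['date', 'symbol', 'arabica_price', 'milk_ppi'])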

Hi Chris -

Some more feedback:

  1. If you have a My Data roadmap, I recommend simply publishing it for comment.
  2. One use case would be to upload historical daily OHLCV bars for comparison with the Q data set, across all symbols (using the Q research platform). I don't have a sense for how much data this would be (e.g. going back to the start of your data, which I think extends to 2002). The idea would be to write a script to suss out gross discrepancies (see the sketch after this list) and bring the two data sets into alignment. This way, any offline computations using the offline data set could be done (e.g. using zipline/zipline-live) with a one-to-one correspondence to computations done on the Q platform. Would this be feasible?
  3. Why did you select an uncompressed CSV format? I'd think that a standard compressed/binary format would be preferred.
  4. "During the historical upload process, you will need to map your data columns into one of the data formats below" - Is the mapping done with a file header, or outside the file? It would seem that the mapping should be contained within the file. Using your example, the header would contain something like: date:datetime, symbol:string, signal1:numeric , signal2:numeric.
  5. I recommend publishing a spec (and perhaps the code) for precisely how you are doing the conversions, including the acceptable inputs. E.g. for Boolean, are you accepting only True and False, or 1 and 0 as well? How does one input inf and -inf? NaN? etc.
  6. On the questionnaire form, you say "Total number of historical records? Note: we currently have a 300MB file upload limit." Is this the limit for upload? Or the total storage space allocated per user? Also, upon upload are the files compressed? Or will a 300MB file consume 300MB on your server?
  7. For live trading, when would the data need to be uploaded for compatibility with your current system, relative to the market open?
  8. A compelling use case in my mind would be offline ML; however, I think for this to be generally effective, the stock universe is required (e.g. the QTradableStocksUS). Aside from ML, there may be other computations where being able to limit the universe to what is specifically required for the Q contest/allocations would be necessary. What is your perspective? Do you see a use case for offline ML?
  9. Assuming a future world where many data sets are readily available and free/low-cost, this new API could change the character of Q, in that users would simply do their analysis and run algos offline, and only upload the buy/sell signals. Is this a correct assessment? I'd be interested in your perspective on where you see this heading.
  10. From a quality control and regulatory standpoint, I'm kinda skeptical that this will fly in its current form for your hedge fund. I'd think you'd need something more robust from a technical standpoint, and auditable. Any thoughts?
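A rough sketch of the discrepancy check mentioned in item 2, assuming two local files of daily closes for a single symbol and an arbitrary 1% tolerance:

import pandas as pd

# Load close prices from the two sources, indexed by date.
q = pd.read_csv('q_prices.csv', parse_dates=['date'], index_col='date')
mine = pd.read_csv('my_prices.csv', parse_dates=['date'], index_col='date')

# Join on common dates and flag days where the sources disagree by more than 1%.
joined = q[['close']].join(mine[['close']], lsuffix='_q', rsuffix='_mine', how='inner')
joined['rel_diff'] = (joined['close_q'] - joined['close_mine']).abs() / joined['close_q']
print(joined[joined['rel_diff'] > 0.01])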

Hi Grant,

Thanks for your comments. Some interesting questions. At a broad level, we're looking for folks to try out this feature and provide feedback so we can learn from a diversity of opinions, needs, and experiences. Many of these speculative questions will be answered for the folks who try the feature out, and the answers might change as we learn from users' early experiences.

All the best,
Josh


I signed up a few days back. When can we expect to be able to try this feature out?

Hi Milap,

We will start rolling out access to select alpha users this week. Based on feedback and the style of target datasets, we will increase the rollout numbers over the next 2-3 weeks.