Live Custom Data -- What to expect?

Hi.
I read both of these well written explanations of self-serve data and live data.
https://www.quantopian.com/posts/quantopian-partner-data-how-is-it-collected-processed-and-surfaced
https://www.quantopian.com/help#self_serve_data

Alas, it's not 100% clear to me how my contest algos should expect to receive and process new (future) data when it becomes available. Perhaps it was the part about base tables and deltas that threw me off.

As an example, let's say I have a dataset set up correctly in my datasets. It's linked to a published Google Sheets .csv (as per the docs), and my algo knows how to grab the historical data using pipeline.

The dataset looks like this:

date,symbol,my_value
5/31/2018,GCP,1
5/31/2018,EXPE,2
5/31/2018,PMT,3
5/31/2018,DOW,4
5/31/2018,SHO,5
5/31/2018,INGR,6
5/31/2018,QGEN,7
5/31/2018,AWK,8

Now, my algo goes live in the contest. It's 06/25/2018. At the end of the month I update the data file with similar but different data.
For example:

date,symbol,my_value
6/30/2018,GCP,5
6/30/2018,EXPE,9
6/30/2018,IBM,3
6/30/2018,CAT,4
6/30/2018,TWTR,5
6/30/2018,INGR,1
6/30/2018,QGEN,17
6/30/2018,AWK,82

My first question is: how should the new data be arranged in the data file? Should the new data rows be appended to the bottom of the existing list? Can I just replace the old data with the new data? Does it matter?
On 07/01/2018 the algo ingests the updated data. What will come through the pipeline?
In any case, is it correct to say that the 'date' column will remain paired with its corresponding 'symbol' and 'my_value' like a regular row? If so, I can just sort the data on the 'date' column newest to oldest and do a groupby (etc.) to get the most recent batch of data.
Am I on the right track?

Thanks in advance.

Bryan


Follow-up for clarity. In my example above, the two lists are different, i.e. some symbols and values are different. When accessing the data after 6/30/2018, I only want the 6/30/2018 data. None of the previous 5/31/2018 data will be relevant at that point in time.
Any suggestions on how I would be able to accomplish that?
Thank you

Hi Bryan, thank you for digging into the docs. Let me see if I can clarify some details:

I was in the process of putting together the attached notebook, when you clarified your goal. You'll want to leverage the "Pipeline with additional recent update check" (the last example).

Pipeline will handle all the complexity of surfacing the proper point-in-time data to your algorithm; the base and delta tables are stored so you can explore the raw data if you have any questions.

Should the new data rows be appended to the bottom of the existing list? Can I just replace the old data with the new data? Does it matter?

The most general solution is to add any new and adjusted data to the existing file; the order doesn't matter to us as long as the header row comes first (see the example below). If you aren't restating any old values (on 6/30 there were no updates to the 5/31 values), and you are sure your old values have already been processed, you could replace the old data with the new data to reduce the number of records if you are close to the Google Sheets total cell count limit.
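For illustration, simply appending the new June rows below the existing May rows (your sample data above, stacked in one sheet) would look like this:

date,symbol,my_value
5/31/2018,GCP,1
5/31/2018,EXPE,2
5/31/2018,PMT,3
5/31/2018,DOW,4
5/31/2018,SHO,5
5/31/2018,INGR,6
5/31/2018,QGEN,7
5/31/2018,AWK,8
6/30/2018,GCP,5
6/30/2018,EXPE,9
6/30/2018,IBM,3
6/30/2018,CAT,4
6/30/2018,TWTR,5
6/30/2018,INGR,1
6/30/2018,QGEN,17
6/30/2018,AWK,82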

Does it matter?

Since we are constantly comparing all the new live data against all the existing (T-1) data, limiting the live data to just the new and adjusted values can have a slight performance benefit in live load processing time. That said, monitoring for a successful load via load_metrics before you update the file can be an additional burden. The key is making sure you know exactly what is changing with your data. We have seen cases where data processing bugs were discovered after unexpected delta rows appeared in interactive data checks.

On 07/01/2018 the algo ingests the updated data. What will come through the pipeline?

Actually, the pipeline won't run until the following trading day (7/2/2018). Assuming you update the live data with the 6/30/2018 records above before 7am UTC on 7/2/2018, those new records will be accessible to pipeline on 7/2/2018.

What will come through the pipeline?

That depends on how you configure your pipeline. The attached notebook includes the sample data above shifted backwards a month (2018-04-30 and 2018-05-31). The first example shows how pipeline (with .latest) will forward fill last known values until a new value appears; a minimal sketch of that pattern is below.
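As a rough sketch (the import path and dataset name below are placeholders; your actual self-serve dataset is imported from your own user namespace):

from quantopian.pipeline import Pipeline
# Hypothetical import path -- substitute your own user id and dataset name
from quantopian.pipeline.data.user_12345 import my_dataset

def make_pipeline():
    # .latest surfaces the most recent known my_value per asset and
    # forward-fills it on days with no new row in the file
    my_value = my_dataset.my_value.latest
    return Pipeline(
        columns={'my_value': my_value},
        screen=my_value.notnull(),
    )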

Any suggestions on how I would be able to accomplish that?

The second example leverages BusinessDaysSincePreviousEvent to add a has_recent_update term to the pipeline screen. This will surface only the data from the last update.
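A minimal sketch of that approach, again with a placeholder import path and assuming your dataset exposes the standard asof_date column (the one-day threshold is illustrative and can be loosened for less frequent updates):

from quantopian.pipeline import Pipeline
from quantopian.pipeline.factors import BusinessDaysSincePreviousEvent
# Hypothetical import path -- substitute your own user id and dataset name
from quantopian.pipeline.data.user_12345 import my_dataset

def make_pipeline():
    my_value = my_dataset.my_value.latest
    # Business days elapsed since each asset's most recent asof_date
    days_since_update = BusinessDaysSincePreviousEvent(
        inputs=[my_dataset.asof_date]
    )
    # Keep only assets whose value arrived in the latest update
    has_recent_update = days_since_update <= 1
    return Pipeline(
        columns={'my_value': my_value,
                 'days_since_update': days_since_update},
        screen=has_recent_update,
    )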

Hope that helps

@Chris Myles. Thank you for taking the time to explain. I get it now!
By using your notebook as a guide, along with a bit of tweaking, my algos should be able to access my own self-serve data going forward.
The new self-serve data feature is a great one. Glad to be able to use it.
Cheers