Hey Everyone,
Today, we wanted to update you on some improvements we've made to the performance and scalability of our custom dataset integration. If you are not yet familiar with Self-Serve Data and you would like to upload your own time-series data to Quantopian for use with Pipeline, check out the documentation here.
Since Self-Serve Data was introduced almost two years ago, thousands and thousands of custom datasets have been added to the Quantopian platform. But with increased usage came the growing pains of an overloaded system. How many of you added a self-serve dataset to your analysis, only to have your pipelines slow down to "coffee break" speed?
Faster, Bigger, and More
- Greatly improved pipeline runtime performance of self-serve datasets; we've seen speedups of 10-20x on large datasets.
- We've increased the maximum file size from 300MB to 500MB (for Enterprise clients, from 500MB to 6GB).
- We've increased the maximum number of Self-Serve datasets from 30 to 50.
Auto-migration
In order to minimize changes to existing analyses and pipelines, we have migrated all of your (non-error) datasets from the old system to the new one. You won't have to change any of your existing self-serve imports to use the improved functionality.
Note: old datasets will remain available for comparison for about a month under a legacy namespace, while we work through migration options for anyone with active live datasets:
from quantopian.pipeline.data.old.<User|Org ID> import <DataSet>
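For example, comparing a migrated dataset against its legacy copy might look like the following sketch; user_12345 and my_dataset are hypothetical placeholders for your own user/org ID and dataset name:

# Hypothetical placeholders: substitute your own user/org ID and dataset name.
from quantopian.pipeline.data.user_12345 import my_dataset
from quantopian.pipeline.data.old.user_12345 import my_dataset as my_dataset_old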
Improvements
By leveraging the same technology stack that our pre-integrated datasets use, we've been able to improve the coverage of the self-serve symbology mapping step and the handling of NaN data. Now, when a user explicitly provides a NaN value, we process it as a valid data point. In the past, we forward-filled previous data over all NaN values, including user-supplied ones.
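Here is a minimal sketch of the difference, using pandas outside the platform; the dates and values are hypothetical:

import numpy as np
import pandas as pd

# Hypothetical upload: the user explicitly supplies NaN on 2020-01-03.
raw = pd.Series(
    [1.0, np.nan, 3.0],
    index=pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06']),
    name='signal',
)

# Old behavior: every NaN was forward-filled, so the explicit NaN on
# 2020-01-03 came back as 1.0 in Pipeline output.
print(raw.ffill())

# New behavior: the explicitly supplied NaN is kept as a valid data
# point, so 2020-01-03 stays NaN.
print(raw)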
Note: you will need to re-import a dataset to trigger the improved symbology mapping step. Migrated datasets preserve the exact symbology data from the original system.
Self-serve data validation has also been improved with explicit date, datetime, and boolean type formats.
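For instance, an upload exercising the new type formats might be built like this sketch; the column names are hypothetical:

import pandas as pd

# Hypothetical columns exercising the new explicit type formats: a date,
# a datetime, and a boolean alongside the usual symbol/value fields.
rows = pd.DataFrame({
    'date': ['2020-01-02'],                  # date format
    'reported_at': ['2020-01-02 16:30:00'],  # datetime format
    'symbol': ['AAPL'],
    'is_preliminary': [True],                # boolean format
    'value': [1.5],
})
rows.to_csv('my_dataset.csv', index=False)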
Two new research APIs to analyze Self-Serve data
Note: you can use Shift+Tab in Research to see more details about each function's signature and docstring.
query_self_serve_data
Now you can query the underlying raw symbol-mapped data for a new self-serve dataset. The following example returns all columns of the given DATASET for the US domain, from the earliest asof_date (None) through 2002-10-01:
from quantopian.research.experimental import query_self_serve_data
query_self_serve_data(DATASET.columns, 'US', None, '2002-10-01')
You can also pass a subset of columns:
query_self_serve_data([DATASET.asof_date, DATASET.sid], 'US')
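Assuming the call returns a pandas DataFrame of the raw mapped rows (a hedged assumption, not confirmed above), you might inspect the most recent records like this:

raw = query_self_serve_data(DATASET.columns, 'US', None, '2002-10-01')
raw.sort_values('asof_date').tail()  # assumption: rows carry an asof_date column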
query_self_serve_failed_symbology
Returns a list of the primary asset identifiers that could not be successfully mapped to assets in the Quantopian system, along with the minimum and maximum asof_date for each identifier.
from quantopian.research.experimental import query_self_serve_failed_symbology
query_self_serve_failed_symbology(DATASET)[['identifier', 'min_asof', 'max_asof']]
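The bracket indexing above suggests the result behaves like a pandas DataFrame, so a quick triage sketch might look like:

# Assumption: one row per unmapped identifier.
failed = query_self_serve_failed_symbology(DATASET)
print('%d identifiers failed to map' % len(failed))
failed.head()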
If you have any questions or issues, please contact us at [email protected].