@James O'Brien: Thanks for finding this! There is currently a problem with a set of categorical fields, including profitability grade. The full list of affected fields is:
financial_health_grade
financial_growth_grade
profitability_grade
company_status
industry_template_code
share_class_status
For now, please continue fetching these fields through the old API. I will post an update when the fix has been deployed.
@Costantino: I agree that the current API makes it cumbersome to work with quarterly or lower-frequency data. While it may seem like the current API is inefficient, the way Pipeline is implemented means this is not the case.
In algorithms, pipelines are computed in 6-month chunks. This means that every 6 months we fetch all of the data and perform all of the computations needed to produce the next 6 months of output values. One of the biggest costs in Pipeline is retrieving the input data, so we try to read it in large batches. Querying for data in large batches reduces the number of times we need to go to the disk or database; each of those trips has a high constant cost regardless of the amount of data being read.
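To make the constant-cost intuition concrete, here is a toy cost model. The overhead and per-row numbers are illustrative assumptions, not measurements from Pipeline:

```python
# Toy model of read cost: every query pays a fixed overhead (seek/connection
# setup) plus a small per-row cost. The numbers are illustrative, not measured.
FIXED_COST_PER_QUERY_MS = 10.0
COST_PER_ROW_MS = 0.001

def read_cost_ms(n_queries, total_rows):
    return n_queries * FIXED_COST_PER_QUERY_MS + total_rows * COST_PER_ROW_MS

# Fetching ~126 trading days (6 months) one day at a time vs. one dense query:
one_row_at_a_time = read_cost_ms(126, 126)  # 126 queries, one row each
one_dense_block = read_cost_ms(1, 126)      # 1 query for the same 126 rows
```

The same rows are read in both cases, so the per-row cost is identical; the dense block wins by paying the fixed overhead once instead of 126 times.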
Imagine that we had an API that presented the user with the current value and the trailing quarter's value for some field. On the first day of the computation, when today = 2016-03-01, we would need to read exactly two rows: the current day's value (2016-03-01) and the value from one quarter ago (2016-01-01). The table on the left shows the raw source data, and the table on the right shows the slice of data that is presented in a custom factor.
On the second day of computation, when today = 2016-03-02, we would need to read two more rows. This means we have read four rows in total: 2016-01-01 and 2016-03-01 from the previous computation, and 2016-01-02 and 2016-03-02 from the current day's computation. Again, the table on the left shows the raw source data and the table on the right shows the slice of data that is presented in a custom factor.
If we repeat this process for at least one quarter, we will have read every row from 2016-01-01 to 2016-03-01. Because we know that all of these values will be used eventually, it is more efficient to query for them as one dense block. In theory, we could hold only two rows in memory at a time; however, the time cost of reading the data a few rows at a time would make this infeasible.
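The bookkeeping above can be simulated directly. This sketch uses calendar days and an assumed 60-day quarter offset for simplicity (a real implementation would use a trading calendar), and counts the distinct rows touched after one quarter of daily two-row reads:

```python
from datetime import date, timedelta

QUARTER = timedelta(days=60)  # illustrative offset; real code uses trading days
start = date(2016, 3, 1)      # first computation day

rows_read = set()
for i in range(61):                     # run the computation for one quarter
    today = start + timedelta(days=i)
    rows_read.add(today)                # the current day's value
    rows_read.add(today - QUARTER)      # the trailing quarter's value

# Every day from 2016-01-01 through 2016-03-01 has eventually been read:
first_quarter = {date(2016, 1, 1) + timedelta(days=d) for d in range(61)}
```

After one quarter of computations, the two-rows-per-day reads have collectively touched every row in the range, which is why fetching the whole range as one dense block up front wastes nothing.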
In research, users may choose to run smaller windows that would not require every value from 2016-01-01 to 2016-03-01. The optimization of querying in a dense block still holds, because it is more efficient to read a contiguous block of data than to do random access: we would spend more time determining which rows to skip than we would just reading all of the rows. Even if we could build an indexing scheme to read these non-contiguous regions more efficiently, the absolute time saved when querying for less than one quarter of data would be fractions of a second, and the RAM cost of a few extra rows is negligible.
Hopefully this helps explain why using a long lookback window may be as efficient as querying for trailing quarters.
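For completeness, here is roughly what the dense window looks like from inside a custom factor's compute step. This is a plain-NumPy sketch, not the Pipeline API itself: `values` stands in for the (window_length × assets) block Pipeline would hand to compute(), and the function name is hypothetical. With a window long enough to span a quarter, row 0 is the trailing quarter's value and row -1 is today's:

```python
import numpy as np

def quarter_over_quarter_change(values):
    """Fractional change from the oldest row of the window to the newest.

    `values` mimics the dense (window_length x n_assets) array a pipeline
    custom factor receives; only the first and last rows are actually used.
    """
    return (values[-1] - values[0]) / values[0]

window = np.array([[100.0, 50.0],   # one quarter ago (oldest row)
                   [105.0, 52.0],
                   [110.0, 55.0]])  # today (window_length shortened for brevity)

change = quarter_over_quarter_change(window)  # 10% change for both assets
```

Even though only two rows of the window are used, the long window is what lets the engine read the whole range as one contiguous block.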
@Ian Worthington: I apologize for the inconvenience; a few of the fields have been renamed slightly:
manual_renames = {
    'dividend_yield': 'trailing_dividend_yield',
    'dividend_rate': 'forward_dividend',
    'diluted_eps': 'diluted_eps_earnings_reports',
    'basic_eps': 'basic_eps_earnings_reports',
    'basic_eps_other_gains_losses': (
        'basic_eps_other_gains_losses_earnings_reports'
    ),
    'diluted_eps_other_gains_losses': (
        'diluted_eps_other_gains_losses_earnings_reports'
    ),
    'basic_average_shares': 'basic_average_shares_earnings_reports',
    'diluted_average_shares': 'diluted_average_shares_earnings_reports',
    'average_dilution_earn': 'average_dilution_earnings',
    'basic_extraordinary': 'basic_extraordinary_earnings_reports',
    'normalized_basic_eps': 'normalized_basic_eps_earnings_reports',
    'diluted_discontinuous_operations': (
        'diluted_discontinuous_operations_earnings_reports'
    ),
    'diluted_continuous_operations': (
        'diluted_continuous_operations_earnings_reports'
    ),
    'basic_accounting_change': (
        'basic_accounting_change_earnings_reports'
    ),
    'continuing_and_discontinued_basic_eps': (
        'continuing_and_discontinued_basic_eps_earnings_reports'
    ),
    'diluted_extraordinary': 'diluted_extraordinary_earnings_reports',
    'tax_loss_carryforward_diluted_eps': (
        'tax_loss_carryforward_diluted_eps_earnings_reports'
    ),
    'tax_loss_carryforward_basic_eps': (
        'tax_loss_carryforward_basic_eps_earnings_reports'
    ),
    'basic_discontinuous_operations': (
        'basic_discontinuous_operations_earnings_reports'
    ),
    'continuing_and_discontinued_diluted_eps': (
        'continuing_and_discontinued_diluted_eps_earnings_reports'
    ),
    'basic_continuous_operations': (
        'basic_continuous_operations_earnings_reports'
    ),
    'normalized_diluted_eps': 'normalized_diluted_eps_earnings_reports',
    'dividend_per_share': 'dividend_per_share_earnings_reports',
    'diluted_accounting_change': 'diluted_accounting_change_earnings_reports',
}
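If you have existing code that references the old names, a small lookup helper makes the migration mechanical. A minimal sketch, reproducing only a few entries from the mapping above for brevity:

```python
# Subset of the rename mapping, for illustration only.
manual_renames = {
    'dividend_yield': 'trailing_dividend_yield',
    'dividend_rate': 'forward_dividend',
    'diluted_eps': 'diluted_eps_earnings_reports',
}

def updated_name(field):
    """Return the new field name, or the field itself if it was not renamed."""
    return manual_renames.get(field, field)

updated_name('dividend_yield')  # 'trailing_dividend_yield'
updated_name('total_revenue')   # unchanged: not in the mapping
```

Using `dict.get` with the field itself as the default means unrenamed fields pass through untouched, so the helper is safe to apply to every field reference.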
The _earnings_reports suffix denotes that the attribute describes the entire company; we may have an attribute of the same name for each share class.
The other renames are meant to clarify which direction a field is looking, for example: 'dividend_yield': 'trailing_dividend_yield'. We also have forward_dividend_yield, so we wanted to make clear what each of these fields means.
I am currently looking into the other issues posted in this thread. Thank you all for the great feedback and testing!