The first bug is that run_pipeline sometimes returns values for dates after the requested ones.
The behavior here is that when you call run_pipeline(pipe, start_date, end_date), if start_date or end_date are not trading days, they're rolled forward to the next trading day in the US Equity calendar. This is intended, since it's the only behavior that makes it easy to run single-day pipelines (a common task when debugging or testing a new filter/factor/classifier) without the user having to know the historical trading calendar by heart.
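As a rough illustration of the roll-forward rule described above, here is a minimal sketch. It is not the actual run_pipeline implementation: the SESSIONS list is a hand-written stand-in for the US Equity calendar, and roll_forward is a hypothetical helper name chosen for this example.

```python
from bisect import bisect_left
from datetime import date

# Hypothetical, hand-written list of trading days (sorted ascending).
# A real calendar would come from the platform, not be typed in by hand.
SESSIONS = [
    date(2016, 1, 7),
    date(2016, 1, 8),   # Friday
    date(2016, 1, 11),  # Monday
    date(2016, 1, 12),
]


def roll_forward(day, sessions=SESSIONS):
    """Return `day` if it is a trading day, else the next trading day."""
    i = bisect_left(sessions, day)  # index of first session >= day
    if i == len(sessions):
        raise ValueError(f"no trading day on or after {day}")
    return sessions[i]


# A single-day request that lands on a Saturday rolls to the next Monday.
requested = date(2016, 1, 9)
start = end = roll_forward(requested)
print(start, end)
```

A date that is already a trading day is returned unchanged, so the rule only alters requests that fall on weekends or holidays.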
Consider the case where a user wants to run a pipeline for the first day of 2016. The obvious thing to write here is something like run_pipeline(pipe, '2016-01-01', '2016-01-01'). As it turns out, however, the first trading day of 2016 was January 4th, so we have a few options for what we can do:
1. Return an empty dataframe.
2. Raise an exception informing the user that there are no trading days between the requested dates.
3. Roll the start and end dates to the next or previous trading day and then compute.
Option 1 is likely to just confuse users, and Option 2 forces users to know the historical trading calendar by heart in order to use run_pipeline without errors, so Option 3 seems like the most friendly behavior for a function that will be invoked interactively. Within Option 3, we have a few possible choices:
1. Roll both start_date and end_date forward.
2. Roll both start_date and end_date backward.
3. Roll start_date backward and end_date forward. (This gives the largest possible window.)
4. Roll start_date forward and end_date backward. (This gives the shortest possible window.)
Option (4) seems like the natural choice, but that doesn't solve the problem in the case that there are no trading days between start_date and end_date, so we'd still end up blowing up or returning an empty result in many cases.
Option (3) has the confusing behavior that run_pipeline(pipe, '2014', '2014') would return two days of data rather than just one.
That leaves Option (1) and Option (2). The choice here is more or less arbitrary. I think rolling forward (Option (1)) is slightly more intuitive because it has the nice property that run_pipeline(pipe, '2014', '2014') still gives you data from 2014, rather than rolling back to a previous year.
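To make the comparison concrete, the sketch below applies all four rolling choices to a single-day request for 2016-01-01. It is only an illustration under stated assumptions: SESSIONS is a hand-written list of trading days around the new year (not a real calendar object), and roll_forward, roll_backward, and window are hypothetical helpers, not part of run_pipeline.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Hypothetical, hand-written trading days around New Year 2016 (sorted).
SESSIONS = [
    date(2015, 12, 30),
    date(2015, 12, 31),
    date(2016, 1, 4),   # first trading day of 2016
    date(2016, 1, 5),
]


def roll_forward(day):
    """Return `day` if it is a trading day, else the next trading day."""
    i = bisect_left(SESSIONS, day)
    if i == len(SESSIONS):
        raise ValueError(f"no trading day on or after {day}")
    return SESSIONS[i]


def roll_backward(day):
    """Return `day` if it is a trading day, else the previous trading day."""
    i = bisect_right(SESSIONS, day)
    if i == 0:
        raise ValueError(f"no trading day on or before {day}")
    return SESSIONS[i - 1]


def window(start, end, roll_start, roll_end):
    """Trading days between the rolled start and rolled end, inclusive."""
    lo, hi = roll_start(start), roll_end(end)
    return [s for s in SESSIONS if lo <= s <= hi]


new_year = date(2016, 1, 1)
results = {
    1: window(new_year, new_year, roll_forward, roll_forward),
    2: window(new_year, new_year, roll_backward, roll_backward),
    3: window(new_year, new_year, roll_backward, roll_forward),
    4: window(new_year, new_year, roll_forward, roll_backward),
}
for option, days in results.items():
    print(option, [d.isoformat() for d in days])
```

With this toy calendar, choices 1 and 2 each yield exactly one day (January 4th and December 31st respectively), choice 3 yields two days, and choice 4 yields an empty window because the rolled start falls after the rolled end.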