Hi folks,
Arshdeep has it right in linking to Dan's post on scoring changes. The consistency score is computed by fitting a distribution to the daily returns of the backtest and another to the daily returns of the paper trading period, and then measuring the overlapping region under the two curves.
You can read more detail in Dan's post linked above, and you can nerd out on the Gaussian KDE function if you like. But in the limit of many daily returns data points for both the backtest and the paper trading period, the choice of distribution function shouldn't be all that critical.
This method is by no means perfect. One weakness shows up when there are very few days of paper trading data to fit the 'out of sample' distribution, in which case the fit might be overly generous. It should, however, be a lot better than just comparing the annualized returns of the backtest and the paper traded results, since annualized returns computed from just a few days, or even weeks, of paper trading can be extremely noisy.
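To put a rough number on how noisy short-window annualized returns can be, here is a minimal simulation sketch. The daily mean and volatility (0.05% and 1%), the 252-day year, and the helper names `annualized_return` and `annualized_spread` are all assumptions made up for illustration; none of this is contest code.

```python
import numpy as np

rng = np.random.default_rng(0)


def annualized_return(daily_returns, periods_per_year=252):
    """Geometric annualized return from a 1-D array of daily simple returns."""
    total_growth = np.prod(1.0 + daily_returns)
    return total_growth ** (periods_per_year / len(daily_returns)) - 1.0


def annualized_spread(n_days, n_trials=2000, mean=0.0005, vol=0.01):
    """Std. dev. of the annualized return across simulated n_days-long samples.

    mean and vol are hypothetical daily parameters, chosen only for the demo.
    """
    samples = rng.normal(mean, vol, size=(n_trials, n_days))
    return np.std([annualized_return(row) for row in samples])


# Same return-generating process, very different sample lengths.
spread_5_days = annualized_spread(5)
spread_250_days = annualized_spread(250)
print("spread from 5 days:  ", spread_5_days)
print("spread from 250 days:", spread_250_days)
```

Under these assumptions the spread of the 5-day estimate comes out far wider than the 250-day one, even though the underlying process is identical.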
We are working on an open-source risk library that will expose this calculation along with all the other ones we're using to evaluate algorithms in the contest. In the meantime, Justin shared a clone-able notebook last week that has a bunch of research code you can reuse. The function we use to compute consistency is called out_of_sample_vs_in_sample_returns_kde. You can clone that notebook and use it directly in the research platform, or if you just want to see the function, I've included it below along with a sample plot for a visual of what a 0.85 consistency score looks like.

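import numpy as np
from scipy import stats
from sklearn import preprocessing


def out_of_sample_vs_in_sample_returns_kde(bt_ts, oos_ts,
                                           transform_style='scale',
                                           return_zero_if_exception=True):
    # Convert the backtest and out-of-sample series to period-over-period returns.
    bt_ts_pct = bt_ts.pct_change().dropna()
    oos_ts_pct = oos_ts.pct_change().dropna()

    # sklearn expects 2-D input. Go through .values, since Series.reshape
    # is not available in current pandas.
    bt_ts_r = bt_ts_pct.values.reshape(-1, 1)
    oos_ts_r = oos_ts_pct.values.reshape(-1, 1)

    if transform_style == 'raw':
        bt_scaled = bt_ts_r
        oos_scaled = oos_ts_r
    elif transform_style == 'scale':
        # Standardize each series to zero mean and unit variance.
        bt_scaled = preprocessing.scale(bt_ts_r, axis=0)
        oos_scaled = preprocessing.scale(oos_ts_r, axis=0)
    elif transform_style == 'normalize_L2':
        # Note: with a single column, axis=1 normalizes each row on its own,
        # so every nonzero value becomes +1 or -1.
        bt_scaled = preprocessing.normalize(bt_ts_r, axis=1)
        oos_scaled = preprocessing.normalize(oos_ts_r, axis=1)
    elif transform_style == 'normalize_L1':
        bt_scaled = preprocessing.normalize(bt_ts_r, axis=1, norm='l1')
        oos_scaled = preprocessing.normalize(oos_ts_r, axis=1, norm='l1')
    else:
        raise ValueError('unknown transform_style: %r' % (transform_style,))

    # Flatten back to 1-D for the KDE.
    X_train = bt_scaled.ravel()
    X_test = oos_scaled.ravel()

    # Evaluate both fitted densities on a common grid.
    x_axis_dim = np.linspace(-4, 4, 100)
    kernel_method = 'scott'
    try:
        scipy_kde_train = stats.gaussian_kde(X_train, bw_method=kernel_method)(x_axis_dim)
        scipy_kde_test = stats.gaussian_kde(X_test, bw_method=kernel_method)(x_axis_dim)
    except Exception:
        # The KDE fit can fail, e.g. with too few or degenerate data points.
        return 0.0 if return_zero_if_exception else np.nan

    # Normalized absolute difference between the two density curves:
    # 0 when they coincide on the grid, 1 when they do not overlap at all.
    kde_diff = sum(abs(scipy_kde_test - scipy_kde_train)) / (sum(scipy_kde_train) + sum(scipy_kde_test))
    return kde_diff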
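
Note that the function returns a normalized difference between the two curves, not the overlap itself. Since |a - b| = a + b - 2*min(a, b), one minus that difference equals the shared area under the two curves as a fraction of their average area on the grid. The sketch below checks that identity on synthetic data. The sample sizes and distribution parameters are made up, the scaling step is skipped, and reading the consistency score as one minus the returned value is my assumption, not something stated above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical, already roughly standardized "returns" for illustration only.
in_sample = rng.normal(0.0, 1.0, size=500)      # stands in for the backtest
out_of_sample = rng.normal(0.3, 1.2, size=60)   # stands in for paper trading

# Same grid and bandwidth rule as the function above.
grid = np.linspace(-4, 4, 100)
kde_in = stats.gaussian_kde(in_sample, bw_method='scott')(grid)
kde_out = stats.gaussian_kde(out_of_sample, bw_method='scott')(grid)

total = kde_in.sum() + kde_out.sum()

# What the function returns: normalized absolute difference of the curves.
kde_diff = np.abs(kde_out - kde_in).sum() / total

# Shared area under both curves, relative to their average area on the grid.
overlap = 2.0 * np.minimum(kde_in, kde_out).sum() / total

print("kde_diff:    ", kde_diff)
print("1 - kde_diff:", 1.0 - kde_diff)
print("overlap:     ", overlap)
```

The last two printed values agree up to floating-point error, which is why a small difference corresponds to a large overlapping region.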