This is a timely discussion, as one of our current focuses is improving/correcting the metrics mentioned in this thread.
(To follow along with the risk metrics, the calculations are done here: https://github.com/quantopian/zipline/blob/master/zipline/finance/risk.py)
One of the main corrections needed is that our risk calculations are tailored to the 1/3/6/12-month reports. This means our volatility currently uses a denominator suited to those time frames and is not normalized to 252 trading days for the headline number, which is probably why the numbers are not what you expect, @Stef42.
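To make the 252-day normalization concrete, here is a minimal sketch of an annualized volatility. It is not the code in risk.py; the function name and the use of the sample standard deviation are my own assumptions for illustration.

```python
import math
import statistics

# 252 trading days per year, the normalization discussed above.
TRADING_DAYS_PER_YEAR = 252


def annualized_volatility(daily_returns):
    """Scale the standard deviation of daily returns to a 252-day year.

    Hypothetical helper for illustration only; it uses the sample
    standard deviation (n - 1 denominator), which is an assumption.
    """
    if len(daily_returns) < 2:
        raise ValueError("need at least two daily returns")
    daily_std = statistics.stdev(daily_returns)
    # Volatility scales with the square root of time.
    return daily_std * math.sqrt(TRADING_DAYS_PER_YEAR)
```

The point is that the scaling factor stays at sqrt(252) regardless of whether the report window is 1, 3, 6, or 12 months, so the headline number is comparable across windows.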
Likewise, the Sharpe ratio does not have a good calculation of the expected value against the risk-free rate; it should use the mean of the difference between the returns and the risk-free rate, not just the difference between the two. (And, as @Ben McCann has astutely pointed out, we should convert the risk-free rate to 10-year periods.)
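As a rough sketch of the "mean of the difference" idea, something like the following is what I have in mind. It is not what risk.py does today. The function name is hypothetical, and I am assuming that the risk-free rate is already expressed per period, that the denominator is the sample standard deviation of the excess returns, and that we annualize with 252 periods.

```python
import math
import statistics


def sharpe_ratio(algo_returns, risk_free_returns, periods_per_year=252):
    """Annualized Sharpe ratio from per-period excess returns.

    Hypothetical helper for illustration only. Assumes both inputs are
    per-period returns of equal length, and that the denominator is the
    sample standard deviation of the excess returns.
    """
    if len(algo_returns) != len(risk_free_returns):
        raise ValueError("return series must have the same length")
    if len(algo_returns) < 2:
        raise ValueError("need at least two periods")
    # Take the difference period by period, then average it,
    # rather than subtracting one aggregate figure from another.
    excess = [r - rf for r, rf in zip(algo_returns, risk_free_returns)]
    excess_std = statistics.stdev(excess)
    if excess_std == 0:
        return float("nan")  # ratio is undefined with zero variation
    return statistics.mean(excess) / excess_std * math.sqrt(periods_per_year)
```

For example, `sharpe_ratio(daily_algo_returns, daily_risk_free_returns)` would take two equal-length lists of daily returns; how the risk-free series gets converted to a per-period rate is left out here on purpose.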
We've also been working on integrating an Excel spreadsheet, which we've used to corroborate the Zipline risk calculations, more closely into the test suite, and on making that spreadsheet public for verification and scrutiny. We're hoping that having two different implementations, each verifiable independently, will help shore up the correctness of the risk calculations.
Here is a link to the current version, https://s3.amazonaws.com/zipline-test-data/risk/3ac0773c4be4e9e5bacd9c6fa0e03e15+/risk-answer-key.xls
We're hoping to have some annotations for that Excel spreadsheet, in the form of an IPython notebook, soon. The answer key now powers the unit tests found here: https://github.com/quantopian/zipline/blob/master/tests/risk/test_risk.py. As we improve our risk calculations, we will update the spreadsheet accordingly.
Transparency of the backtester is a big goal of ours, including exposing how it works and its known deficiencies, but sometimes issues end up only on the internal issue tracker. I'll double-check that these issues are on the external issue tracker and, if not, copy the tickets about Sharpe, normalization, etc. over to https://github.com/quantopian/zipline/issues and link to them from here.
(Also, @Daniel Sandberg, regarding your concern about the difference between the benchmark and the algo returns: that difference is addressed here, https://quantopian.com/posts/algorithm-at-a-disadvantage. The TL;DR is that currently, for backtesting, the benchmark on the graph is the S&P, whereas the algo buys into the SPY ETF.)