Suppose a model passes multiple out-of-sample tests and is then put through a live test. How can one determine whether the live result, whether great or poor, is an anomaly rather than representative of the model's true performance? What are some methods to make validation more robust?