Statistical significance testing
As mentioned in a recent paper on evaluating MT approaches (and probably other sources too), statistical significance testing can be used to confirm that one method is superior to another; the paper calls it "one of the most cost-effective tools to check how trustworthy a particular difference between two metric scores is."
We could use the Wilcoxon signed-rank test (implemented in `scipy.stats.wilcoxon`) or a similar approach.
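As a rough sketch of what this could look like, here is the Wilcoxon signed-rank test applied to paired per-segment scores from two systems (the score values below are made up for illustration):

```python
from scipy.stats import wilcoxon

# Hypothetical per-segment metric scores for two MT systems,
# evaluated on the same test segments (paired samples).
system_a = [0.62, 0.71, 0.58, 0.66, 0.74, 0.69, 0.61, 0.70]
system_b = [0.60, 0.68, 0.59, 0.63, 0.72, 0.65, 0.60, 0.66]

# Wilcoxon signed-rank test on the paired differences.
# A small p-value suggests the score difference is unlikely
# to be due to chance alone.
statistic, p_value = wilcoxon(system_a, system_b)
print(f"statistic={statistic}, p-value={p_value:.4f}")
```

In practice the test would run on the per-example scores the metric already computes, so exposing it alongside the metric output seems natural.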
I think this would be a valuable addition to the library, and I'd be happy to work on adding significance testing. There are already a few libraries that implement these features; I've used deep-significance in my own work and found it reliable and easy to use.
@lvwerra, can I have your input on this request?