[Metric Request] WOOD score
WOOD score paper: https://arxiv.org/pdf/2007.06898.pdf
Abstract:
Models that surpass human performance on several popular benchmarks display significant degradation in performance on exposure to Out of Distribution (OOD) data. Recent research has shown that models overfit to spurious biases and "hack" datasets, in lieu of learning generalizable features like humans. In order to stop the inflation in model performance, and thus overestimation in AI systems' capabilities, we propose a simple and novel evaluation metric, WOOD Score, that encourages generalization during evaluation.
Is this being worked on? If not, I'd like to try! I can do this by following the directions outlined here, correct?
Hi @kasmith11, I don't think anybody is working on it right now. Following the guide will create a community metric (i.e. one you can load with load("kasmith/wood")). To make it an official metric maintained in evaluate, we can simply move the code into metrics/ afterwards, so it's a good start and you can test it without needing to merge a PR :)
I would also like to work on this one. [new guy here]
I'm very open to collaboration! If you're interested, we can work together on this @sezan92. Would that change anything you outlined above @lvwerra?
Sure, if you'd like to collaborate that would be a good issue :) For communication you could join our Discord: https://huggingface.co/join/discord
@kasmith11 sorry for the late reply. Sure. How would you like to begin?
Hi @sezan92, I took an initial pass at implementing WOOD score here after reading the paper. I haven't gotten a chance to test the implementation or fill out any of the documentation.
I think testing/debugging and documentation are the next steps.
Are you in the huggingface discord linked above? I think that would be a great place for us to communicate via chat going forward.
@kasmith11 yes, I just joined. My username is sezan92.
Fantastic @sezan92. I'll reach out to you via discord soon.
I have a repository with an implementation of WoodScore here. I've had more time to dedicate to this, if you are still interested @sezan92.