badhrisuresh
This issue is caused by some randomness in the ROUGE score code (in the evaluate repo), and I fixed it by setting the numpy random seed in the script. Please take a...
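For reference, here is a minimal sketch of the seed-based workaround, assuming the script computes ROUGE through the Hugging Face `evaluate` library (the seed value and example strings are illustrative, not the actual ones in the script):

```python
import numpy as np
import evaluate

# Pin numpy's global RNG so the random resampling done inside the ROUGE
# metric's aggregation step is repeatable across runs.
np.random.seed(9973)

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the cat sat on the mat"],  # illustrative strings
    references=["the cat sat on the mat"],
)
print(scores)
```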
Ideally, they should be deterministic, as they are F1 scores over different n-grams. I'm looking at an existing issue in their repo and will update once I test the actual...
I found this issue [here](https://github.com/huggingface/evaluate/issues/186), which describes the same problem. The code enables the BootstrapAggregator by default, which does random sampling to compute confidence intervals and causes...
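If avoiding the aggregator altogether is acceptable, a sketch of that alternative, assuming the `evaluate` ROUGE metric is called directly (the example strings are illustrative):

```python
import evaluate

rouge = evaluate.load("rouge")

# Disabling the BootstrapAggregator skips its random resampling entirely,
# so per-example ROUGE values are returned instead of aggregated
# confidence intervals.
scores = rouge.compute(
    predictions=["a summary produced by the model"],  # illustrative strings
    references=["the reference summary"],
    use_aggregator=False,
)
print(scores)  # lists of per-example ROUGE values
```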
Updated the README in the [PR](https://github.com/mlcommons/inference/pull/1386): modified the repo name and added the reference model's ROUGE scores.
We are still working on publishing the fine-tuned model publicly, but we have already shared the checkpoint internally with the task force, so you can try it.
We have always used the validation set, not the test set, for MLPerf Inference benchmarking. I have removed the redundant code from download_cnndm.py and updated max_examples in main.py.
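As a rough sketch of what download_cnndm.py fetches under that convention (the dataset name and config follow the Hugging Face hub; everything else here is illustrative):

```python
from datasets import load_dataset

# Only the validation split is used for MLPerf Inference benchmarking;
# the test split is intentionally not downloaded.
dataset = load_dataset("cnn_dailymail", "3.0.0", split="validation")
print(len(dataset), "validation examples")
```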