Integrate Whisper model with `hg` evaluation CLI interface
What does this PR do? Please describe:
This PR integrates the Whisper model with the Hugging Face (`hg`) evaluation CLI interface.
In the process, I have refactored some functions to be more generic and to support the Hugging Face Transformers API.
Demo:

```
$ fairseq2 hg asr /tmp/fairseq2 --config max_num_elements=20 model_name=openai/whisper-tiny.en dtype=torch.float32
[08/09/24 19:00:49] INFO fairseq2.recipes.hg.evaluator - Running evaluation on 1 device(s).
eval: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 147
[08/09/24 19:03:37] INFO fairseq2.recipes.hg.evaluator - Eval Metrics - BLEU: 0.497937 | Elapsed Time: 167s | Wall Time: 169s | brevity_penalty: 1.0 | length_ratio: 1.1557228282673901 | precisions: [0.676166231361788, 0.5496829119972201, 0.4500613362139993, 0.3675025152851946] | reference_length: 52343 | translation_length: 60494
INFO fairseq2.recipes.hg.evaluator - Evaluation complete in 169 seconds
```
Does your PR introduce any breaking changes? If yes, please list them:
Hopefully none.
Check list:
- [X] Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
- [X] Did you read the contributor guideline?
- [X] Did you make sure that your PR does only one thing instead of bundling different changes together?
- [X] Did you make sure to update the documentation with your changes? (if necessary)
- [X] Did you write any new necessary tests?
- [X] Did you verify new and existing tests pass locally with your changes?
- [X] Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)
I noticed that running the evaluation with `max_samples > 20` on a 15GB VRAM setup was causing a CUDA out-of-memory error. To address this, I improved how memory is managed and freed between batches. With these updates, you can now run the evaluation on the entire dataset, provided the batch size fits within the available memory. I successfully tested this on the whole test split (approximately 2939 examples) using `whisper-tiny` and `wav2vec2_asr_base_10h` with `max_num_elements=10`.
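The pattern behind the memory fix can be sketched roughly as below. This is a minimal illustration, not the actual fairseq2 evaluator: `evaluate_batches`, `model`, `batches`, and `decode_fn` are hypothetical stand-ins.

```python
import gc

import torch


def evaluate_batches(model, batches, decode_fn):
    """Run evaluation batch by batch, releasing GPU memory eagerly.

    Hypothetical sketch: `model` maps a batch tensor to an output tensor,
    and `decode_fn` turns that output into a plain-Python result.
    """
    outputs = []
    for batch in batches:
        with torch.inference_mode():
            out = model(batch)  # forward pass only, no autograd graph
        outputs.append(decode_fn(out))  # move results off the GPU early
        # Drop references and return cached blocks to the allocator so the
        # next (possibly larger) batch does not hit an out-of-memory error.
        del batch, out
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    return outputs
```

The key point is that nothing from a finished batch (inputs, logits, intermediate tensors) is kept alive on the device while the next batch runs.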
Additionally, I found that the text data was being encoded only to be decoded again, so I removed this redundant step from the pipeline, along with other refactoring adjustments.
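The redundancy looked roughly like the toy sketch below. `ToyTokenizer`, `reference_before`, and `reference_after` are hypothetical names for illustration, not the fairseq2 API.

```python
class ToyTokenizer:
    """Stand-in for a real tokenizer (whitespace-based, for illustration)."""

    def __init__(self):
        self.vocab = {}  # token -> id
        self.inv = []    # id -> token

    def encode(self, text):
        ids = []
        for tok in text.split():
            if tok not in self.vocab:
                self.vocab[tok] = len(self.inv)
                self.inv.append(tok)
            ids.append(self.vocab[tok])
        return ids

    def decode(self, ids):
        return " ".join(self.inv[i] for i in ids)


def reference_before(tokenizer, text):
    # Old pipeline: encode the reference text only to decode it right back.
    return tokenizer.decode(tokenizer.encode(text))


def reference_after(text):
    # New pipeline: pass the raw text straight through to the metric.
    return text
```

Since the round trip is the identity for the reference text, dropping it saves work without changing the metric inputs.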
Edit: `max_num_elements` can be increased significantly by defining `max_audio_len`: some audio samples were quite long, and since the whole batch gets padded to the longest sequence, some batches required far more memory than expected.
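A length cap can be sketched as a simple pre-filter. `filter_by_audio_len` and the example dict layout are hypothetical, chosen only to show the idea.

```python
def filter_by_audio_len(examples, max_audio_len):
    """Drop examples whose audio exceeds `max_audio_len` samples.

    Hypothetical helper: each example is a dict with an "audio" sequence.
    A padded batch costs roughly batch_size * longest_len, so one very
    long clip inflates the memory of every example batched with it;
    capping the longest example bounds the worst-case batch memory.
    """
    return [ex for ex in examples if len(ex["audio"]) <= max_audio_len]
```

With outliers removed, padding overhead stays close to the average sequence length, which is what lets `max_num_elements` be raised safely.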
I will close this PR in favor of two smaller PRs:

- `AsrDatasetConfig`: will include all the refactoring made in this PR, along with a config for handling ASR datasets.
- Whisper Integration: will include just the Whisper-related code.

I will address all comments on the new PRs.
I have addressed all comments in the follow-up PRs #749 and #751.