Integrate Whisper model with `hg` evaluation CLI interface
What does this PR do? Please describe:
This PR integrates the Whisper model with the Hugging Face (`hg`) evaluation CLI interface.
In the process, I have refactored some functions to be more generic and to support the Hugging Face Transformers API.
Demo:

```
$ fairseq2 hg asr /tmp/fairseq2 --config max_num_elements=20 model_name=openai/whisper-tiny.en dtype=torch.float32
[08/09/24 19:00:49] INFO fairseq2.recipes.hg.evaluator - Running evaluation on 1 device(s).
eval: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 147
[08/09/24 19:03:37] INFO fairseq2.recipes.hg.evaluator - Eval Metrics - BLEU: 0.497937 | Elapsed Time: 167s | Wall Time: 169s | brevity_penalty: 1.0 | length_ratio: 1.1557228282673901 | precisions: [0.676166231361788, 0.5496829119972201, 0.4500613362139993, 0.3675025152851946] | reference_length: 52343 | translation_length: 60494
INFO fairseq2.recipes.hg.evaluator - Evaluation complete in 169 seconds
```
Does your PR introduce any breaking changes? If yes, please list them:
Hopefully none.
Check list:
- [X] Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
- [X] Did you read the contributor guideline?
- [X] Did you make sure that your PR does only one thing instead of bundling different changes together?
- [X] Did you make sure to update the documentation with your changes? (if necessary)
- [X] Did you write any new necessary tests?
- [X] Did you verify new and existing tests pass locally with your changes?
- [X] Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)
I noticed that running the evaluation with `max_samples > 20` on a 15GB VRAM setup was causing a CUDA out-of-memory error. To address this, I improved how memory is managed and freed between batches. With these updates, you can now run the evaluation on the entire dataset, provided the batch size fits within the available memory. I successfully tested this on the whole test split (approximately 2939 examples) using `whisper-tiny` and `wav2vec2_asr_base_10h` with `max_num_elements=10`.
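The pattern behind the memory fix can be sketched roughly as below. This is a minimal illustration, not the actual fairseq2 evaluator: `evaluate_batches`, `model`, `batches`, and `decode_fn` are hypothetical stand-ins.

```python
import gc

import torch


def evaluate_batches(model, batches, decode_fn):
    """Run evaluation batch by batch, releasing GPU memory eagerly.

    Hypothetical sketch: `model` maps a batch tensor to an output tensor,
    and `decode_fn` turns that output into a plain-Python result.
    """
    outputs = []
    for batch in batches:
        with torch.inference_mode():
            out = model(batch)  # forward pass only, no autograd graph
        outputs.append(decode_fn(out))  # move results off the GPU early
        # Drop references and return cached blocks to the allocator so the
        # next (possibly larger) batch does not hit an out-of-memory error.
        del batch, out
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    return outputs
```

The key point is that nothing from a finished batch (inputs, logits, intermediate tensors) is kept alive on the device while the next batch runs.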
Additionally, I found that the text data was being encoded only to be decoded again, so I removed this redundant step from the pipeline, along with other refactoring adjustments.
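The redundancy looked roughly like the toy sketch below. `ToyTokenizer`, `reference_before`, and `reference_after` are hypothetical names for illustration, not the fairseq2 API.

```python
class ToyTokenizer:
    """Stand-in for a real tokenizer (whitespace-based, for illustration)."""

    def __init__(self):
        self.vocab = {}  # token -> id
        self.inv = []    # id -> token

    def encode(self, text):
        ids = []
        for tok in text.split():
            if tok not in self.vocab:
                self.vocab[tok] = len(self.inv)
                self.inv.append(tok)
            ids.append(self.vocab[tok])
        return ids

    def decode(self, ids):
        return " ".join(self.inv[i] for i in ids)


def reference_before(tokenizer, text):
    # Old pipeline: encode the reference text only to decode it right back.
    return tokenizer.decode(tokenizer.encode(text))


def reference_after(text):
    # New pipeline: pass the raw text straight through to the metric.
    return text
```

Since the round trip is the identity for the reference text, dropping it saves work without changing the metric inputs.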
Edit: `max_num_elements` can be increased significantly by defining `max_audio_len`: some audio samples were quite long, and since the whole batch gets padded to the longest sequence, some batches required far more memory than expected.
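A length cap can be sketched as a simple pre-filter. `filter_by_audio_len` and the example dict layout are hypothetical, chosen only to show the idea.

```python
def filter_by_audio_len(examples, max_audio_len):
    """Drop examples whose audio exceeds `max_audio_len` samples.

    Hypothetical helper: each example is a dict with an "audio" sequence.
    A padded batch costs roughly batch_size * longest_len, so one very
    long clip inflates the memory of every example batched with it;
    capping the longest example bounds the worst-case batch memory.
    """
    return [ex for ex in examples if len(ex["audio"]) <= max_audio_len]
```

With outliers removed, padding overhead stays close to the average sequence length, which is what lets `max_num_elements` be raised safely.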
I will close this PR in favor of two smaller PRs:

- `AsrDatasetConfig`: will include all the refactoring made in this PR, along with a config for handling ASR datasets.
- Whisper Integration: will include just the Whisper-related code.

I will address all comments on the new PRs.
I have addressed all comments in the follow-up PRs #749 and #751.