Add support for huggingface ASR datasets

Open ahmedsaed opened this issue 1 year ago • 0 comments

What does this PR do? Please describe: This PR builds on #740 by incorporating the refactoring changes and adding new commits that allow other datasets to be defined via configuration.

Example:

fairseq2 hg asr /tmp/fairseq2 --config max_samples=2 dataset_config.split=test dataset_config.dataset_path=google/fleurs dataset_config.dataset_name=en_us dataset_config.target_column=['transcription'] # override default preset

[08/15/24 03:55:44] INFO     fairseq2.recipes.hg.evaluator - Running evaluation on 1 device(s).                                                                                  
[08/15/24 03:57:55] INFO     fairseq2.recipes.hg.evaluator - Eval Metrics - BLEU: 0 | Elapsed Time: 130s | Wall Time: 131s | brevity_penalty: 1.0 | length_ratio: 1.0 |          
                             precisions: [0.375, 0.15789473684210525, 0.05555555555555555, 0.0] | reference_length: 40 | translation_length: 40                                  
                    INFO     fairseq2.recipes.hg.evaluator - Evaluation complete in 131 seconds

This evaluates a wav2vec2 model on the google/fleurs dataset just by overriding the relevant config values.
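
For readers unfamiliar with dotted overrides, the `key=value` pairs in the command above can be pictured as updates to a nested configuration. The following is a minimal sketch of that idea only, not fairseq2's actual override handling: `apply_overrides` is a hypothetical helper, and the preset values are placeholders invented for illustration.

```python
import ast


def apply_overrides(config, overrides):
    """Apply dotted key=value overrides to a nested dict (hypothetical helper)."""
    for item in overrides:
        key, _, raw = item.partition("=")
        try:
            # Parse Python-style literals such as 2 or ['transcription'].
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything that is not a literal (e.g. google/fleurs) stays a string.
            value = raw
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Walk down the nesting, creating intermediate levels if missing.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


# Placeholder preset; these defaults are assumptions, not the real preset.
preset = {
    "max_samples": None,
    "dataset_config": {"dataset_path": "placeholder/dataset", "split": "validation"},
}

config = apply_overrides(
    preset,
    [
        "max_samples=2",
        "dataset_config.split=test",
        "dataset_config.dataset_path=google/fleurs",
        "dataset_config.dataset_name=en_us",
        "dataset_config.target_column=['transcription']",
    ],
)
print(config)
```

Only the keys named on the command line change; everything else in the preset is left as it was, which is what makes a single preset reusable across datasets.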

Does your PR introduce any breaking changes? If yes, please list them: None.

Check list:

[x] Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
[x] Did you read the contributor guideline?
[x] Did you make sure that your PR does only one thing instead of bundling different changes together?
[x] Did you make sure to update the documentation with your changes? (if necessary)
[x] Did you write any new necessary tests?
[x] Did you verify new and existing tests pass locally with your changes?
[x] Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

Aug 15 '24 04:08 ahmedsaed