fairseq2
Add support for Hugging Face ASR datasets
What does this PR do? Please describe: This PR builds upon #740 by incorporating the refactoring changes here and adding new commits that enable support for defining other datasets via configuration.
Example:
fairseq2 hg asr /tmp/fairseq2 --config max_samples=2 dataset_config.split=test dataset_config.dataset_path=google/fleurs dataset_config.dataset_name=en_us dataset_config.target_column=['transcription'] # override default preset
[08/15/24 03:55:44] INFO fairseq2.recipes.hg.evaluator - Running evaluation on 1 device(s).
[08/15/24 03:57:55] INFO fairseq2.recipes.hg.evaluator - Eval Metrics - BLEU: 0 | Elapsed Time: 130s | Wall Time: 131s | brevity_penalty: 1.0 | length_ratio: 1.0 | precisions: [0.375, 0.15789473684210525, 0.05555555555555555, 0.0] | reference_length: 40 | translation_length: 40
INFO fairseq2.recipes.hg.evaluator - Evaluation complete in 131 seconds
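The `BLEU: 0` in the log above is consistent with the reported precisions: standard unsmoothed BLEU is the brevity penalty times the geometric mean of the n-gram precisions, so a single zero precision (here the 4-gram one, which is likely with only two samples) forces the score to zero. The sketch below reproduces that arithmetic from the logged values. It assumes uniform n-gram weights and no smoothing; `bleu_from_stats` is a hypothetical helper written for illustration, not a fairseq2 function.

```python
import math


def bleu_from_stats(precisions, brevity_penalty):
    """Combine n-gram precisions into an unsmoothed BLEU score on a 0-1 scale.

    Assumes uniform weights over the n-gram orders (illustrative helper only).
    """
    # Without smoothing, any zero precision makes the geometric mean zero.
    if min(precisions) <= 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)


# Values copied from the evaluation log above.
logged_precisions = [0.375, 0.15789473684210525, 0.05555555555555555, 0.0]
logged_brevity_penalty = 1.0

# The 4-gram precision is 0.0, so the combined score collapses to zero.
print(bleu_from_stats(logged_precisions, logged_brevity_penalty))

# Dropping the 4-gram term gives a non-zero value, which shows that the
# zero score comes from the missing 4-gram matches alone.
print(bleu_from_stats(logged_precisions[:3], logged_brevity_penalty))
```

With `max_samples=2` the run is a smoke test of the configuration path rather than a meaningful quality measurement.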
The command above evaluates a wav2vec2 model on the google/fleurs dataset just by overriding the relevant config values, with no code changes.
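To make the override mechanism concrete, here is a minimal sketch of how dotted `key=value` arguments like the ones in the example can be merged into a nested preset. It is an illustration under stated assumptions, not fairseq2's actual config parser: `apply_overrides` and the default preset values are hypothetical, and a plain `dict` stands in for the real config object.

```python
import ast


def apply_overrides(config, overrides):
    """Merge dotted `key=value` strings into a nested dict (illustrative only)."""
    for item in overrides:
        key, _, raw = item.partition("=")
        try:
            # Parse Python literals such as 2 or ['transcription'].
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything else (e.g. google/fleurs, en_us) is kept as a string.
            value = raw
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


# Hypothetical default preset; the real defaults live in the recipe.
preset = {
    "max_samples": None,
    "dataset_config": {
        "dataset_path": "some/default-dataset",
        "split": "validation",
    },
}

# The same overrides as in the example command.
overrides = [
    "max_samples=2",
    "dataset_config.split=test",
    "dataset_config.dataset_path=google/fleurs",
    "dataset_config.dataset_name=en_us",
    "dataset_config.target_column=['transcription']",
]

config = apply_overrides(preset, overrides)
print(config)
```

Only the keys named on the command line change; every other field keeps its preset value.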
Does your PR introduce any breaking changes? If yes, please list them: None.
Check list:
- [x] Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
- [x] Did you read the contributor guideline?
- [x] Did you make sure that your PR does only one thing instead of bundling different changes together?
- [x] Did you make sure to update the documentation with your changes? (if necessary)
- [x] Did you write any new necessary tests?
- [x] Did you verify new and existing tests pass locally with your changes?
- [x] Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)