Add support for streaming speech translation
> [!IMPORTANT]
> The `Update branch` button must only be pressed on very rare occasions. An outdated branch is never blocking the merge of a PR. Please reach out to the automation team before pressing that button.
# What does this PR do?
- Adds support for streaming speech translation using an LLM. It first performs streaming ASR, then simultaneously translates the transcribed source language into any target language.
- Users will see partial translations, which may be revised as new audio chunks arrive. After some point, a portion of the translation prefix will remain fixed.
- Implements two waiting strategies for streaming translation: `waitk` and Longest Common Prefix (LCP) (to enable LCP, set `waitk=-1`):
  - `waitk` specifies the maximum number of words the translation is allowed to lag behind the ASR transcript. If the translation falls more than `waitk` words behind, it automatically extends the prefix using the current translation. Set to `-1` to disable this rule and rely solely on the LCP between the current and previous translations.
  - Larger values of `waitk` lead to more coherent translations, but increase the cost of generation because the model must produce more tokens.
- The current implementation is based on the `EuroLLM` model, but users can add any other model supported by the vLLM engine. To do so, they must inherit from `PromptTemplate`, define the prompt, and specify how to `format` the prompt and `extract` the translation from the LLM's response.
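The prefix-commit logic described above could be sketched as follows. This is a minimal, self-contained illustration of the two strategies; the function names and the exact lag rule are assumptions for illustration, not the PR's actual code:

```python
def longest_common_prefix(prev_words, curr_words):
    """Words that match at the start of two consecutive hypotheses."""
    prefix = []
    for a, b in zip(prev_words, curr_words):
        if a != b:
            break
        prefix.append(a)
    return prefix

def update_fixed_prefix(fixed_prefix, prev_translation, curr_translation,
                        num_source_words, waitk=-1):
    """Return the new committed (fixed) translation prefix.

    waitk >= 0: if the committed prefix lags more than `waitk` words
    behind the ASR transcript, extend it from the current translation.
    waitk == -1: rely only on the LCP of consecutive hypotheses.
    """
    prev_words = prev_translation.split()
    curr_words = curr_translation.split()
    # Never shrink the committed prefix; only grow it via the LCP rule.
    lcp = longest_common_prefix(prev_words, curr_words)
    if len(lcp) > len(fixed_prefix):
        fixed_prefix = lcp
    if waitk >= 0 and num_source_words - len(fixed_prefix) > waitk:
        # Lagging too far behind the transcript: commit enough of the
        # current hypothesis to restore the wait-k lag bound.
        target_len = min(num_source_words - waitk, len(curr_words))
        if target_len > len(fixed_prefix):
            fixed_prefix = curr_words[:target_len]
    return fixed_prefix
```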
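For adding a new model, the `format`/`extract` contract might look roughly like this. The class and method bodies below are hypothetical and only mirror the description above; the actual `PromptTemplate` interface in this PR may differ:

```python
from dataclasses import dataclass

@dataclass
class PromptTemplate:
    """Base class: subclasses define the prompt and how to parse the reply."""
    src_lang: str
    tgt_lang: str

    def format(self, transcript: str, prefix: str) -> str:
        """Build the LLM prompt from the ASR transcript and fixed prefix."""
        raise NotImplementedError

    def extract(self, response: str) -> str:
        """Pull the translation out of the raw LLM response."""
        raise NotImplementedError

class MyModelTemplate(PromptTemplate):
    # Hypothetical template for a vLLM-served model.
    def format(self, transcript, prefix):
        return (f"Translate from {self.src_lang} to {self.tgt_lang}.\n"
                f"Source: {transcript}\nTranslation: {prefix}")

    def extract(self, response):
        # Keep only the first line of the model reply as the translation.
        return response.split("\n")[0].strip()
```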
**Collection**: [ASR]
# Changelog
- Add specific line by line info of high level changes in this PR.
# Environment setup
Due to dependency conflicts between NeMo and vLLM, the recommended way to use streaming speech translation is to build from the following Dockerfile: `scripts/installers/Dockerfile.speech_translation_vllm`. Follow the instructions below to build the Docker image and run the container:
```bash
# Build the Docker image
docker build -t speech_translation_vllm -f scripts/installers/Dockerfile.speech_translation_vllm .

# Run the container
docker run --gpus all -it speech_translation_vllm bash
```
# Usage

```python
# Add a code snippet demonstrating how to use this
```
# GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR. To re-run CI, remove and add the label again. To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
# Before your PR is "Ready for review"

**Pre checks**:
- [x] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation?
- [ ] Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
**PR Type**:
- [x] New Feature
- [ ] Bugfix
- [ ] Documentation
If you haven't finished some of the above items, you can still open a "Draft" PR.
# Who can review?

Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines contain specific people who can review PRs to various areas.
# Additional Information
- Related to # (issue)
Hi, could you add an example of how to run the streaming speech translation to the "Usage" section?