Add support for streaming speech translation
> [!IMPORTANT]
> The `Update branch` button must only be pressed on very rare occasions. An outdated branch is never blocking the merge of a PR. Please reach out to the automation team before pressing that button.
# What does this PR do?
- Adds support for streaming speech translation using an LLM. It first performs streaming ASR, then simultaneously translates the transcribed source language into any target language.
- Users will see partial translations, which may be revised as new audio chunks arrive. After some point, a portion of the translation prefix will remain fixed.
- Implements two waiting strategies for streaming translation: `waitk` and Longest Common Prefix (LCP) (to enable LCP, set `waitk=-1`):
  - `waitk` specifies the maximum number of words the translation is allowed to lag behind the ASR transcript. If the translation falls more than `waitk` words behind, it automatically extends the prefix using the current translation. Set to `-1` to disable this rule and rely solely on the LCP between the current and previous translations.
  - Larger values of `waitk` lead to more coherent translations, but increase the cost of generation because the model must produce more tokens.
- The current implementation is based on the `EuroLLM` model, but users can add any other model supported by the vLLM engine. To do so, they must inherit from `PromptTemplate`, define the prompt, and specify how to `format` the prompt and `extract` the translation from the LLM's response.
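The prefix-commit logic described above could be sketched as follows. This is a minimal, self-contained illustration of the two strategies; the function names and the exact lag rule are assumptions for illustration, not the PR's actual code:

```python
def longest_common_prefix(prev_words, curr_words):
    """Words that match at the start of two consecutive hypotheses."""
    prefix = []
    for a, b in zip(prev_words, curr_words):
        if a != b:
            break
        prefix.append(a)
    return prefix

def update_fixed_prefix(fixed_prefix, prev_translation, curr_translation,
                        num_source_words, waitk=-1):
    """Return the new committed (fixed) translation prefix.

    waitk >= 0: if the committed prefix lags more than `waitk` words
    behind the ASR transcript, extend it from the current translation.
    waitk == -1: rely only on the LCP of consecutive hypotheses.
    """
    prev_words = prev_translation.split()
    curr_words = curr_translation.split()
    # Never shrink the committed prefix; only grow it via the LCP rule.
    lcp = longest_common_prefix(prev_words, curr_words)
    if len(lcp) > len(fixed_prefix):
        fixed_prefix = lcp
    if waitk >= 0 and num_source_words - len(fixed_prefix) > waitk:
        # Lagging too far behind the transcript: commit enough of the
        # current hypothesis to restore the wait-k lag bound.
        target_len = min(num_source_words - waitk, len(curr_words))
        if target_len > len(fixed_prefix):
            fixed_prefix = curr_words[:target_len]
    return fixed_prefix
```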
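For adding a new model, the `format`/`extract` contract might look roughly like this. The class and method bodies below are hypothetical and only mirror the description above; the actual `PromptTemplate` interface in this PR may differ:

```python
from dataclasses import dataclass

@dataclass
class PromptTemplate:
    """Base class: subclasses define the prompt and how to parse the reply."""
    src_lang: str
    tgt_lang: str

    def format(self, transcript: str, prefix: str) -> str:
        """Build the LLM prompt from the ASR transcript and fixed prefix."""
        raise NotImplementedError

    def extract(self, response: str) -> str:
        """Pull the translation out of the raw LLM response."""
        raise NotImplementedError

class MyModelTemplate(PromptTemplate):
    # Hypothetical template for a vLLM-served model.
    def format(self, transcript, prefix):
        return (f"Translate from {self.src_lang} to {self.tgt_lang}.\n"
                f"Source: {transcript}\nTranslation: {prefix}")

    def extract(self, response):
        # Keep only the first line of the model reply as the translation.
        return response.split("\n")[0].strip()
```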
**Collection**: [ASR]
# Changelog
- Add specific line by line info of high level changes in this PR.
# Environment setup
Due to dependency conflicts between NeMo and vLLM, the recommended way to use streaming speech translation is to build from the following Dockerfile: `scripts/installers/Dockerfile.speech_translation_vllm`. Follow the instructions below to build the Docker image and run the container:
```bash
# Build the Docker image
docker build -t speech_translation_vllm -f scripts/installers/Dockerfile.speech_translation_vllm .

# Run the container
docker run --gpus all -it speech_translation_vllm bash
```
# Usage

```python
# Add a code snippet demonstrating how to use this
```
# GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR. To re-run CI, remove and add the label again. To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
# Before your PR is "Ready for review"

**Pre checks**:
- [x] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation?
- [ ] Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
**PR Type**:
- [x] New Feature
- [ ] Bugfix
- [ ] Documentation
If you haven't finished some of the above items, you can still open a "Draft" PR.
# Who can review?

Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines contain specific people who can review PRs to various areas.
# Additional Information
- Related to # (issue)
Hi, could you add an example of how to run the streaming speech translation to the "Usage" section?