
Feature Request: Add Original Language Detection to Gladia STT Plugin

Open mowtschan opened this issue 1 month ago • 3 comments

Summary

The Gladia STT plugin currently exposes target_language from translation data but does not expose the original_language (source language) that was detected/used during transcription. This information would be valuable for multilingual applications and language-aware workflows.

Current Behavior

The plugin currently handles translation data and sets the target language:

target_language = translation_data.get("target_language", "")
language = translated_utterance.get("language", target_language)
...
if translated_text and language:
    speech_data = stt.SpeechData(
        language=language,  # Use the target language
        ...

The SpeechData.language field is already being used to store the target_language for translated text. However, the original_language (the language that was actually spoken/detected) is not being captured or exposed anywhere.

Problem

When translation is enabled, we lose information about what language the user was actually speaking. The language field in SpeechData contains the target language (what the text was translated to), but there's no way to access the original/source language (what was actually spoken). This becomes critical when using multiple translation_target_languages. According to the documentation, when multiple target languages are specified, the plugin emits a separate transcription event for each language. Without the original language information, it becomes impossible to distinguish which translations came from which source language.
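To make the ambiguity concrete, here is a minimal sketch. It uses a simplified stand-in dataclass, not the real `livekit` `stt.SpeechData`, and hard-codes the two events the plugin would emit for a German utterance with `translation_target_languages=["de", "en"]`:

```python
from dataclasses import dataclass

# Stand-in for livekit's stt.SpeechData (simplified, hypothetical):
# only the fields relevant to this issue.
@dataclass
class SpeechData:
    language: str  # currently set to the *target* language for translations
    text: str

# With two target languages, the plugin emits one event per language.
# If the speaker said "Guten Tag." (German), the events look like:
events = [
    SpeechData(language="de", text="Guten Tag."),
    SpeechData(language="en", text="Good day."),
]

# No field records that the source was German, so nothing distinguishes
# the untranslated original from the translation.
originals = [e for e in events if hasattr(e, "original_language")]
print(originals)  # -> []
```

The empty result is the whole problem: both events are structurally identical except for `language`, which only names the target.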

Proposed Solution

Add support for capturing and exposing the original_language from Gladia's transcription/translation response.

Option: store it in the speaker_id field. Since SpeechData.language is already taken by the target language, the speaker_id field could potentially be used to carry the original language:

...
original_language = translation_data.get("original_language", "")
...
if translated_text and language:
    speech_data = stt.SpeechData(
        ...
        speaker_id=original_language,
        ...

Concerns: This feels like a misuse of the speaker_id field ;-(
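Rather than overloading `speaker_id`, an alternative would be a dedicated `original_language` field on `SpeechData`. This is only a sketch using a hypothetical stand-in dataclass (the real fix would add the field upstream in the livekit SDK), and the Gladia payload shape is assumed from the snippet above:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for stt.SpeechData with the proposed extra field.
@dataclass
class SpeechData:
    language: str                             # target language (semantics unchanged)
    text: str
    original_language: Optional[str] = None   # proposed: what was actually spoken

def from_translation(translation_data: dict, translated_text: str) -> SpeechData:
    """Build SpeechData from a Gladia-style translation payload (shape assumed)."""
    return SpeechData(
        language=translation_data.get("target_language", ""),
        text=translated_text,
        original_language=translation_data.get("original_language") or None,
    )

sd = from_translation(
    {"target_language": "en", "original_language": "de"}, "Good day."
)
print(sd.language, sd.original_language)  # -> en de
```

Because the new field defaults to `None`, existing consumers of `SpeechData` would be unaffected.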

mowtschan avatar Dec 26 '25 17:12 mowtschan

Hi, thanks for the detailed request! I'm able to retrieve the originally spoken language in SpeechData when iterating through the stt_node. Could you share how you were trying to access it for more context?

For reference, I override the STT node like so:

async def stt_node(self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings) -> AsyncIterable[stt.SpeechEvent]:
    async def event_stream():
        async for event in Agent.default.stt_node(self, audio, model_settings):
            print(event)
            yield event

    return event_stream()

And I get a stream of SpeechEvent, with the language in the alternatives field being the detected spoken one and not the target: SpeechEvent(type=<SpeechEventType.FINAL_TRANSCRIPT: 'final_transcript'>, request_id='...', alternatives=[SpeechData(language='zh', text='新年快乐', start_time=1.544, end_time=2.454, confidence=0.7050000000000001, speaker_id=None, is_primary_speaker=None, words=None)], recognition_usage=None)

tinalenguyen avatar Dec 27 '25 22:12 tinalenguyen

I'm trying to access it in the user_input_transcribed event on the agent session:

session = AgentSession()

@session.on("user_input_transcribed")
def on_transcript(event):
    logger.info(f"Transcript event: {event}")

    if event.is_final and event.speaker_id != event.language:  # assuming I misuse `speaker_id` for `original_language`
        session.say(event.transcript)
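For context, with a dedicated field the check above would not need to repurpose `speaker_id`. A sketch with a stand-in event object and the hypothetical `original_language` attribute proposed in this issue:

```python
from types import SimpleNamespace

# Stand-in for a user_input_transcribed event (hypothetical shape),
# assuming a proposed `original_language` attribute instead of speaker_id.
event = SimpleNamespace(
    is_final=True,
    transcript="Good day.",
    language="en",            # target language of this translation event
    original_language="de",   # proposed: what was actually spoken
)

# Speak back only translated text, i.e. events whose language differs
# from the source language.
is_translation = event.is_final and event.original_language != event.language
print(is_translation)  # -> True
```

The comparison reads the same as the `speaker_id` workaround, but the intent is explicit.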

mowtschan avatar Dec 29 '25 08:12 mowtschan

Overriding the stt_node also does not help: both events carry the transcript's language but not the original language (the originally spoken language was German):

SpeechEvent(type=<SpeechEventType.FINAL_TRANSCRIPT: 'final_transcript'>, ..., alternatives=[SpeechData(language='de', text='Guten Tag.', ...,  is_primary_speaker=None, words=['Guten', ' Tag.'])], recognition_usage=None)

SpeechEvent(type=<SpeechEventType.FINAL_TRANSCRIPT: 'final_transcript'>, ..., alternatives=[SpeechData(language='en', text='Good day.', ...,  is_primary_speaker=None, words=['Good', ' day.'])], recognition_usage=None)

I configure plugin like:

stt=gladia.STT(
    languages=["en", "de"],
    code_switching=True,
    ...
    translation_enabled=True,
    translation_target_languages=["de", "en"],
    ...

mowtschan avatar Dec 29 '25 09:12 mowtschan