
Multilingual Agent (STT and TTS); is that possible with LiveKit?

Open vvv-001 opened this issue 1 year ago • 18 comments

Hi,

I recently started using LiveKit for building an Agent, so far I have been able to make it work with a simple RAG Example.

    stt_google = google.STT(
        languages=["nl-NL", "en-US"],
        detect_language=True, interim_results=True)

    stt_openai = openai.STT(detect_language=True)
    # how can I get the detected language here, to configure the TTS accordingly?

    tts = google.TTS(language="nl-NL", voice_name="nl-NL-Standard-C")

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=stt_openai,
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=tts,
        chat_ctx=initial_ctx,
        turn_detector=turn_detector.EOUModel(),
        will_synthesize_assistant_reply=will_synthesize_assistant_reply_rag,
    )

The scenario I am looking to implement is as follows (using either Google Speech or OpenAI Whisper): the user talks in a number of languages, e.g. English, Dutch, French, Spanish. Based on this, I want to detect the language spoken by the user and set the language the Agent speaks accordingly. I have been going through Slack and the documentation, but I am unable to find out how best to do this.

Any pointers, tips, or experiences are welcome. Thanks in advance.

vvv-001 avatar Jan 03 '25 15:01 vvv-001

I am having the same issue: when selecting two languages, it only applies the last language code in the list and does not detect any other languages. I am using Google STT:

 stt_google = google.STT(
        languages=["es-MX", "en-US"],
        detect_language=True)

aalkhulaifi605 avatar Jan 27 '25 20:01 aalkhulaifi605

Have you guys found any solutions for this?

0xLoukman avatar Mar 20 '25 18:03 0xLoukman

Have you guys found any solutions for this?

Unfortunately, no—I’m still searching for a solution.

aalkhulaifi605 avatar Mar 20 '25 18:03 aalkhulaifi605

You can achieve that by using a secondary STT like Whisper on Groq. I made some customizations to VoicePipelineAgent by overriding its user_stopped_speaking event emitter like this:

        def _on_end_of_speech(ev: vad.VADEvent) -> None:
            self._plotter.plot_event("user_stopped_speaking")
            self.emit("user_stopped_speaking", ev)
            self._deferred_validation.on_human_end_of_speech(ev)

After that, you can catch the VAD frames and send them to Whisper on Groq in parallel.

Note: you also need to override LiveKit's Whisper implementation like this:

    @override
    async def _recognize_impl(
        self,
        buffer: AudioBuffer,
        *,
        language: str | None,
        conn_options: APIConnectOptions = DEFAULT_API_CONNECT_OPTIONS,
    ) -> stt.SpeechEvent:
        try:
            config = self._sanitize_options(language=language)
            data = rtc.combine_audio_frames(buffer).to_wav_bytes()
            resp = await self._client.audio.transcriptions.create(
                file=(
                    "file.wav",
                    data,
                    "audio/wav",
                ),
                model=self._opts.model,
                language="",
                # verbose_json returns language and other details
                response_format="verbose_json",
                timeout=httpx.Timeout(30, connect=conn_options.timeout),
            )

            event = stt.SpeechEvent(type=stt.SpeechEventType.RECOGNITION_USAGE)
            if resp.segments:
                text, score, lang = "", float("inf"), None
                for segment in resp.segments:
                    # Whisper's usual silence heuristic: skip segments that are unlikely
                    # to contain speech (high no_speech_prob with a very low avg_logprob).
                    if segment.no_speech_prob > 0.6 and segment.avg_logprob < -1.0:
                        continue

                    text += f"{segment.text} "
                    if score == float("inf"):
                        score = math.exp(segment.avg_logprob)
                    else:
                        score += math.exp(segment.avg_logprob)
                    lang = iso639.Language.from_name(resp.language.title()).part1
                score /= len(resp.segments)

                event.type = stt.SpeechEventType.FINAL_TRANSCRIPT
                event.alternatives = [
                    stt.SpeechData(
                        text=text or resp.text or "",
                        language=lang or config.language or "",
                        confidence=score or 0.0,
                    )
                ]

            return event
        except APITimeoutError:
            raise APITimeoutError()
        except APIStatusError as e:
            raise APIStatusError(
                e.message,
                status_code=e.status_code,
                request_id=e.request_id,
                body=e.body,
            )
        except Exception as e:
            raise APIConnectionError() from e
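
For completeness, these are roughly the imports the override above relies on; module paths can differ between livekit-agents versions, so treat this as a best guess rather than the plugin's exact header:

    import math

    import httpx
    import iso639
    from typing_extensions import override

    from livekit import rtc
    from livekit.agents import (
        APIConnectionError,
        APIConnectOptions,
        APIStatusError,
        APITimeoutError,
        stt,
    )
    from livekit.agents.types import DEFAULT_API_CONNECT_OPTIONS
    from livekit.agents.utils import AudioBuffer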

Finally, you need to attach a callback that grabs that audio buffer and sends it to your language-inference routine (your_infer_language_function below), like this:

    async def handle_event(event: vad.VADEvent) -> None:
        resp = await your_infer_language_function(event.frames)
        if resp:
            # The lines below clear the pending Deepgram stream once a language has been
            # detected. This is important; otherwise the STT can generate false positives
            # because it is still listening for the wrong language.
            stt = cast(deepgram.STT, agent._stt)
            if stt._streams:
                stream = stt._streams.pop()
                stream._event_ch.send_nowait(resp)

    @agent.on("user_stopped_speaking")
    def speech_ended(ev: vad.VADEvent):
        _ = asyncio.create_task(handle_event(ev))
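
For readers who want a starting point, here is a minimal sketch of what your_infer_language_function could look like, assuming Groq's OpenAI-compatible Whisper endpoint; the client setup, the whisper-large-v3 model name, and the GROQ_API_KEY environment variable are assumptions on my side, not LiveKit APIs:

    import os

    import iso639
    import openai
    from livekit import rtc
    from livekit.agents import stt

    # Hypothetical Groq client via the OpenAI-compatible endpoint (an assumption, not a LiveKit API).
    groq_client = openai.AsyncOpenAI(
        api_key=os.environ["GROQ_API_KEY"],
        base_url="https://api.groq.com/openai/v1",
    )

    async def your_infer_language_function(frames: list[rtc.AudioFrame]) -> stt.SpeechEvent | None:
        # Combine the buffered VAD frames into a single WAV payload.
        wav = rtc.combine_audio_frames(frames).to_wav_bytes()

        # verbose_json returns the detected language alongside the transcript.
        resp = await groq_client.audio.transcriptions.create(
            file=("speech.wav", wav, "audio/wav"),
            model="whisper-large-v3",
            response_format="verbose_json",
        )
        if not resp.text:
            return None

        # Whisper reports a full language name ("English"); convert it to the ISO 639-1
        # code that providers such as Deepgram expect.
        lang = iso639.Language.from_name(resp.language.title()).part1 if resp.language else ""

        return stt.SpeechEvent(
            type=stt.SpeechEventType.FINAL_TRANSCRIPT,
            alternatives=[stt.SpeechData(text=resp.text, language=lang)],
        )

The returned event carries the ISO 639-1 code in its language field, which is what the handler above pushes into the Deepgram stream.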


imsakg avatar Mar 21 '25 12:03 imsakg

@imsakg interesting, would this also work with other STT providers? Like Deepgram?

Whisper by itself already detects the language by default. So far we have noticed that the best STT model for languages other than English is Deepgram. The problem there is that you need to define the STT language in advance, so if the user talks in English and you have set the language to French, it will not produce an English transcript.

This approach would then only work with Whisper, Google STT, etc. (which support multilingual cases), is that correct?

vvv-001 avatar Mar 21 '25 13:03 vvv-001

@imsakg interesting, would this also work with other STT providers? Like Deepgram?

@vvv-001 Yup! I'm using Deepgram as the only STT provider in my pipeline; Whisper is used for language inference only. Whenever Whisper detects a language, I send the detected language to Deepgram and update it with the new language.

Here is a little code snippet:

agent._stt.update_options(language=language_detected_by_whisper)

BTW, this technique works flawlessly but I can't share more details. You need to figure it out on your own.

imsakg avatar Mar 21 '25 15:03 imsakg

@imsakg why not just use Gemini multilingual STT?

0xLoukman avatar Mar 21 '25 16:03 0xLoukman

Hey @imsakg, could you share a bit more context? Since we’re all benefiting from open source, I believe it’s important we also contribute back to it.

firattamur avatar Mar 29 '25 20:03 firattamur

Hey @imsakg, could you share a bit more context?

Hey, I believe I have provided enough context along with the source code. I also think anyone can accomplish this by following my previous posts with minimal effort.

Since we’re all benefiting from open source, I believe it’s important we also contribute back to it.

I have contributed to many open-source projects, including LiveKit. As I mentioned, you can achieve your goals by reading my previous comments and implementing the code blocks I shared on your own.

imsakg avatar Mar 30 '25 14:03 imsakg

LiveKit has a recipe for multi-language support. Check if this helps: https://github.com/livekit-examples/python-agents-examples/blob/main/pipeline-tts/elevenlabs_change_language.py

atharva-create avatar Apr 19 '25 14:04 atharva-create

@atharva-create this doesn't work, so I am not sure how they came up with this example.

If I set the STT like this with Deepgram:

        stt=deepgram.STT(
            model="nova-2-general",
            language="nl"
        ),

It won't transcribe any language other than Dutch (nl), especially with nova-2-general. So you can speak as much English into it as you like, but it won't work.

It only works one way: when the STT is set to 'nl', I can tell it in Dutch to switch to English, but I cannot tell it in English to switch to English, which rather defeats the purpose of multilingual support.

vvv-001 avatar Apr 23 '25 18:04 vvv-001

@vvv-001

Here's how I've been able to achieve this:

AgentSession(
    stt=deepgram.STT(model="nova-3", language="multi")
    ...
)

  • Create a tool to switch language. I'm using Cartesia for TTS.

from livekit.agents import RunContext, function_tool

LANGUAGE_OPTIONS = {
    "en": {"voice": "<<english-voice-id>>", "greeting": "Hello, how can I help you today?"},
    "es": {"voice": "<<spanish-voice-id>>", "greeting": "Hola, en que puedo ayudarte?"},
}


@function_tool(
    description=(
        "Switch to speaking the specified language."
        f"language is one of the following: {','.join(LANGUAGE_OPTIONS.keys())}"
    )
)
async def switch_language(context: RunContext, language: str):
    option = LANGUAGE_OPTIONS.get(language)
    if option is None:
        raise ValueError(f"Unsupported language: {language}")

    tts = context.session.tts
    if tts is not None:
        tts.update_options(language=language, voice=option["voice"])

    await context.session.say(option["greeting"])

  • Add the following to the system prompt:

Respond to users in the same language they speak. You support English and Spanish. Detect the user's language and reply in that language. If the user requests an unsupported language, politely reply that you only support English and Spanish.
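
For anyone wiring this up end to end, here is a hedged sketch of how the session, the tool, and the prompt could fit together, assuming the LiveKit Agents 1.x Agent / AgentSession API with the Deepgram, OpenAI, Cartesia, and Silero plugins; model names and voice IDs are placeholders:

    from livekit.agents import Agent, AgentSession, JobContext
    from livekit.plugins import cartesia, deepgram, openai, silero


    async def entrypoint(ctx: JobContext):
        await ctx.connect()

        # The instructions tell the LLM to follow the user's language and call switch_language.
        agent = Agent(
            instructions=(
                "Respond to users in the same language they speak. You support English "
                "and Spanish. Detect the user's language and reply in that language. If "
                "the user requests an unsupported language, politely reply that you only "
                "support English and Spanish."
            ),
            tools=[switch_language],
        )

        session = AgentSession(
            vad=silero.VAD.load(),
            stt=deepgram.STT(model="nova-3", language="multi"),  # multilingual transcription
            llm=openai.LLM(model="gpt-4o-mini"),
            tts=cartesia.TTS(voice=LANGUAGE_OPTIONS["en"]["voice"], language="en"),
        )

        await session.start(agent=agent, room=ctx.room)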

oozzal avatar Jul 24 '25 04:07 oozzal

Hi @oozzal, thanks for the message. We are sticking with nova-2-general from Deepgram, because nova-3-general does not work well in multi mode for some of the specific languages we need supported.

We also talked to Deepgram about the nova-3 issues in multi mode; they are aware of them and are working on a fix.

vvv-001 avatar Jul 24 '25 21:07 vvv-001

@vvv-001 how have you solved the multilingual problem with nova-2-general?

abhismatrix1 avatar Jul 24 '25 23:07 abhismatrix1

Here is my solution:

Listen to the first 3 seconds (or your desired duration; just set it in your function) of the user's voice and detect the language with Whisper, which returns the detected language as JSON.

        @ctx.room.on("track_subscribed")
        def _on_track_subscribed(track: rtc.Track, pub: rtc.RemoteTrackPublication, participant: rtc.RemoteParticipant):
            nonlocal user_audio_track
            if track.kind == rtc.TrackKind.KIND_AUDIO:
                user_audio_track = track

        async def _start_language_detection():
            nonlocal lang_detection_started
            if lang_detection_started or not user_audio_track:
                return

            async def _apply_lang(lang_code: str | None, transcript: str):
                if not lang_code:
                    return

                try:
                    await agent._switch_language(lang_code)
                except Exception as e:
                    logger.warning(f"Failed to persist/switch language: {e}")

                try:
                    if hasattr(livekit_session, "stt") and livekit_session.stt:
                        livekit_session.stt.update_options(language=lang_code)
                        logger.info(f"STT language updated to: {lang_code}")
                except Exception as e:
                    logger.warning(f"STT update_options failed: {e}")
 
                agent.lang_locked = True
                logger.info(f"Language locked: {lang_code}")

            await simple_lang_detect_from_audio(
                track=user_audio_track,
                api_key=whisper_api_key,
                on_language_detected=_apply_lang,
                max_duration=3
            )
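
For readers who want a starting point, here is a hedged sketch of what a simple_lang_detect_from_audio helper could look like; the whisper-1 model, the iso639 conversion, and the buffering logic are assumptions rather than the exact implementation used above:

    import iso639
    import openai
    from livekit import rtc

    async def simple_lang_detect_from_audio(
        *,
        track: rtc.Track,
        api_key: str,
        on_language_detected,
        max_duration: float = 3.0,
    ) -> None:
        client = openai.AsyncOpenAI(api_key=api_key)

        # Collect up to `max_duration` seconds of audio from the user's track.
        frames: list[rtc.AudioFrame] = []
        collected = 0.0
        audio_stream = rtc.AudioStream(track)
        async for ev in audio_stream:
            frames.append(ev.frame)
            collected += ev.frame.samples_per_channel / ev.frame.sample_rate
            if collected >= max_duration:
                break
        await audio_stream.aclose()

        if not frames:
            return

        wav = rtc.combine_audio_frames(frames).to_wav_bytes()

        # verbose_json includes the detected language alongside the transcript.
        resp = await client.audio.transcriptions.create(
            file=("lang_probe.wav", wav, "audio/wav"),
            model="whisper-1",
            response_format="verbose_json",
        )

        # Whisper reports a full language name ("English"); convert it to an ISO 639-1
        # code before handing it to the callback, which forwards it to the STT.
        lang_code = None
        if getattr(resp, "language", None):
            lang_code = iso639.Language.from_name(resp.language.title()).part1
        await on_language_detected(lang_code, resp.text or "")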

CanGoymen avatar Sep 10 '25 14:09 CanGoymen

Hi all, I am using Azure STT for the multilingual part in a LiveKit agent, but I am unable to achieve my goal: it detects only one language from the language list.

Ritesh1244 avatar Nov 06 '25 04:11 Ritesh1244

What is the latency that you guys are achieving with these approaches?

Sameerig avatar Dec 18 '25 06:12 Sameerig

What is the latency that you guys are achieving with these approaches?

Not much. Groq's Whisper is one of the fastest STTs available on the market. Alternatively, you can use ElevenLabs' Scribe v2, which is multilingual and works very well.

imsakg avatar Dec 18 '25 06:12 imsakg