Multilingual Agent (STT and TTS); is that possible with LiveKit?
Hi,
I recently started using LiveKit to build an agent; so far I have been able to get it working with a simple RAG example.
stt_google = google.STT(
    languages=["nl-NL", "en-US"],
    detect_language=True,
    interim_results=True,
)
stt_openai = openai.STT(detect_language=True)
language = stt_openai.  # this is where I got stuck: how do I read the detected language?
tts = google.TTS(language="nl-NL", voice_name="nl-NL-Standard-C")
agent = VoicePipelineAgent(
    vad=ctx.proc.userdata["vad"],
    stt=stt_openai,
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=tts,
    chat_ctx=initial_ctx,
    turn_detector=turn_detector.EOUModel(),
    will_synthesize_assistant_reply=will_synthesize_assistant_reply_rag,
)
The scenario I am looking to implement is as follows (using either Google Speech or OpenAI Whisper): the user speaks one of several languages (English, Dutch, French, Spanish, etc.). Based on that, I want to detect which language the user spoke and set the language the agent speaks back in. I have been going through Slack and the documentation, but I am unable to find out how best to do this.
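Roughly, what I imagine is something like the sketch below (just my guess at the wiring; the only relevant thing I could spot in the plugin code is that stt.SpeechData carries a language field):
from livekit.agents import stt

# sketch only: read the detected language from a final transcript event,
# then switch the agent's TTS voice/language accordingly
def on_final_transcript(ev: stt.SpeechEvent) -> None:
    if ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT and ev.alternatives:
        detected = ev.alternatives[0].language  # e.g. "nl" or "en"
        # ...switch the TTS to a voice for `detected` here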
Any pointers, tips, or experiences are welcome. Thanks in advance.
I am having the same issue: when selecting two languages it only applies the last language code in the list and doesn't detect any other languages. I am using Google STT:
stt_google = google.STT(
    languages=["es-MX", "en-US"],
    detect_language=True,
)
Have you guys found any solutions for this?
> Have you guys found any solutions for this?
Unfortunately, no—I’m still searching for a solution.
You can achieve that by using a secondary STT like Whisper on Groq. I made some customizations to VoicePipelineAgent by overriding its user_stopped_speaking event emitter like this:
def _on_end_of_speech(ev: vad.VADEvent) -> None:
    self._plotter.plot_event("user_stopped_speaking")
    self.emit("user_stopped_speaking", ev)
    self._deferred_validation.on_human_end_of_speech(ev)
After that, you can capture the VAD frames and send them to Whisper on Groq in parallel.
Note: you also need to override LiveKit's Whisper implementation, like this:
# module-level imports this override relies on: math, httpx, iso639,
# plus livekit.rtc and the livekit.agents stt / API error types
@override
async def _recognize_impl(
    self,
    buffer: AudioBuffer,
    *,
    language: str | None,
    conn_options: APIConnectOptions = DEFAULT_API_CONNECT_OPTIONS,
) -> stt.SpeechEvent:
    try:
        config = self._sanitize_options(language=language)
        data = rtc.combine_audio_frames(buffer).to_wav_bytes()
        resp = await self._client.audio.transcriptions.create(
            file=(
                "file.wav",
                data,
                "audio/wav",
            ),
            model=self._opts.model,
            # leave the language empty so Whisper auto-detects it
            language="",
            # verbose_json returns language and other details
            response_format="verbose_json",
            timeout=httpx.Timeout(30, connect=conn_options.timeout),
        )
        event = stt.SpeechEvent(type=stt.SpeechEventType.RECOGNITION_USAGE)
        if resp.segments:
            text, score, lang = "", float("inf"), None
            for segment in resp.segments:
                # skip segments Whisper flags as probable non-speech
                # (the usual Whisper silence heuristic)
                if segment.no_speech_prob > 0.6 and segment.avg_logprob < -1.0:
                    continue
                text += f"{segment.text} "
                if score == float("inf"):
                    score = math.exp(segment.avg_logprob)
                else:
                    score += math.exp(segment.avg_logprob)
                lang = iso639.Language.from_name(resp.language.title()).part1
            if score == float("inf"):
                score = 0.0
            else:
                score /= len(resp.segments)
            event.type = stt.SpeechEventType.FINAL_TRANSCRIPT
            event.alternatives = [
                stt.SpeechData(
                    text=text or resp.text or "",
                    language=lang or config.language or "",
                    confidence=score or 0.0,
                )
            ]
        return event
    except APITimeoutError:
        raise APITimeoutError()
    except APIStatusError as e:
        raise APIStatusError(
            e.message,
            status_code=e.status_code,
            request_id=e.request_id,
            body=e.body,
        )
    except Exception as e:
        raise APIConnectionError() from e
Finally, you need to attach a callback that grabs that audio buffer and sends it to your language-inference routine, like this:
async def handle_event(event: vad.VADEvent) -> None:
    resp = await your_infer_language_function(event.frames)
    if resp:
        # The lines below drop the pending stream once the language has been detected.
        # This is important: otherwise the STT can produce false positives for your speech,
        # since it is still listening for the wrong language.
        stt = cast(deepgram.STT, agent._stt)
        if stt._streams:
            stream = stt._streams.pop()
            stream._event_ch.send_nowait(resp)

@agent.on("user_stopped_speaking")
def speech_ended(ev: vad.VADEvent):
    _ = asyncio.create_task(handle_event(ev))
@imsakg interesting, would this also work with other STT providers? Like Deepgram?
Whisper by itself already detects the language by default. So far we've found that the best STT model for languages other than English is Deepgram. The problem there is that you need to define the STT model's language in advance, so if the user talks in English but you've set the language to French, you won't get an English transcript.
This approach would then only work with Whisper, Google STT, etc. (which support multilingual use cases), is that correct?
> @imsakg interesting, would this also work with other STT providers? Like Deepgram?
@vvv-001 Yupp! I'm using Deepgram as the only STT provider in my pipeline; I use Whisper for language inference only. Whenever Whisper detects a language, I send the detected language to Deepgram and set it as the new language.
Here is a little code snippet:
agent._stt.update_options(language=language_detected_by_whisper)
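Expanded slightly, the hand-off between the two looks roughly like this (a sketch only; your_infer_language_function is the same placeholder as in the earlier snippet, i.e. it returns the stt.SpeechEvent produced by the overridden Whisper _recognize_impl):
import asyncio
from typing import cast

from livekit.agents import vad
from livekit.plugins import deepgram

async def switch_stt_language(ev: vad.VADEvent) -> None:
    # run Whisper (on Groq) over the buffered frames in parallel with Deepgram
    resp = await your_infer_language_function(ev.frames)
    if resp and resp.alternatives:
        lang = resp.alternatives[0].language  # e.g. "nl" or "en"
        if lang:
            # re-point Deepgram at the language Whisper just detected
            cast(deepgram.STT, agent._stt).update_options(language=lang)

# `agent` is the VoicePipelineAgent from the snippets above
@agent.on("user_stopped_speaking")
def _on_user_stopped_speaking(ev: vad.VADEvent):
    asyncio.create_task(switch_stt_language(ev))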
BTW, this technique works flawlessly but I can't share more details. You need to figure it out on your own.
@imsakg why not just use Gemini multilingual STT?
Hey @imsakg, could you share a bit more context? Since we’re all benefiting from open source, I believe it’s important we also contribute back to it.
> Hey @imsakg, could you share a bit more context?
Hey, I believe I have provided enough context, along with the source code. I also think anyone can accomplish this by following my previous posts with minimal effort.
> Since we’re all benefiting from open source, I believe it’s important we also contribute back to it.
I have contributed to many open-source projects, including LiveKit. As I mentioned, you can achieve your goal by reading my previous comments and implementing the code blocks I shared on your own.
LiveKit has a recipe for multiple language support. Check if this helps - https://github.com/livekit-examples/python-agents-examples/blob/main/pipeline-tts/elevenlabs_change_language.py
@atharva-create this doesn't work, so I am not sure how they came up with this example.
If I set stt like this with Deepgram:
stt=deepgram.STT(
    model="nova-2-general",
    language="nl",
),
It won't transcribe any language other than Dutch (nl), especially with nova-2-general. You can speak as much English into it as you like, but it won't work.
It only works one way: with the STT set to 'nl', I have to tell it in Dutch to switch to English; I can't tell it in English to switch to English, which rather defeats the purpose of being multilingual.
@vvv-001
Here's how I've been able to achieve this (a rough end-to-end wiring sketch follows after the steps):
- Set the Deepgram STT language to "multi" so that Deepgram can understand any spoken language:
AgentSession(
    stt=deepgram.STT(model="nova-3", language="multi"),
    ...
)
- Create a tool to switch language. I'm using Cartesia for TTS.
from livekit.agents import RunContext, function_tool

LANGUAGE_OPTIONS = {
    "en": {"voice": "<<english-voice-id>>", "greeting": "Hello, how can I help you today?"},
    "es": {"voice": "<<spanish-voice-id>>", "greeting": "Hola, ¿en qué puedo ayudarte?"},
}

@function_tool(
    description=(
        "Switch to speaking the specified language. "
        f"language is one of the following: {','.join(LANGUAGE_OPTIONS.keys())}"
    )
)
async def switch_language(context: RunContext, language: str):
    option = LANGUAGE_OPTIONS.get(language)
    if option is None:
        raise ValueError(f"Unsupported language: {language}")
    tts = context.session.tts
    if tts is not None:
        tts.update_options(language=language, voice=option["voice"])
    await context.session.say(option["greeting"])
- Add the following to the system prompt:
Respond to users in the same language they speak. You support English and Spanish. Detect the user's language and reply in that language. If the user requests an unsupported language, politely reply that you only support English and Spanish.
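For completeness, here is roughly how these pieces could be wired together (a sketch only, assuming the livekit-agents 1.x AgentSession API; the entrypoint shape, plugin choices, and voice IDs are placeholders rather than anything prescribed by LiveKit):
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import cartesia, deepgram, openai, silero

async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        stt=deepgram.STT(model="nova-3", language="multi"),  # understands any spoken language
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=cartesia.TTS(voice="<<english-voice-id>>"),
        vad=silero.VAD.load(),
    )
    agent = Agent(
        instructions=(
            "Respond to users in the same language they speak. "
            "You support English and Spanish. Detect the user's language and reply "
            "in that language. If the user requests an unsupported language, politely "
            "reply that you only support English and Spanish."
        ),
        tools=[switch_language],  # the function_tool defined above
    )
    await ctx.connect()
    await session.start(agent=agent, room=ctx.room)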
Hi @oozzal, thanks for the message. We're sticking with nova-2-general from Deepgram because nova-3-general doesn't work well in multi mode for some of the specific languages we want supported.
We also talked to Deepgram about the nova-3 multi-mode issues; they are aware of them and are working on a fix.
@vvv-001 how have you solved the problem of multilingual support with nova-2-general?
Here is my solution:
Listen to the first 3 seconds of the user's voice (or whatever duration you set in your function) and detect the language with Whisper, which returns the detected language as JSON. A sketch of that detection helper follows after the snippet below.
@ctx.room.on("track_subscribed")
def _on_track_subscribed(track: rtc.Track, pub: rtc.RemoteTrackPublication, participant: rtc.RemoteParticipant):
    nonlocal user_audio_track
    if track.kind == rtc.TrackKind.KIND_AUDIO:
        user_audio_track = track

async def _start_language_detection():
    nonlocal lang_detection_started
    if lang_detection_started or not user_audio_track:
        return
    lang_detection_started = True  # only run the detection once

    async def _apply_lang(lang_code: str | None, transcript: str):
        if not lang_code:
            return
        try:
            await agent._switch_language(lang_code)
        except Exception as e:
            logger.warning(f"Failed to persist/switch language: {e}")
        try:
            if hasattr(livekit_session, "stt") and livekit_session.stt:
                livekit_session.stt.update_options(language=lang_code)
                logger.info(f"STT language updated to: {lang_code}")
        except Exception as e:
            logger.warning(f"STT update_options failed: {e}")
        agent.lang_locked = True
        logger.info(f"Language locked: {lang_code}")

    await simple_lang_detect_from_audio(
        track=user_audio_track,
        api_key=whisper_api_key,
        on_language_detected=_apply_lang,
        max_duration=3,
    )
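For reference, the detection helper itself could look roughly like this (a sketch only: the audio buffering, the OpenAI/Groq client call, and the iso639 name-to-code conversion are assumptions on my part, not LiveKit APIs):
import iso639
from openai import AsyncOpenAI
from livekit import rtc

async def simple_lang_detect_from_audio(track, api_key, on_language_detected, max_duration=3):
    # buffer roughly max_duration seconds of audio from the user's track
    frames: list[rtc.AudioFrame] = []
    buffered = 0.0
    stream = rtc.AudioStream(track)
    async for ev in stream:
        frames.append(ev.frame)
        buffered += ev.frame.samples_per_channel / ev.frame.sample_rate
        if buffered >= max_duration:
            break
    await stream.aclose()
    if not frames:
        return

    # send the clip to Whisper; verbose_json includes the detected language name
    wav = rtc.combine_audio_frames(frames).to_wav_bytes()
    client = AsyncOpenAI(api_key=api_key)  # point base_url at Groq's OpenAI-compatible API if desired
    resp = await client.audio.transcriptions.create(
        file=("clip.wav", wav, "audio/wav"),
        model="whisper-1",
        response_format="verbose_json",
    )

    # Whisper reports a language name like "dutch"; convert it to an ISO-639-1 code like "nl"
    lang_name = getattr(resp, "language", None)
    lang_code = iso639.Language.from_name(lang_name.title()).part1 if lang_name else None
    await on_language_detected(lang_code, resp.text)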
Hi all, I am using Azure STT for the multilingual part of my LiveKit agent, but I am unable to achieve my goal: it only detects one language from the language list.
What is the latency you are achieving with these approaches?
> What is the latency you are achieving with these approaches?
Not much. Groq's Whisper is one of the fastest STT options available on the market. Alternatively, you can use ElevenLabs' Scribe V2, which is multilingual and works very well.