Allow STT suppression by VAD
Feature Type
I cannot use LiveKit without it
Feature Description
Problem
Some less common STT models (especially non-English ones) are prone to misfire: they may falsely detect words while the user is silent, or recognize words from a different language (in the case of multilingual models). It gets even worse in more challenging audio conditions, such as a bad user microphone or a noisy background.
Solution
The most obvious solution is to use VAD to detect the start of the user's utterance and delegate the recognition itself to STT, since VAD models are trained to be resistant to audio disturbances, which is not always the case for STT models.
But in the current implementation of AudioRecognition, any user input is sent to both STT and VAD simultaneously:
class AudioRecognition:
    ...
    def push_audio(self, frame: rtc.AudioFrame) -> None:
        self._sample_rate = frame.sample_rate
        if self._stt_ch is not None:
            self._stt_ch.send_nowait(frame)
        if self._vad_ch is not None:
            self._vad_ch.send_nowait(frame)
And there is also no direct connection between the vad.VADEventType.END_OF_SPEECH and stt.SpeechEventType.FINAL_TRANSCRIPT events, since AudioRecognition._on_stt_event() and AudioRecognition._on_vad_event() know nothing about each other.
The simplest way to solve this, in my opinion, is to introduce something like a suppress_stt_with_vad parameter on livekit.agents.AgentSession and pass it all the way down to AudioRecognition, which in turn would do something like:
class AudioRecognition:
    def __init__(self, ..., suppress_stt_with_vad: bool = False):
        ...
        self._start_of_speech_received = False
        self._suppress_stt_with_vad = suppress_stt_with_vad

    async def _on_stt_event(self, ev: stt.SpeechEvent) -> None:
        if (ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT
                and self._suppress_stt_with_vad
                and not self._start_of_speech_received):
            logger.warning("STT is misfiring: received final transcript"
                           " without VAD's start of speech")
            self.clear_user_turn()
            return
        elif (ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT
                and self._suppress_stt_with_vad
                and self._start_of_speech_received):
            self._start_of_speech_received = False
        ...

    async def _on_vad_event(self, ev: vad.VADEvent) -> None:
        if ev.type == vad.VADEventType.START_OF_SPEECH:
            self._start_of_speech_received = True
        ...  # the rest of the existing VAD handling stays unchanged
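For illustration, session-level usage could look like the sketch below; suppress_stt_with_vad is the proposed (not yet existing) parameter, and my_streaming_stt is a placeholder for a provider-specific streaming STT:

from livekit.agents import AgentSession
from livekit.plugins import silero

session = AgentSession(
    vad=silero.VAD.load(),
    stt=my_streaming_stt,        # placeholder for the provider-specific streaming STT
    suppress_stt_with_vad=True,  # proposed parameter, forwarded down to AudioRecognition
)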
Workarounds / Alternatives
Right now, since AgentActivity isn't exposed to the public API, we have to create a child of AudioRecognition:
class VADDependingAudioRecognition(AudioRecognition):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self._start_of_speech_received = False

    async def _on_stt_event(self, ev: stt.SpeechEvent) -> None:
        if (ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT
                and not self._start_of_speech_received):
            logger.warning("STT is misfiring: received final transcript"
                           " without VAD's start of speech")
            self.clear_user_turn()
            return
        elif (ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT
                and self._start_of_speech_received):
            self._start_of_speech_received = False
        await super()._on_stt_event(ev)

    async def _on_vad_event(self, ev: vad.VADEvent) -> None:
        if ev.type == vad.VADEventType.START_OF_SPEECH:
            self._start_of_speech_received = True
        await super()._on_vad_event(ev)
And also subclasses of AgentActivity and AgentSession to create the appropriate instances:
class CustomAgentActivity(AgentActivity):
    def __init__(self, agent: Agent, sess: AgentSession) -> None:
        super().__init__(agent=agent, sess=sess)
        self._audio_recognition: VADDependingAudioRecognition | None = None

    async def _start_session(self) -> None:
        ...
        self._audio_recognition = VADDependingAudioRecognition(
            hooks=self,
            stt=self._agent.stt_node if self.stt else None,
            vad=self.vad,
            turn_detector=self.turn_detection if not isinstance(self.turn_detection, str) else None,  # noqa
            min_endpointing_delay=self.min_endpointing_delay,
            max_endpointing_delay=self.max_endpointing_delay,
            turn_detection_mode=self._turn_detection_mode,
        )
        self._audio_recognition.start()
class CustomAgentSession(AgentSession):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)

    async def _update_activity(self, agent: Agent, new_activity: str) -> None:
        async with self._activity_lock:
            ...
            if new_activity == "start":
                self._next_activity = CustomAgentActivity(agent, self)
            ...
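For illustration, a rough sketch of wiring these subclasses into an entrypoint; my_streaming_stt, my_llm, my_tts, and MyAgent are placeholders:

from livekit.agents import JobContext
from livekit.plugins import silero


async def entrypoint(ctx: JobContext) -> None:
    session = CustomAgentSession(
        vad=silero.VAD.load(),
        stt=my_streaming_stt,  # placeholder for the provider-specific streaming STT
        llm=my_llm,            # placeholder
        tts=my_tts,            # placeholder
    )
    await session.start(agent=MyAgent(), room=ctx.room)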
Additional Context
Most of the code snippets are just examples; I'm perfectly aware that some STTs don't have capabilities.STREAMING, or that VAD may not be present in a setup, or that turn_detection may be set to STT, so there are some additional things to consider here.
I'm happy to implement it myself, but CONTRIBUTING.md suggests discussing with the devs first, so I'm waiting for your opinion on it :) Thank you in advance.
Hi, what STT are you using? It would help to get more context on which STT may be prone to misfire. I do see your point about prioritizing VAD, though; that makes sense, especially if you can't opt for another STT because it's the only one that supports a particular language.
I think checking out this related thread may also be helpful; filtering by confidence scores may resolve this. You could use either on_user_turn_completed() or stt_node() to implement this, depending on your use case.
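A rough sketch of that confidence-filtering idea (not part of the LiveKit API; filter_low_confidence and MIN_CONFIDENCE are made-up names, and it assumes the provider actually populates SpeechData.confidence):

from livekit.agents import stt

MIN_CONFIDENCE = 0.4  # assumed threshold, tune per provider


async def filter_low_confidence(events):
    # Drop final transcripts whose best alternative falls below the threshold.
    async for ev in events:
        if (
            ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT
            and ev.alternatives
            and ev.alternatives[0].confidence < MIN_CONFIDENCE
        ):
            continue  # treat as a likely misfire and drop it
        yield ev

A generator like this could wrap the event stream produced inside a custom stt_node override.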
We also have an STT StreamAdapter that combines VAD and STT for transcription, which is designed for STTs that don't support streaming:
from livekit.agents.stt import StreamAdapter
...
stt = StreamAdapter(stt=stt, vad=vad)
...
It works by only sending audio frames to STT when there is a VAD span.
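For completeness, wiring the adapter into a session could look roughly like this (my_batch_stt is a placeholder for a non-streaming STT plugin):

from livekit.agents import AgentSession
from livekit.agents.stt import StreamAdapter
from livekit.plugins import silero

vad = silero.VAD.load()
session = AgentSession(
    vad=vad,
    stt=StreamAdapter(stt=my_batch_stt, vad=vad),  # placeholder non-streaming STT
)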
Thank you for your responses!
The STT I'm using is Yandex's. It is by far the best model for smaller CIS-region languages like Kazakh or Uzbek. I designed a plugin for the biggest local telecom company, and this is the only issue left to resolve.
Regarding your suggestions:
StreamAdapter
Although it is a useful tool, YandexSTT is designed for streaming input and doesn't typically support batch frame processing. There are some tedious workarounds, but since a streaming implementation is always better for performance, I would prefer to keep it streaming.
Confidence averaging
That was one of my first ideas, but unfortunately Yandex doesn't return meaningful confidence scores for speech recognition, so the only way to identify false speech detections is to filter them by VAD events.
I've been using LiveKit for a year now, and it is the best tool I've come across, so I'm rather keen on popularizing it among devs in my region. But most of them will face the same issue, since there's a very short list of STT providers for most of the local languages, and the best one performs like this.
Also, the StreamAdapter suggestion gave me an idea:
Instead of messing with AudioRecognition, we could create another adapter called VADSuppressedSTTAdapter (I am indeed open to name suggestions here :) ), which in turn would monitor both streams for the appropriate events.
Again, I am willing to implement it myself, but I'm not sure which option is the best design-wise.
If you are interested in this functionality, let me know which option you think is best, and I will make a PR in a couple of days.
In that case, yes, you can implement a different adapter, but here is another idea you can try:
- You can track user speech events with the user state change hooks, meaning you can access VAD spans without monitoring streams;
- You can also track STT events inside stt_node; based on the docs, you also know each event's corresponding timestamps.
Then you can do something like this (in pseudocode here for simplicity):
# track and accumulate VAD spans in a deque:
@session.on("user_state_changed")
def _on_user_state_change(ev):
    if ev.new_state == "speaking":
        agent.vad_spans.append(ev.created_at)
    elif ev.old_state == "speaking":
        agent.vad_spans[-1] = (agent.vad_spans[-1], ev.created_at)
...

# in the agent's stt node:
async for each event:
    while self.vad_spans:
        last_speech_start, last_speech_end = self.vad_spans.popleft()
        if event.start_time < last_speech_start:
            # event started before the oldest remaining span: ignore it
            self.vad_spans.appendleft((last_speech_start, last_speech_end))
            break
        if event.start_time > last_speech_end:
            continue  # this span is already over, try the next one
        yield event
        self.vad_spans.appendleft((last_speech_start, last_speech_end))
        break
    else:
        # no VAD span covers this event: ignore it
        pass
This works without implementing anything heavy, assuming the timestamp information is accurate enough.
I can definitely see how a simpler solution might be helpful. But with streaming services, we normally expect them to improve over time, so we don't create workarounds that are going to be obsolete when they do.
With that being said, if you do want to contribute, one direction I can think of is to make the existing StreamAdapter work with streaming-only STTs:
- You can check if the STT supports batch recognition by calling stt._recognize_impl, which will throw a NotImplementedError if it doesn't.
- Start the STT stream and only push_frame when VAD is triggered (see the sketch below).
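A minimal sketch of that second point, assuming the existing VADStream/SpeechStream push_frame interfaces; gated_forward and audio_frames are made-up names, and a real implementation would also need to handle flushing and stream shutdown:

import asyncio

from livekit.agents import stt, vad


async def gated_forward(
    audio_frames,              # async iterable of rtc.AudioFrame (hypothetical source)
    vad_obj: vad.VAD,
    stt_stream: stt.SpeechStream,
) -> None:
    vad_stream = vad_obj.stream()
    speaking = False

    async def watch_vad() -> None:
        nonlocal speaking
        async for ev in vad_stream:
            if ev.type == vad.VADEventType.START_OF_SPEECH:
                speaking = True
            elif ev.type == vad.VADEventType.END_OF_SPEECH:
                speaking = False

    watcher = asyncio.create_task(watch_vad())
    try:
        async for frame in audio_frames:
            vad_stream.push_frame(frame)      # VAD always sees the audio
            if speaking:
                stt_stream.push_frame(frame)  # STT only sees speech segments
    finally:
        watcher.cancel()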