
Allow stt suppression by vad

preciselyV opened this issue 2 months ago • 4 comments

Feature Type

I cannot use LiveKit without it

Feature Description

Problem

Some less common STT models (especially non-English ones) are prone to misfire: they may falsely detect words while the user is silent, or recognize words from a different language (in the case of multi-language models). It gets even worse in more challenging audio conditions, such as a bad user microphone or a noisy background.

Solution

The most obvious solution is to use VAD to detect the start of the user's utterance and delegate the recognition itself to STT, since VAD models are trained to be resistant to audio disturbances, while with STTs that's not always the case.

But in the current implementation of AudioRecognition, any user input is sent to both STT and VAD simultaneously:

class AudioRecognition:
    ...
    def push_audio(self, frame: rtc.AudioFrame) -> None:
        self._sample_rate = frame.sample_rate
        if self._stt_ch is not None:
            self._stt_ch.send_nowait(frame)

        if self._vad_ch is not None:
            self._vad_ch.send_nowait(frame)

And there is also no direct connection between vad.VADEventType.END_OF_SPEECH and stt.SpeechEventType.FINAL_TRANSCRIPT events, since AudioRecognition._on_stt_event() and AudioRecognition._on_vad_event() know nothing about each other.

The simplest way to solve this, in my opinion, is to introduce something like a suppress_stt_with_vad parameter on livekit.agents.AgentSession and pass it all the way down to AudioRecognition, which in turn would do something like:

class AudioRecognition:
    def __init__(self, ..., suppress_stt_with_vad: bool = False):
        self._start_of_speech_received = False
        self._suppress_stt_with_vad = suppress_stt_with_vad

    async def _on_stt_event(self, ev: stt.SpeechEvent) -> None:
        if (ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT and
                self._suppress_stt_with_vad and
                not self._start_of_speech_received):
            logger.warning("STT is misfiring: received final transcript"
                           " without VAD's start of speech")
            self.clear_user_turn()
            return
        elif (ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT and
              self._suppress_stt_with_vad and
              self._start_of_speech_received):
            self._start_of_speech_received = False
        ...

    async def _on_vad_event(self, ev: vad.VADEvent) -> None:
        if ev.type == vad.VADEventType.START_OF_SPEECH:
            self._start_of_speech_received = True
        ...  # continue with the existing VAD handling
...
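
For illustration, turning it on from user code could then look something like this (the suppress_stt_with_vad flag and its plumbing through AgentSession are hypothetical, nothing like it exists in the current API; silero and the other placeholders are just example components):

session = AgentSession(
    vad=silero.VAD.load(),
    stt=my_stt,   # any streaming STT plugin
    llm=my_llm,
    tts=my_tts,
    suppress_stt_with_vad=True,  # hypothetical: drop final transcripts that
                                 # arrive without a VAD START_OF_SPEECH
)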

Workarounds / Alternatives

Right now, since

# AgentActivity isn't exposed to the public API

we have to create a subclass of AudioRecognition:

class VADDependingAudioRecognition(AudioRecognition):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self._start_of_speech_received = False

    async def _on_stt_event(self, ev: stt.SpeechEvent) -> None:
        if (ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT and
                not self._start_of_speech_received):
            logger.warning("STT is misfiring: received final transcript"
                           " without VAD's start of speech")
            self.clear_user_turn()
            return
        elif (ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT and
              self._start_of_speech_received):
            self._start_of_speech_received = False
        await super()._on_stt_event(ev)

    async def _on_vad_event(self, ev: vad.VADEvent) -> None:
        if ev.type == vad.VADEventType.START_OF_SPEECH:
            self._start_of_speech_received = True
        await super()._on_vad_event(ev)

We also need subclasses of AgentActivity and AgentSession to create the appropriate instances:

class CustomAgentActivity(AgentActivity):
    def __init__(self, agent: Agent, sess: AgentSession) -> None:
        super().__init__(agent=agent, sess=sess)
        self._audio_recognition: VADDependingAudioRecognition | None = None

    async def _start_session(self) -> None:
        ...
        self._audio_recognition = VADDependingAudioRecognition(
            hooks=self,
            stt=self._agent.stt_node if self.stt else None,
            vad=self.vad,
            turn_detector=self.turn_detection if not isinstance(self.turn_detection, str) else None,  # noqa
            min_endpointing_delay=self.min_endpointing_delay,
            max_endpointing_delay=self.max_endpointing_delay,
            turn_detection_mode=self._turn_detection_mode,
        )
        self._audio_recognition.start()


class CustomAgentSession(AgentSession):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)

    async def _update_activity(self, ...):
        async with self._activity_lock:
            ...
            if new_activity == "start":
                self._next_activity = CustomAgentActivity(agent, self)
            ...

Additional Context

Most of the code snippets are just examples; I'm perfectly aware that some STTs don't have capabilities.STREAMING, that VAD may not be present in a setup, or that turn_detection may be set to STT, so there are some additional things to consider here.

I'm happy to implement it myself, but CONTRIBUTING.md suggests discussing with the devs first, so I'm waiting for your opinion on it :) thank you in advance

preciselyV · Nov 13 '25 07:11

Hi, what STT are you using? It would help to get more context on which STTs may be prone to misfire. I do see your point about prioritizing VAD, though; that makes sense, especially if you can't opt for another STT because it's the only one that supports a particular language.

I think checking out this related thread may also be helpful; filtering by confidence scores may resolve this. You could use either on_user_turn_completed() or stt_node() to implement it, depending on your use case.
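
As a rough illustration of the stt_node() route, here is a minimal sketch (FilteringAgent and the 0.5 threshold are made-up example values, and this only helps if the STT plugin actually populates SpeechData.confidence):

from livekit.agents import Agent, stt


class FilteringAgent(Agent):
    async def stt_node(self, audio, model_settings):
        async for event in Agent.default.stt_node(self, audio, model_settings):
            if (event.type == stt.SpeechEventType.FINAL_TRANSCRIPT
                    and event.alternatives
                    and event.alternatives[0].confidence < 0.5):
                # treat low-confidence finals as misfires and drop them
                continue
            yield event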

tinalenguyen · Nov 13 '25 17:11

We also have an STT StreamAdapter that combines VAD and STT for transcription, which is designed for STTs that don't support streaming:

from livekit.agents.stt import StreamAdapter
...
stt = StreamAdapter(stt=stt, vad=vad)
...

It works by only sending audio frames to STT when there is a VAD span.
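
For context, wiring the adapter into a session could look roughly like this (openai and silero are just placeholder plugins here; any STT/VAD pair works the same way):

from livekit.agents import AgentSession
from livekit.agents.stt import StreamAdapter
from livekit.plugins import openai, silero

vad = silero.VAD.load()
stt = StreamAdapter(stt=openai.STT(), vad=vad)

session = AgentSession(
    vad=vad,
    stt=stt,
    # llm=..., tts=..., and the rest of the session config
)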

chenghao-mou · Nov 13 '25 17:11

Thank you for your responses!

The STT I'm using is Yandex's. It is by far the best model for smaller CIS-region languages like Kazakh or Uzbek. I designed a plugin for the biggest local telecom company, and this is the only issue left to resolve.

Regarding your suggestions:

StreamAdapter

Although it is a useful tool, YandexSTT is designed for streaming input and doesn't typically support batch frame processing. There are some tedious workarounds, but since a streaming implementation is always better for performance, I would prefer to keep it streaming.

Confidence averaging

That was one of my first ideas, but unfortunately Yandex doesn't return meaningful confidence scores for speech recognition, so the only way to identify false speech detections is to filter them by VAD's events.

I've been using LiveKit for a year now, and it is the best tool I've come across, so I'm rather keen on popularizing it among devs in my region. But most of them will face the same issue, since there's a very short list of STT providers for most of the local languages, and the best one performs like this.

Also, the StreamAdapter suggestion gave me an idea: instead of messing with AudioRecognition, we could create another adapter, say VADSuppressedSTTAdapter (I am indeed open to name suggestions here :) ), which would monitor both streams for the appropriate events; a rough skeleton follows.
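
Something along these lines, purely as a skeleton of the idea (nothing here exists in the SDK, and the stream internals are elided):

from livekit.agents import stt, vad


class VADSuppressedSTTAdapter(stt.STT):
    def __init__(self, *, wrapped: stt.STT, vad: vad.VAD) -> None:
        super().__init__(capabilities=wrapped.capabilities)
        self._wrapped = wrapped
        self._vad = vad

    def stream(self, **kwargs):
        # would return a stream that runs a VAD stream alongside the wrapped
        # STT stream and only emits FINAL_TRANSCRIPT events that overlap a
        # VAD speech span, discarding everything else as a misfire
        ...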

Again, I am willing to implement it myself, but I'm not sure which option is best design-wise.

If you are interested in this functionality, let me know your opinion on the best option, and I will make a PR in a couple of days.

preciselyV · Nov 14 '25 07:11

In that case, yes, you can implement a different adapter, but here is another idea you can try:

  • You can track user speech events with the user state change hooks, meaning you can access VAD spans without monitoring streams;
  • You can also track STT events inside stt_node. Based on the docs, you also know each event's corresponding timestamps;

Then you can do something like this (pseudocode for simplicity):

# track and accumulate VAD spans in a deque:
@session.on("user_state_changed")
def _on_user_state_changed(ev):
    if ev.new_state == "speaking":
        agent.vad_spans.append(ev.created_at)
    elif ev.old_state == "speaking":
        agent.vad_spans[-1] = (agent.vad_spans[-1], ev.created_at)
...
# in the agent's stt_node:
async for event in ...:
    while self.vad_spans:
        last_speech_start, last_speech_end = self.vad_spans.popleft()
        if event.start_time < last_speech_start:
            # event started before the oldest known VAD span:
            # keep the span, drop the event
            self.vad_spans.appendleft((last_speech_start, last_speech_end))
            break
        if event.start_time > last_speech_end:
            # this span ended before the event started: discard it and
            # check the next one
            continue
        # the event falls inside a VAD span: emit it and keep the span
        yield event
        self.vad_spans.appendleft((last_speech_start, last_speech_end))
        break
    else:
        # no VAD spans to match against: ignore this event
        pass

This works without implementing anything heavy, assuming the events' timestamp information is accurate enough.

I can definitely see how a simpler solution might be helpful. But with streaming services, we normally expect them to improve over time, so we don't create workarounds that are going to be obsolete when they do.

With that being said, if you do want to contribute, one direction I can think of is to make the existing StreamAdapter work with streaming-only STTs:

  • You can check whether the STT supports batch recognition by calling stt._recognize_impl, which will raise a NotImplementedError if it doesn't.
  • Start the STT stream and only push_frame when VAD is triggered (sketched below).
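
A minimal sketch of that second point, assuming the wrapper already holds a VAD stream and the wrapped STT's stream (the class and attribute names, and flushing on end of speech, are illustrative, not the actual StreamAdapter internals):

from livekit import rtc
from livekit.agents import stt, vad


class VADGatedSTTStream:
    def __init__(self, stt_stream: stt.SpeechStream, vad_stream: vad.VADStream) -> None:
        self._stt_stream = stt_stream
        self._vad_stream = vad_stream
        self._in_speech = False

    async def run_vad(self) -> None:
        # background task: watch VAD events and toggle the gate
        async for ev in self._vad_stream:
            if ev.type == vad.VADEventType.START_OF_SPEECH:
                self._in_speech = True
            elif ev.type == vad.VADEventType.END_OF_SPEECH:
                self._in_speech = False
                self._stt_stream.flush()  # close out the current utterance

    def push_frame(self, frame: rtc.AudioFrame) -> None:
        self._vad_stream.push_frame(frame)      # VAD always sees the audio
        if self._in_speech:
            self._stt_stream.push_frame(frame)  # STT only inside speech spans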

chenghao-mou · Nov 14 '25 09:11