Bug: Soniox STT Plugin Missing END_OF_SPEECH Events and RecognitionUsage Metrics

Open karimmalhas opened this issue 2 months ago • 1 comments

Bug Description

Description

The Soniox STT plugin has two critical bugs that prevent it from working properly with LiveKit's turn detection and usage tracking:

  1. Missing END_OF_SPEECH events: The plugin receives the <end> token from Soniox but never emits SpeechEventType.END_OF_SPEECH, so LiveKit sessions can't close turns properly even though Soniox has finished the utterance.

  2. Missing RecognitionUsage metrics: The plugin doesn't emit RecognitionUsage events, making it impossible to track audio duration for billing and analytics.
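
To make the first point concrete, here is a minimal, self-contained sketch of the mapping the plugin would need. It is not the plugin's code: `EventType`, `Event`, and `events_from_tokens` are hypothetical stand-ins for `stt.SpeechEventType` and `stt.SpeechEvent`, and the token shape (a dict with a `text` key) is an assumption about the Soniox response.

```python
from dataclasses import dataclass
from enum import Enum, auto

END_TOKEN = "<end>"  # endpoint marker described in this issue


class EventType(Enum):
    # Hypothetical stand-in for stt.SpeechEventType
    FINAL_TRANSCRIPT = auto()
    END_OF_SPEECH = auto()


@dataclass
class Event:
    # Hypothetical stand-in for stt.SpeechEvent
    type: EventType
    text: str = ""


def events_from_tokens(tokens: list[dict]) -> list[Event]:
    """Turn a token list into events, closing the turn on each <end> token."""
    events: list[Event] = []
    buffered: list[str] = []
    for token in tokens:
        if token["text"] == END_TOKEN:
            # Flush any buffered text as the final transcript first...
            if buffered:
                events.append(Event(EventType.FINAL_TRANSCRIPT, "".join(buffered)))
                buffered = []
            # ...then signal that the utterance is over.
            events.append(Event(EventType.END_OF_SPEECH))
        else:
            buffered.append(token["text"])
    # Text that has not been followed by <end> yet produces no events here.
    return events


demo = events_from_tokens([{"text": "Hello"}, {"text": " world"}, {"text": END_TOKEN}])
```

The key point is the second `append`: today the final transcript is produced, but nothing equivalent to the `END_OF_SPEECH` event follows it.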

Impact

  • Applications cannot use turn_detection_mode="stt" with Soniox
  • No way to track audio usage for billing/analytics
  • Users miss out on Soniox's excellent semantic endpoint detection

Environment

  • Running in production on Breez voice agent platform
  • Tested with LiveKit sessions using Soniox STT
  • A wrapper around the plugin, running in production, confirms that the fix works

Additional Context

This was identified in collaboration with Klemen (Soniox CEO) and Bojan from the Soniox team. The fix follows the same pattern as other STT plugins (e.g. Deepgram) and is fully backwards compatible.

Happy to provide more details or submit a PR if helpful!

Expected Behavior

  1. When the plugin emits a FINAL_TRANSCRIPT in response to Soniox's <end> token, it should also emit a SpeechEventType.END_OF_SPEECH event
  2. The plugin should track audio frame durations and periodically emit RecognitionUsage events (similar to the Deepgram plugin)
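
As a rough illustration of the second expectation, the sketch below accumulates per-frame durations (`samples_per_channel / sample_rate`) and reports the total once an interval has elapsed. `DurationCollector` and `frame_duration` are hypothetical helpers written for this example, not the Deepgram plugin's collector, and the injectable `clock` exists only so the demo does not depend on wall time.

```python
import time
from typing import Callable


def frame_duration(samples_per_channel: int, sample_rate: int) -> float:
    """Length of one audio frame in seconds."""
    return samples_per_channel / sample_rate


class DurationCollector:
    """Hypothetical helper: sums audio seconds and reports them periodically."""

    def __init__(
        self,
        callback: Callable[[float], None],
        *,
        interval: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._callback = callback
        self._interval = interval
        self._clock = clock
        self._total = 0.0
        self._last_flush = clock()

    def push(self, seconds: float) -> None:
        self._total += seconds
        # Report once the interval has elapsed since the previous report.
        if self._clock() - self._last_flush >= self._interval:
            self.flush()

    def flush(self) -> None:
        # Report whatever has accumulated, then start a new window.
        if self._total > 0:
            self._callback(self._total)
            self._total = 0.0
        self._last_flush = self._clock()


# Demo with a fake clock so the 5-second interval can be stepped manually.
now = [0.0]
reports: list[float] = []
collector = DurationCollector(reports.append, interval=5.0, clock=lambda: now[0])
for _ in range(3):
    collector.push(frame_duration(480, 48000))  # three 10 ms frames
now[0] = 5.0
collector.push(frame_duration(480, 48000))  # interval elapsed: first report
collector.push(frame_duration(480, 48000))
collector.flush()  # final report, as would happen when the stream closes
```

In the plugin, the callback would wrap each reported value in a `RecognitionUsage` event rather than appending it to a list.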

Reproduction Steps

## Reproduction

1. Create a LiveKit session with Soniox STT
2. Set `turn_detection_mode="stt"`
3. Observe that turns never close even though Soniox sends the `<end>` token
4. Try to capture RecognitionUsage metrics - none are emitted
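
One way to check steps 3 and 4 is to log the type of every event the STT stream emits and then inspect the log. The helper below is a small sketch for that: `summarize` is a hypothetical function, and the event type strings are assumed names that mirror `SpeechEventType` members, so adjust them to whatever your logging actually records.

```python
from collections import Counter


def summarize(event_types: list[str]) -> tuple[Counter, list[int]]:
    """Count event types and find final transcripts that never get closed.

    Returns the counts plus the indices of every "final_transcript" that is
    not followed by an "end_of_speech" before the next final transcript
    (or before the end of the log).
    """
    counts = Counter(event_types)
    unclosed: list[int] = []
    pending = None  # index of the last final transcript still waiting to be closed
    for index, event_type in enumerate(event_types):
        if event_type == "final_transcript":
            if pending is not None:
                unclosed.append(pending)
            pending = index
        elif event_type == "end_of_speech":
            pending = None
    if pending is not None:
        unclosed.append(pending)
    return counts, unclosed


# Shape of the log described in this report: finals arrive, nothing closes them.
broken = ["interim_transcript", "final_transcript", "interim_transcript", "final_transcript"]
broken_counts, broken_unclosed = summarize(broken)

# Shape of the log expected after the fix.
fixed = ["interim_transcript", "final_transcript", "end_of_speech", "recognition_usage"]
fixed_counts, fixed_unclosed = summarize(fixed)
```

With the current plugin, a real log should look like `broken`: no `end_of_speech` or `recognition_usage` entries, and every final transcript listed as unclosed.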

Operating System

docker

Models Used

soniox-tts

Package Versions

livekit-plugins-soniox==1.3.1

Session/Room/Call IDs

No response

Proposed Solution

## Proposed Fix

We've been running a wrapper in production that fixes both issues. The implementation is straightforward:

### Fix 1: Emit END_OF_SPEECH after FINAL_TRANSCRIPT


```python
# In the event forwarding loop (_run method or similar)
async for event in self._inner_stream:
    self._event_ch.send_nowait(event)

    # Add this: close the turn right after each final transcript
    if event.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
        try:
            self._event_ch.send_nowait(
                stt.SpeechEvent(type=stt.SpeechEventType.END_OF_SPEECH)
            )
        except Exception as e:
            logger.warning("Failed to emit END_OF_SPEECH: %s", e)
```


### Fix 2: Track and Emit RecognitionUsage Metrics


```python
# Import
from livekit.plugins.deepgram._utils import PeriodicCollector

# In __init__
self._audio_duration_collector = PeriodicCollector(
    callback=self._on_audio_duration_report,
    duration=5.0,  # Report every 5 seconds
)

# Add callback method
def _on_audio_duration_report(self, duration: float) -> None:
    usage = stt.RecognitionUsage(audio_duration=duration)
    self._event_ch.send_nowait(
        stt.SpeechEvent(
            type=stt.SpeechEventType.RECOGNITION_USAGE,
            recognition_usage=usage,
        )
    )

# In push_frame
def push_frame(self, frame: rtc.AudioFrame) -> None:
    # Track duration
    frame_duration = frame.samples_per_channel / frame.sample_rate
    self._audio_duration_collector.push(frame_duration)
    # ... rest of push_frame logic

# In aclose
async def aclose(self) -> None:
    # Emit final metrics
    final_duration = self._audio_duration_collector.get_total()
    if final_duration > 0:
        self._on_audio_duration_report(final_duration)

    await self._audio_duration_collector.aclose()
    # ... rest of cleanup
```

Additional Context

No response

Screenshots and Recordings

No response

karimmalhas · Nov 20 '25 19:11

Hi, we are facing this issue as well. It basically gives the impression that the agent is not hearing the user and is stuck.

With the Soniox v3 model and plugin, we also notice that the agent sometimes sounds like it is hearing itself while talking, and on occasion it interrupts itself. We are curious whether certain events sent by Soniox cause this, or whether it is related to this issue. Initially we thought it was a noise cancellation problem, but after switching to nova-3-general from Deepgram the issue was gone. Not sure if it's related to this.

In our use case, we are using:

  • LiveKit End of Turn MultiLingual model
  • Silero VAD from the LiveKit plugins, passed through to Soniox

So, from what I understand, this issue mainly affects setups that use Soniox's end-of-turn capabilities with STT.

vvv-001 · Nov 23 '25 20:11

Hi, we are experiencing the same issue. In our setup, this means a latency increase of about 0.5s, because the agent has to wait for silence via VAD, and this has a big impact. It would be absolutely amazing if we could use the end marks directly as interpreted by Soniox.

I see that the interface detects the ends fine; it's just not passed on correctly to the rest of the stack.

I hope this bug gets fixed soon. Thank you.

Max-Lumnar · Dec 02 '25 16:12