Bug Description

Bug Report: Soniox STT Plugin Missing END_OF_SPEECH Events and RecognitionUsage Metrics

Description

The Soniox STT plugin has two critical bugs that prevent it from working properly with LiveKit's turn detection and usage tracking:

1. Missing END_OF_SPEECH events: The plugin receives the `<end>` token from Soniox but never emits SpeechEventType.END_OF_SPEECH, so LiveKit sessions can't close turns properly even though Soniox has finished the utterance.
2. Missing RecognitionUsage metrics: The plugin doesn't emit RecognitionUsage events, making it impossible to track audio duration for billing and analytics.

Impact

Applications cannot use turn_detection_mode="stt" with Soniox
No way to track audio usage for billing/analytics
Users miss out on Soniox's excellent semantic endpoint detection

Environment

Running in production on Breez voice agent platform
Tested with LiveKit sessions using Soniox STT
Wrapper code proves the fix works correctly

Additional Context

This was identified in collaboration with Klemen (Soniox CEO) and Bojan from the Soniox team. The fix follows the same pattern as other STT plugins (e.g., Deepgram) and is fully backwards compatible.

Happy to provide more details or submit a PR if helpful!

Expected Behavior

1. When Soniox sends the `<end>` token and the plugin produces a FINAL_TRANSCRIPT, the plugin should also emit a SpeechEventType.END_OF_SPEECH event.
2. The plugin should track audio frame durations and periodically emit RecognitionUsage events (similar to the Deepgram plugin).

Reproduction Steps

## Reproduction

1. Create a LiveKit session with Soniox STT
2. Set `turn_detection_mode="stt"`
3. Observe that turns never close even though Soniox sends the `<end>` token
4. Try to capture `RecognitionUsage` metrics; none are emitted
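
One way to confirm both symptoms is to record the event types a stream emits and check them afterwards. The sketch below is self-contained: `EventType` is a stand-in enum that only mirrors the `SpeechEventType` names mentioned in this report, and `check_events` is a hypothetical helper, not part of LiveKit or the Soniox plugin.

```python
from enum import Enum, auto


class EventType(Enum):
    # Stand-in for stt.SpeechEventType; only the members used in this report.
    INTERIM_TRANSCRIPT = auto()
    FINAL_TRANSCRIPT = auto()
    END_OF_SPEECH = auto()
    RECOGNITION_USAGE = auto()


def check_events(events: list[EventType]) -> dict[str, int]:
    """Count finals not directly followed by END_OF_SPEECH, and usage events."""
    unclosed = 0
    for i, ev in enumerate(events):
        if ev is EventType.FINAL_TRANSCRIPT:
            nxt = events[i + 1] if i + 1 < len(events) else None
            if nxt is not EventType.END_OF_SPEECH:
                unclosed += 1
    usage = sum(1 for ev in events if ev is EventType.RECOGNITION_USAGE)
    return {"unclosed_finals": unclosed, "usage_events": usage}


# Sequence as described in this report: the final arrives, nothing follows.
buggy = [EventType.INTERIM_TRANSCRIPT, EventType.FINAL_TRANSCRIPT]

# Sequence expected once both fixes are in place.
fixed = [
    EventType.INTERIM_TRANSCRIPT,
    EventType.FINAL_TRANSCRIPT,
    EventType.END_OF_SPEECH,
    EventType.RECOGNITION_USAGE,
]

print(check_events(buggy))
print(check_events(fixed))
```

In a real session the list would be filled from the stream's events instead of being written by hand.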

Operating System

docker

Models Used

soniox-tts

Package Versions

livekit-plugins-soniox==1.3.1

Session/Room/Call IDs

No response

Proposed Solution

## Proposed Fix

We've been running a wrapper in production that fixes both issues. The implementation is straightforward:

### Fix 1: Emit END_OF_SPEECH after FINAL_TRANSCRIPT


```python
# In the event forwarding loop (_run method or similar)
async for event in self._inner_stream:
    self._event_ch.send_nowait(event)

    # Add this: close the turn right after the final transcript is forwarded
    if event.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
        try:
            self._event_ch.send_nowait(
                stt.SpeechEvent(type=stt.SpeechEventType.END_OF_SPEECH)
            )
        except Exception as e:
            logger.warning("Failed to emit END_OF_SPEECH: %s", e)
```
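
The forwarding pattern can be tried without LiveKit installed. This is a minimal simulation under stated assumptions: `EventType` and `Event` are stand-ins for `stt.SpeechEventType` and `stt.SpeechEvent`, `inner_stream` fakes one utterance, and an `asyncio.Queue` plays the role of the event channel. None of these names come from the plugin itself.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum, auto
from typing import AsyncIterator


class EventType(Enum):
    # Stand-in for stt.SpeechEventType.
    INTERIM_TRANSCRIPT = auto()
    FINAL_TRANSCRIPT = auto()
    END_OF_SPEECH = auto()


@dataclass
class Event:
    # Stand-in for stt.SpeechEvent.
    type: EventType
    text: str = ""


async def inner_stream() -> AsyncIterator[Event]:
    # Simulates what the plugin currently produces for one utterance.
    yield Event(EventType.INTERIM_TRANSCRIPT, "hello")
    yield Event(EventType.FINAL_TRANSCRIPT, "hello world")


async def forward(source: AsyncIterator[Event], out: asyncio.Queue) -> None:
    # Forward every event, then close the turn after each final transcript.
    async for event in source:
        out.put_nowait(event)
        if event.type is EventType.FINAL_TRANSCRIPT:
            out.put_nowait(Event(EventType.END_OF_SPEECH))


async def collect() -> list[EventType]:
    out: asyncio.Queue = asyncio.Queue()
    await forward(inner_stream(), out)
    # Drain the queue and keep only the event types, in emission order.
    return [out.get_nowait().type for _ in range(out.qsize())]


emitted = asyncio.run(collect())
print([e.name for e in emitted])
```

The synthetic END_OF_SPEECH is queued immediately after the final transcript, so a consumer never sees a final without its closing event.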


### Fix 2: Track and Emit RecognitionUsage Metrics


```python
# Imports
from livekit import rtc
from livekit.agents import stt
from livekit.plugins.deepgram._utils import PeriodicCollector  # private module, may move

# In __init__
self._audio_duration_collector = PeriodicCollector(
    callback=self._on_audio_duration_report,
    duration=5.0,  # Report every 5 seconds
)

# Add callback method
def _on_audio_duration_report(self, duration: float) -> None:
    usage = stt.RecognitionUsage(audio_duration=duration)
    self._event_ch.send_nowait(
        stt.SpeechEvent(
            type=stt.SpeechEventType.RECOGNITION_USAGE,
            recognition_usage=usage,
        )
    )

# In push_frame
def push_frame(self, frame: rtc.AudioFrame) -> None:
    # Track duration in seconds
    frame_duration = frame.samples_per_channel / frame.sample_rate
    self._audio_duration_collector.push(frame_duration)
    # ... rest of push_frame logic

# In aclose
async def aclose(self) -> None:
    # Emit final metrics: flush() reports any duration not yet sent.
    # Assumption: the collector exposes flush(), as the Deepgram helper appears to.
    self._audio_duration_collector.flush()
    # ... rest of cleanup
```
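
Because the import above reaches into another plugin's private module, it may help to see how little the collector has to do. The following is a hedged, self-contained sketch of an accumulate-and-report helper. `DurationCollector`, `interval`, and `clock` are hypothetical names chosen for illustration; this is not the Deepgram plugin's implementation. A fake clock keeps the example deterministic.

```python
import time
from typing import Callable, Optional


class DurationCollector:
    """Hypothetical stand-in: sums pushed durations and reports them periodically."""

    def __init__(
        self,
        callback: Callable[[float], None],
        *,
        interval: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._callback = callback
        self._interval = interval
        self._clock = clock
        self._last_flush = clock()
        self._total: Optional[float] = None

    def push(self, duration: float) -> None:
        # Accumulate, and report once the interval has elapsed.
        self._total = duration if self._total is None else self._total + duration
        if self._clock() - self._last_flush >= self._interval:
            self.flush()

    def flush(self) -> None:
        # Report whatever has accumulated, then reset the running total.
        if self._total is not None:
            self._callback(self._total)
            self._total = None
        self._last_flush = self._clock()


reports: list[float] = []
now = [0.0]  # fake clock, advanced by hand below
collector = DurationCollector(reports.append, interval=5.0, clock=lambda: now[0])

frame_duration = 480 / 48000  # samples_per_channel / sample_rate = 0.01 s
for _ in range(600):  # 600 frames of 10 ms, about 6 s of audio
    now[0] += frame_duration
    collector.push(frame_duration)
collector.flush()  # on close: report the remainder

print(len(reports), round(sum(reports), 6))
```

Flushing on close matters for short calls: without it, any audio pushed since the last periodic report would never be counted.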

Additional Context

No response

Screenshots and Recordings

No response

Nov 20 '25 19:11 karimmalhas

Hi, we are facing this issue as well. It gives the impression that the agent is not hearing the user and is stuck.

With the Soniox v3 model and plugin, we also notice that the agent sometimes sounds like it is hearing itself while talking. We are curious whether Soniox is sending certain events that cause this, or whether it is related to this issue. On occasion the agent interrupts itself. Initially we thought it was a noise cancellation problem, but after switching to Deepgram's nova-3-general the problem went away, so we are not sure whether it is related.

In our setup we use the LiveKit End of Turn multilingual model, plus Silero VAD from the LiveKit plugins, which we pass through to Soniox.

So, from what I understand, this issue mainly matters if you use Soniox's end-of-turn capabilities with STT.

Nov 23 '25 20:11 vvv-001

Hi, we are experiencing the same issue. In our setup it adds about 0.5 s of latency, because the agent has to wait for silence via VAD, and this has a big impact. It would be great if we could use the end markers directly as interpreted by Soniox.

I see that the interface detects the ends fine; they are just not passed on correctly to the rest of the stack.

I hope this bug gets fixed soon. Thank you.

Dec 02 '25 16:12 Max-Lumnar

Bug: Soniox STT Plugin Missing END_OF_SPEECH Events and RecognitionUsage Metrics
