Bug: Soniox STT Plugin Missing END_OF_SPEECH Events and RecognitionUsage Metrics
Bug Description
Bug Report: Soniox STT Plugin Missing END_OF_SPEECH Events and RecognitionUsage Metrics
Description
The Soniox STT plugin has two critical bugs that prevent it from working properly with LiveKit's turn detection and usage tracking:
- Missing END_OF_SPEECH events: The plugin receives the `<end>` token from Soniox but never emits `SpeechEventType.END_OF_SPEECH`, so LiveKit sessions can't close turns properly even though Soniox has finished the utterance.
- Missing RecognitionUsage metrics: The plugin doesn't emit `RecognitionUsage` events, making it impossible to track audio duration for billing and analytics.
Impact
- Applications cannot use `turn_detection_mode="stt"` with Soniox
- No way to track audio usage for billing/analytics
- Users miss out on Soniox's excellent semantic endpoint detection
Environment
- Running in production on Breez voice agent platform
- Tested with LiveKit sessions using Soniox STT
- Wrapper code proves the fix works correctly
Additional Context
This was identified in collaboration with Klemen (Soniox CEO) and Bojan from the Soniox team. The fix follows the same pattern as other STT plugins (e.g., Deepgram) and is fully backwards compatible.
Happy to provide more details or submit a PR if helpful!
Expected Behavior
- When Soniox sends a FINAL_TRANSCRIPT (with `<end>` token), the plugin should emit a `SpeechEventType.END_OF_SPEECH` event
- The plugin should track audio frame durations and periodically emit `RecognitionUsage` events (similar to Deepgram plugin)
Reproduction Steps
## Reproduction
1. Create a LiveKit session with Soniox STT
2. Set `turn_detection_mode="stt"`
3. Observe that turns never close even though Soniox sends `<end>` token
4. Try to capture RecognitionUsage metrics - none are emitted
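To make step 3 easier to check from logs, here is a minimal sketch of a helper that scans a captured sequence of event type names and reports every FINAL_TRANSCRIPT that is not immediately followed by END_OF_SPEECH. The function name `unclosed_finals`, the string event names, and the sample capture are all hypothetical and only illustrate the symptom; they are not part of the plugin.

```python
from typing import Iterable, List


def unclosed_finals(event_types: Iterable[str]) -> List[int]:
    """Return indices of FINAL_TRANSCRIPT entries with no END_OF_SPEECH right after."""
    events = list(event_types)
    missing = []
    for i, name in enumerate(events):
        if name != "FINAL_TRANSCRIPT":
            continue
        following = events[i + 1] if i + 1 < len(events) else None
        if following != "END_OF_SPEECH":
            missing.append(i)
    return missing


# Hypothetical capture from an affected session: finals arrive, turns never close.
observed = [
    "INTERIM_TRANSCRIPT",
    "FINAL_TRANSCRIPT",
    "INTERIM_TRANSCRIPT",
    "FINAL_TRANSCRIPT",
]
print(unclosed_finals(observed))  # [1, 3]
```

An empty list would mean every final transcript was followed by an END_OF_SPEECH event.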
Operating System
docker
Models Used
soniox-tts
Package Versions
livekit-plugins-soniox==1.3.1
Session/Room/Call IDs
No response
Proposed Solution
## Proposed Fix
We've been running a wrapper in production that fixes both issues. The implementation is straightforward:
### Fix 1: Emit END_OF_SPEECH after FINAL_TRANSCRIPT
    # In the event forwarding loop (_run method or similar)
    async for event in self._inner_stream:
        self._event_ch.send_nowait(event)
        # Add this:
        if event.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
            try:
                self._event_ch.send_nowait(
                    stt.SpeechEvent(type=stt.SpeechEventType.END_OF_SPEECH)
                )
            except Exception as e:
                logger.warning(f"Failed to emit END_OF_SPEECH: {e}")
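For anyone who wants to try the forwarding pattern outside LiveKit, here is a self-contained sketch of the same idea. `EventType` and `Event` are stand-ins for `stt.SpeechEventType` and `stt.SpeechEvent`, and `fake_soniox_stream` is a made-up source, so treat this as an illustration of the pattern rather than the plugin's actual code.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum, auto
from typing import AsyncIterator, List


class EventType(Enum):
    """Stand-in for stt.SpeechEventType (assumption, not the real enum)."""
    INTERIM_TRANSCRIPT = auto()
    FINAL_TRANSCRIPT = auto()
    END_OF_SPEECH = auto()


@dataclass
class Event:
    """Stand-in for stt.SpeechEvent (assumption, not the real class)."""
    type: EventType
    text: str = ""


async def with_end_of_speech(inner: AsyncIterator[Event]) -> AsyncIterator[Event]:
    """Forward every event and add END_OF_SPEECH after each final transcript."""
    async for event in inner:
        yield event
        if event.type is EventType.FINAL_TRANSCRIPT:
            yield Event(EventType.END_OF_SPEECH)


async def fake_soniox_stream() -> AsyncIterator[Event]:
    """Hypothetical inner stream: one interim result, then one final result."""
    yield Event(EventType.INTERIM_TRANSCRIPT, "hel")
    yield Event(EventType.FINAL_TRANSCRIPT, "hello")


async def collect() -> List[EventType]:
    return [e.type async for e in with_end_of_speech(fake_soniox_stream())]


forwarded = asyncio.run(collect())
print([t.name for t in forwarded])
# ['INTERIM_TRANSCRIPT', 'FINAL_TRANSCRIPT', 'END_OF_SPEECH']
```

Interim events pass through unchanged, so only the end of an utterance produces the extra event.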
### Fix 2: Track and Emit RecognitionUsage Metrics
    # Import
    from livekit.plugins.deepgram._utils import PeriodicCollector

    # In __init__
    self._audio_duration_collector = PeriodicCollector(
        callback=self._on_audio_duration_report,
        duration=5.0,  # Report every 5 seconds
    )

    # Add callback method
    def _on_audio_duration_report(self, duration: float) -> None:
        usage = stt.RecognitionUsage(audio_duration=duration)
        self._event_ch.send_nowait(
            stt.SpeechEvent(
                type=stt.SpeechEventType.RECOGNITION_USAGE,
                recognition_usage=usage,
            )
        )

    # In push_frame
    def push_frame(self, frame: rtc.AudioFrame) -> None:
        # Track duration
        frame_duration = frame.samples_per_channel / frame.sample_rate
        self._audio_duration_collector.push(frame_duration)
        # ... rest of push_frame logic

    # In aclose
    async def aclose(self) -> None:
        # Emit final metrics
        final_duration = self._audio_duration_collector.get_total()
        if final_duration > 0:
            self._on_audio_duration_report(final_duration)
        await self._audio_duration_collector.aclose()
        # ... rest of cleanup
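The accounting in Fix 2 can be sketched without any LiveKit dependency. `DurationCollector` below is a hypothetical stand-in written for this example, not the Deepgram plugin's `PeriodicCollector`, and the injectable `clock` parameter is an assumption added so the behaviour can be checked deterministically. The frame size (0.5 s at 16 kHz) is chosen only so the arithmetic is exact; real frames are much shorter.

```python
import time
from typing import Callable, List


class DurationCollector:
    """Hypothetical accumulator: sums durations and reports them periodically."""

    def __init__(
        self,
        callback: Callable[[float], None],
        *,
        interval: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._callback = callback
        self._interval = interval
        self._clock = clock
        self._total = 0.0
        self._last_flush = clock()

    def push(self, seconds: float) -> None:
        """Add one frame's duration; report once the interval has elapsed."""
        self._total += seconds
        if self._clock() - self._last_flush >= self._interval:
            self.flush()

    def flush(self) -> None:
        """Report whatever has accumulated (if anything) and reset."""
        if self._total > 0:
            self._callback(self._total)
            self._total = 0.0
        self._last_flush = self._clock()


def frame_duration(samples_per_channel: int, sample_rate: int) -> float:
    """Duration of one audio frame in seconds."""
    return samples_per_channel / sample_rate


reports: List[float] = []
fake_now = [0.0]  # simulated clock so the example does not sleep
collector = DurationCollector(reports.append, interval=5.0, clock=lambda: fake_now[0])

# Push 12 frames of 0.5 s each (6 s of audio); the clock advances with the audio.
for _ in range(12):
    fake_now[0] += 0.5
    collector.push(frame_duration(8000, 16000))
collector.flush()  # final report on close, like the aclose step above

print(reports)  # [5.0, 1.0]
```

The reported durations sum to the total audio pushed, which is the property that matters for billing.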
Additional Context
No response
Screenshots and Recordings
No response
Hi, we are facing this issue as well. It gives the impression that the agent is not hearing the user and is stuck.
With the Soniox v3 model and plugin, we also notice that the agent sometimes sounds like it is hearing itself while talking. We are curious whether certain events are being sent by Soniox, or whether it's related to this issue. On occasion we see the agent interrupting itself with Soniox. Initially we thought it was caused by noise cancellation, but after switching to nova-3-general from Deepgram, that problem was gone. Not sure if it's related to this.
Our use case: we are using
- LiveKit End of Turn MultiLingual model
- Silero VAD from the LiveKit plugins, passed through to Soniox
So, from what I understand, this issue applies mainly if you use Soniox End of Turn capabilities with STT.
Hi
We are experiencing the same issue. In our setup, this means about 0.5s latency increase because the agent has to wait for silence via vad, and this has a great impact. Would be absolutely amazing if we could use the
I see the interface detects the ends fine; it's just not passed on correctly to the rest of the stack.
I hope this bug gets fixed soon. Thank you