Why does the STT still receive a lot of silent audio?

Open zhaojiangbing opened this issue 1 month ago • 8 comments

Feature Type

Nice to have

Feature Description

The VAD should filter out silent audio, but in my STT module implementation I still receive silent audio. Why is that?

Workarounds / Alternatives

No response

Additional Context

No response

zhaojiangbing avatar Dec 11 '25 09:12 zhaojiangbing

Hi, do you mean you pre-process audio in stt_node? Could you share more details on what you are trying to accomplish?

tinalenguyen avatar Dec 11 '25 20:12 tinalenguyen

The STT module comes after the VAD, so in theory the STT module shouldn't receive a large amount of silent audio, right?

zhaojiangbing avatar Dec 12 '25 01:12 zhaojiangbing

Are you using a streaming or a non-streaming STT? With a streaming STT, all audio frames are sent to the STT. If it's a non-streaming STT, could you share some of the audio clips the STT received so we can better understand the issue?

longcw avatar Dec 12 '25 03:12 longcw

I'm using a streaming STT. In that case, how do I tell silent audio apart from audio that contains speech?

zhaojiangbing avatar Dec 12 '25 03:12 zhaojiangbing

With a streaming STT, it's the STT's responsibility to detect speech and the end of the user's turn; otherwise, you can use non-streaming mode:

  • streaming: all audio frames are sent to the STT in stream mode, and the STT returns transcripts whenever it detects speech
  • non-streaming: only the speech clips detected by the VAD are sent to the STT
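
To make the difference between the two modes concrete, here is a minimal, standard-library-only sketch. It is not the framework's actual pipeline: `is_speech`, `route_streaming`, `route_non_streaming`, the frame representation, and the energy threshold are all hypothetical names and values chosen for illustration, and the energy check merely stands in for a real VAD model.

```python
from typing import Iterable, List

Frame = List[int]  # one frame of 16-bit PCM samples (hypothetical representation)


def is_speech(frame: Frame, threshold: float = 500.0) -> bool:
    """Toy energy check standing in for a real VAD model (assumption)."""
    if not frame:
        return False
    mean_abs = sum(abs(s) for s in frame) / len(frame)
    return mean_abs >= threshold


def route_streaming(frames: Iterable[Frame]) -> List[Frame]:
    """Streaming mode: every frame is forwarded, silence included."""
    return list(frames)


def route_non_streaming(frames: Iterable[Frame]) -> List[List[Frame]]:
    """Non-streaming mode: only contiguous speech segments are forwarded."""
    segments: List[List[Frame]] = []
    current: List[Frame] = []
    for frame in frames:
        if is_speech(frame):
            current.append(frame)
        elif current:
            # A silent frame closes the speech segment collected so far.
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


silence = [0] * 160
speech = [1000, -1000] * 80
frames = [silence, speech, speech, silence, silence, speech]

streamed = route_streaming(frames)      # all six frames, including silence
segments = route_non_streaming(frames)  # two speech segments, no silence
```

In this sketch, a streaming STT would see every frame and would have to do its own speech detection, while a non-streaming STT would only ever be handed the speech segments.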

longcw avatar Dec 12 '25 03:12 longcw

How do I set up or configure non-streaming mode?

zhaojiangbing avatar Dec 12 '25 04:12 zhaojiangbing

[Image] This is my code.

zhaojiangbing avatar Dec 12 '25 04:12 zhaojiangbing

If it's a custom STT, it should declare itself as non-streaming in its capabilities, for example via STTCapabilities.streaming=False.
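
As a rough illustration of that suggestion, the sketch below uses a local stand-in for `STTCapabilities` so that it runs without the framework installed. Only the `streaming` field comes from this thread; the `CustomSTT` class, its `recognize` method, and the `needs_vad_segmentation` helper are hypothetical, and the real base class and method signatures should be checked against the library's documentation.

```python
from dataclasses import dataclass


@dataclass
class STTCapabilities:
    # Local stand-in for the library type (assumption); only the
    # `streaming` field is taken from the discussion above.
    streaming: bool


class CustomSTT:
    """Hypothetical custom STT that declares itself non-streaming."""

    def __init__(self) -> None:
        # streaming=False signals that this STT expects whole speech
        # clips rather than a continuous stream of frames.
        self.capabilities = STTCapabilities(streaming=False)

    def recognize(self, segment: bytes) -> str:
        # Placeholder: a real implementation would call its
        # recognition backend here.
        return f"<transcript of {len(segment)} bytes>"


def needs_vad_segmentation(stt: CustomSTT) -> bool:
    """Hypothetical helper: a non-streaming STT relies on the VAD to cut clips."""
    return not stt.capabilities.streaming


custom_stt = CustomSTT()
```

With the flag set this way, the expectation from the explanation above is that only VAD-detected speech clips reach the STT.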

longcw avatar Dec 15 '25 03:12 longcw