Concentus Voice Activity Detection

I would like to use Concentus in an app that does Speech-To-Text conversion. I need to be able to detect the end of sentences by monitoring voice activity and identifying segments of speech terminated by periods of silence. I know Opus has Voice Activity Detection, but looking through the Concentus source code, VAD seems to only be used in internal classes for DTX, with no exposed public classes/methods. Ideally I'd be able to poll the encoder and get a count of recent consecutive silence frames, then capture the sentence after the silence frame threshold has been reached, and then submit that sentence to the STT engine.
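The segmentation step described above does not depend on where the per-frame VAD decision comes from, so it can be sketched on its own. The following is a minimal, language-neutral sketch in Python (it should port directly to C#). The class name `SentenceSegmenter`, the `push` method and the default threshold of 25 frames are all hypothetical and are not part of Concentus or Opus.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SentenceSegmenter:
    """Groups audio frames into utterances that end with a run of silent frames."""

    # Hypothetical default: 25 frames of 20 ms each is 500 ms of silence.
    silence_threshold: int = 25
    _frames: List[bytes] = field(default_factory=list)
    _silence_run: int = 0
    _heard_speech: bool = False

    def push(self, frame: bytes, is_speech: bool) -> Optional[List[bytes]]:
        """Feed one frame and its VAD decision.

        Returns the finished utterance once enough silence follows speech,
        otherwise None.
        """
        if is_speech:
            self._heard_speech = True
            self._silence_run = 0
        else:
            self._silence_run += 1

        if not self._heard_speech:
            return None  # drop silence that precedes any speech

        self._frames.append(frame)
        if self._silence_run >= self.silence_threshold:
            # Trim the trailing silence; short pauses inside the utterance stay.
            utterance = self._frames[: len(self._frames) - self._silence_run]
            self._frames = []
            self._silence_run = 0
            self._heard_speech = False
            return utterance
        return None
```

Each captured frame is passed to `push` together with its speech/silence flag, and a non-`None` return value is the list of frames to hand to the STT engine. The threshold would need tuning against the frame duration in use.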

Is there any way to get access to the built-in VAD status on the encoder? Or any other way to achieve what I want to achieve?

And thank you for this library!!!! :)

Feb 16 '18 20:02 sjpritchard

Yeah, actually you're not the first person to suggest this in regards to Opus. Here's a mailing list discussion I found as an example.

In principle it should be easy to expose VAD state; you'd just add a getter in OpusEncoder.cs that pulls the value out of the SILK state. A few caveats come with this approach though:

You'd have to run the encoder in a specific mode for the results to be valid. I believe ForceMode = Silk, Bitrate < 40 kbps, Complexity > 4 is close to what you would need.
You'd have to actually be calling Encode() on some audio in order to do this processing, which is non-trivial in terms of processor resources, especially for high-complexity SILK, which is the slowest part of the whole codec. Possibly you could rip out other costly parts of the codec if you really wanted to optimize this process and didn't care about actually encoding anything, but it would be messy at best.

There are a few alternatives as well:

As Jean-Marc suggested in the mailing list, there is a separate VAD process that runs in Analysis.cs which can also expose a speech probability. This uses a different approach (a multi-layer perceptron) but should be able to give you reasonable numbers much more quickly than SILK, if you can manage to run the analysis outside of the rest of the codec.
In the past I have successfully integrated PocketSphinx (from the CMU Sphinx project) into C# projects using P/Invoke. PocketSphinx can be configured as a phonetic voice activity detector, and it would use a fraction of the CPU that the Opus encoder would.

Feb 17 '18 02:02 lostromb

Thanks for the suggestions - I'll take a look. I was also looking through the Opus RFC and wondered if I might be able to directly inspect each encoded frame, as it appears that the SILK layer of each frame has a VAD flag set. If each frame covers a constant time period, I might be able to use this flag as a counter.
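As a rough illustration of that idea: RFC 6716 (section 4.2.3) notes that the SILK header bits, starting with one VAD bit per SILK frame, can be read directly from the most significant bits of the first byte of compressed frame data without running the range decoder. Below is a minimal sketch in Python, under several assumptions: it only handles code-0 packets (one Opus frame per packet), it only looks at the mono/mid channel flags, and it treats a packet consisting of just the TOC byte as silence. The function name `silk_vad_active` is hypothetical and not part of Concentus.

```python
from typing import Optional


def silk_vad_active(packet: bytes) -> Optional[bool]:
    """Return True/False for the SILK VAD flags of an Opus packet, or None if unknown.

    Assumption: only code-0 packets and only the mono/mid channel are handled.
    """
    if not packet:
        return None
    toc = packet[0]
    config = toc >> 3              # top 5 bits: mode, bandwidth, frame size
    frame_count_code = toc & 0x03  # low 2 bits: how frames are packed
    if config >= 16:
        return None                # CELT-only: no SILK layer, so no VAD flag
    if frame_count_code != 0:
        return None                # multi-frame packets need full packet parsing
    if len(packet) < 2:
        return False               # assumption: a TOC-only packet counts as silence

    if config < 12:
        frame_ms = (10, 20, 40, 60)[config & 0x03]  # SILK-only configurations
    else:
        frame_ms = (10, 20)[config & 0x01]          # hybrid configurations
    # A 40 ms or 60 ms Opus frame carries 2 or 3 SILK frames, one VAD bit each.
    silk_frames = {10: 1, 20: 1, 40: 2, 60: 3}[frame_ms]

    # The VAD bits are the most significant bits of the first payload byte.
    return (packet[1] >> (8 - silk_frames)) != 0
```

A `None` result means the packet cannot be judged this way, which is why the encoder would need to stay in SILK or hybrid mode for the flag to be available on every frame.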

Feb 17 '18 02:02 sjpritchard

Hello @sjpritchard

Were you successful? I also want to do the same thing. Can you tell me how you achieved it? Regards,

Apr 19 '20 15:04 viju2008

Hello,

Any news on that? Thank you for your help!

Best regards, julian-w

May 25 '21 07:05 julian-w