NeMo ASR: Disfluency Detection / Sentence Completion

Hello,

I am currently integrating ASR, LLM, and TTS, and I am specifically interested in detecting when a sentence is complete so that interactions feel more realistic. I believe one effective approach is to detect speech disfluencies, such as an "umm", and then delay the response slightly to give the speaker more time to think.
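To make the idea concrete, here is a minimal sketch of a filler-word hold-off applied to the ASR transcript text. It does not use any NeMo or Riva API; the filler list, the delay values, and the function names are all hypothetical and would need tuning per language and use case.

```python
import re

# Hypothetical filler list and delays; none of these come from NeMo or Riva.
FILLERS = {"um", "umm", "uh", "uhh", "er", "erm", "hmm"}
BASE_DELAY_S = 0.8   # assumed default wait after silence before replying
EXTRA_DELAY_S = 1.0  # assumed extra wait when the speaker trails off on a filler


def ends_with_filler(transcript: str) -> bool:
    """Return True if the last word of the transcript is a filler word."""
    words = re.findall(r"[a-z']+", transcript.lower())
    return bool(words) and words[-1] in FILLERS


def response_delay(transcript: str) -> float:
    """Seconds to wait before handing the transcript to the LLM."""
    if ends_with_filler(transcript):
        return BASE_DELAY_S + EXTRA_DELAY_S
    return BASE_DELAY_S
```

A text-based check like this only works if the ASR model actually transcribes fillers; many models drop them, in which case an acoustic disfluency model would be needed instead.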

Before tackling this, I have a few questions. Do the NeMo/Riva models include any approach or technique for sentence completion detection that I could leverage? I have found some open-source projects that publish models with weights, but I would prefer to build something compatible with NVIDIA Riva. What approach would you recommend, given that this may deviate from the standard pipeline and that I am unsure whether I can use the Riva client for this purpose?

While reading about Riva, I also found this (link), which seems to indicate that it already uses an algorithm to detect the end of a sentence in order to trigger a call to the punctuator model. According to the documentation, it marks the end of a sentence when 98% of the frames in an 800 ms window are silent characters. Being able to tune this would be very useful for providing more realistic experiences. However, it is not clear to me whether this is the default behavior, what the default values are, or whether the models must be redeployed to change these values.

Thank you in advance for your guidance and suggestions.

Mar 30 '24 02:03 rodrigoGA

The values mentioned at that link are the defaults, and the functionality is enabled by default. If you want to change the values, you can follow the documentation, regenerate the RMIR, and redeploy it.

There is also a shortcut if you want to modify the values quickly and experiment. Assuming you are using the latest 2.15.0 release, you can edit the file riva_bls_config.yaml in your model repository, where you will find the relevant parameters (listed below). You can adjust these to suit your use case. After modifying the file, restart the Riva server for the new values to take effect.

endpointing:
  endpointing_type: greedy_ctc
  residue_blanks_at_end: 0
  residue_blanks_at_start: -16
  start_history: 300
  start_th: 0.2
  stop_history: 800
  stop_th: 0.98
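
As a rough illustration of what stop_history and stop_th control, here is a minimal sketch of a sliding-window blank-ratio check. It is not Riva's implementation: the class name is hypothetical, and the 80 ms frame duration is an assumption (the real value depends on the acoustic model).

```python
from collections import deque


class EndOfUtteranceDetector:
    """Sliding-window blank-ratio check, modeled on stop_history / stop_th."""

    def __init__(self, stop_history_ms=800, stop_th=0.98, frame_ms=80):
        # frame_ms is an assumption; the real frame duration depends on the model.
        self.window = deque(maxlen=max(1, stop_history_ms // frame_ms))
        self.stop_th = stop_th

    def push(self, is_blank: bool) -> bool:
        """Add one frame; return True when the window is full and mostly blank."""
        self.window.append(is_blank)
        if len(self.window) < self.window.maxlen:
            return False
        blank_ratio = sum(self.window) / len(self.window)
        return blank_ratio >= self.stop_th


# Five speech frames followed by ten blank frames.
frames = [False] * 5 + [True] * 10

default_detector = EndOfUtteranceDetector()
default_results = [default_detector.push(f) for f in frames]

# A longer stop_history needs more trailing silence before it fires.
patient_detector = EndOfUtteranceDetector(stop_history_ms=1600)
patient_results = [patient_detector.push(f) for f in frames]
```

Under these assumptions, raising stop_history makes the endpointer wait for a longer run of blank frames before declaring the end of the utterance, which is the knob to try first for speakers who pause mid-sentence.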

Apr 04 '24 16:04 virajkarandikar

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

May 05 '24 01:05 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

May 12 '24 01:05 github-actions[bot]

Thank you very much for the response, @virajkarandikar, but I cannot find the file. Where should it be located?

I have another question: in Spanish, I have found that the transcriber often adds words that were not spoken. Do you know whether I can enable VAD from the configuration file? Or do I have to uninstall Riva, perform a fresh installation, and deploy a model with VAD enabled to test it?

May 23 '24 13:05 rodrigoGA