Provide a simple interface for users of Reverb
Motivation
Reverb models currently require a few steps to use:
- Downloading the model from HuggingFace
- Interacting with them requires the recognize_wav.py script.
We should have a simpler way for users to load the model for transcription.
Outcomes of this PR
PIP-able Package for ASR
The pyproject.toml file in asr is updated so that running pip install installs the reverb package into your Python environment. This will make it easier to interact with the reverb code from anywhere.
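A minimal install sketch, assuming you are working from a local checkout of the repository (directory names are whatever your clone uses):

cd asr          # directory containing the updated pyproject.toml
pip install .   # installs the reverb package and its command-line entry point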
ReverbASR
This PR introduces the ReverbASR class, which sets up all necessary files in an object that a user can then use to transcribe recordings anywhere via .transcribe or .transcribe_modes. These methods also give users the full flexibility of modifying the output that recognize_wav.py offers.
Automatic Model Downloading
Assuming you have set up the Hugging Face CLI, you can now use mdl = load_model("reverb_asr_v1") to download the reverb model to your home cache at ~/.cache/reverb. Once the model has been downloaded, subsequent loads reuse the cached copy.
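A minimal loading sketch, assuming Hugging Face credentials are already configured (for example via huggingface-cli login); the import path below follows the module named in the review thread further down and may differ in your install:

from wenet.cli.reverb import load_model  # assumed import path

# First call downloads reverb_asr_v1 into ~/.cache/reverb; later calls reuse the cache.
mdl = load_model("reverb_asr_v1")
print(mdl.transcribe("example1.wav"))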
recognize_wav.py -> reverb
This PR updates recognize_wav.py to use the new ReverbASR class and includes it as a binary within the reverb package. You can now call python wenet/bin/recognize_wav.py from within the asr directory, or reverb from anywhere. All previous behavior is retained; however, a new --model argument is added that lets a user specify either the path to a reverb model directory containing the checkpoint and config, or the name of a pretrained reverb_asr model (for now that's only reverb_asr_v1).
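For example, --model can also point at a local model directory instead of a pretrained name (the path below is purely illustrative):

reverb --model /path/to/reverb_model_dir --audio_file example1.wav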
Examples
Simple transcribe
>>> mdl = load_model('reverb_asr_v1')
>>> mdl.transcribe("example1.wav")
"this is is is an example output"
This is equivalent to:
reverb --model reverb_asr_v1 --audio_file example1.wav
Transcribe Nonverbatim
>>> mdl = load_model('reverb_asr_v1')
>>> mdl.transcribe("example1.wav", verbatimicity=0.0)
"this is an example output"
This is similar to:
reverb --model reverb_asr_v1 --audio_file example1.wav --verbatimicity 0.0
Transcribe Multiple Modes
>>> mdl = load_model('reverb_asr_v1')
>>> mdl.transcribe_modes("example1.wav", ["ctc_prefix_beam_search", "attention_rescoring"])
["this is is is an example output", "this is is is an example output"]
This is similar to:
reverb --model reverb_asr_v1 --audio_file example1.wav --modes ctc_prefix_beam_search attention_rescoring
How would we use streaming based on the refactors in this PR?
I think using .transcribe from ReverbASR in asr.wenet.cli.reverb with simulate_streaming is the way to go, although it's still not clear.
It would help if you could add an example in a standalone streaming.py, or pseudocode in a README, along the lines of:
model = how to initialize model for streaming ()
for audio_chunk in audio_stream:
    transcript_segment = how_to_call_rev_model_in_streaming_context(audio_chunk)
I can definitely provide some guidance on how to set up streaming. Just to respond here, though: from my view it won't be initialized or run any differently! The key thing will be to follow what you have in your example: load the model once and then just call .transcribe on each audio chunk.
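A minimal sketch of that pattern, assuming the import path mentioned in the thread above and that .transcribe forwards a simulate_streaming option to the underlying wenet decoder (both are assumptions, not confirmed API); the chunk paths are placeholders for however the application buffers incoming audio into short wav files:

from wenet.cli.reverb import load_model  # assumed import path

# Load the model once, exactly as for offline transcription.
mdl = load_model("reverb_asr_v1")

# Placeholder: short wav segments produced by the application's audio buffering.
chunk_paths = ["chunk_000.wav", "chunk_001.wav", "chunk_002.wav"]

segments = []
for chunk in chunk_paths:
    # Each chunk goes through the same .transcribe call used offline;
    # simulate_streaming=True is an assumption about how the wenet option is exposed.
    segments.append(mdl.transcribe(chunk, simulate_streaming=True))

print(" ".join(segments))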