Provide a simple interface for users of Reverb
Motivation
Reverb models currently require a few steps to use:
- Downloading the model from HuggingFace
- Interacting with them requires the recognize_wav.py script.
We should have a simpler way for users to load the model for transcription.
Outcomes of this PR
PIP-able Package for ASR
The pyproject.toml file in asr is updated so that running pip install installs the reverb package into your Python environment. This will make it easier to interact with the reverb code from anywhere.
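A minimal install sketch, assuming you are working from a local checkout of the repository (directory names are whatever your clone uses):

cd asr          # directory containing the updated pyproject.toml
pip install .   # installs the reverb package and its command-line entry point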
ReverbASR
This PR introduces the ReverbASR class, which sets up all necessary files in an object that a user can then use to transcribe recordings anywhere via .transcribe or .transcribe_modes. These methods also give users the full flexibility of modifying the output that recognize_wav.py offers.
Automatic Model Downloading
Assuming you have set up the Hugging Face CLI, you can now use mdl = load_model("reverb_asr_v1") to download the reverb model to your home cache at ~/.cache/reverb. Once the model has been downloaded, subsequent loads reuse the cached copy.
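A minimal loading sketch, assuming Hugging Face credentials are already configured (for example via huggingface-cli login); the import path below follows the module named in the review thread further down and may differ in your install:

from wenet.cli.reverb import load_model  # assumed import path

# First call downloads reverb_asr_v1 into ~/.cache/reverb; later calls reuse the cache.
mdl = load_model("reverb_asr_v1")
print(mdl.transcribe("example1.wav"))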
recognize_wav.py -> reverb
This PR updates recognize_wav.py to use the new ReverbASR class and includes it as a binary within the reverb package. You can now call python wenet/bin/recognize_wav.py from within the asr directory, or reverb from anywhere. All previous behavior is retained; however, a new --model argument is added that lets a user specify either the path to a reverb model directory containing the checkpoint and config, or the name of a pretrained reverb_asr model (for now that's only reverb_asr_v1).
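For example, --model can also point at a local model directory instead of a pretrained name (the path below is purely illustrative):

reverb --model /path/to/reverb_model_dir --audio_file example1.wav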
Examples
Simple transcribe
>>> mdl = load_model('reverb_asr_v1')
>>> mdl.transcribe("example1.wav")
"this is is is an example output"
This is equivalent to:
reverb --model reverb_asr_v1 --audio_file example1.wav
Transcribe Nonverbatim
>>> mdl = load_model('reverb_asr_v1')
>>> mdl.transcribe("example1.wav", verbatimicity=0.0)
"this is an example output"
This is similar to:
reverb --model reverb_asr_v1 --audio_file example1.wav --verbatimicity 0.0
Transcribe Multiple Modes
>>> mdl = load_model('reverb_asr_v1')
>>> mdl.transcribe_modes("example1.wav", ["ctc_prefix_beam_search", "attention_rescoring"])
["this is is is an example output", "this is is is an example output"]
This is similar to:
reverb --model reverb_asr_v1 --audio_file example1.wav --modes ctc_prefix_beam_search attention_rescoring
How would we use streaming based on the refactors in this PR?
I think using .transcribe from ReverbASR in asr.wenet.cli.reverb with simulate_streaming is the way to go, although it's still not clear.
It would help if you could add an example in a standalone streaming.py, or pseudocode in a README, along the lines of:
model = how to initialize model for streaming ()
for audio_chunk in audio_stream:
    transcript_segment = how_to_call_rev_model_in_streaming_context(audio_chunk)
I can definitely provide some guidance on how to set up streaming. Just to respond here, though: from my view it won't be initialized or run any differently! The key thing will be to follow what you have in your example: load the model once and then just call .transcribe on each audio chunk.
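A minimal sketch of that pattern, assuming the import path mentioned in the thread above and that .transcribe forwards a simulate_streaming option to the underlying wenet decoder (both are assumptions, not confirmed API); the chunk paths are placeholders for however the application buffers incoming audio into short wav files:

from wenet.cli.reverb import load_model  # assumed import path

# Load the model once, exactly as for offline transcription.
mdl = load_model("reverb_asr_v1")

# Placeholder: short wav segments produced by the application's audio buffering.
chunk_paths = ["chunk_000.wav", "chunk_001.wav", "chunk_002.wav"]

segments = []
for chunk in chunk_paths:
    # Each chunk goes through the same .transcribe call used offline;
    # simulate_streaming=True is an assumption about how the wenet option is exposed.
    segments.append(mdl.transcribe(chunk, simulate_streaming=True))

print(" ".join(segments))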