
Identifying Spoken Language

Open Sasha-Bachynskyi opened this issue 3 years ago • 1 comments

Hello, developers. Is there a model or some other way to identify the spoken language? For example, how can I tell whether a speaker is speaking English or Russian? I looked through the tutorials and found nothing. I would appreciate any help.

Sasha-Bachynskyi avatar Sep 08 '22 11:09 Sasha-Bachynskyi

@fayejf is the model published? Please point to the docs.

nithinraok avatar Sep 21 '22 03:09 nithinraok

It looks like there is a labeller, see https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_classification/speech_to_label.py#L81

jnnnnn avatar Sep 30 '22 10:09 jnnnnn

@jnnnnn @Sasha-Bachynskyi The model is published. Thanks for your patience. https://github.com/NVIDIA/NeMo/pull/5080

fayejf avatar Oct 05 '22 17:10 fayejf

Hi, @fayejf!

I can't figure out how to use this model. There is only an example of how to initialize it. Could you give an example of which method I should call and how to pass in the audio file?

Thank you in advance for helping!

Sasha-Bachynskyi avatar Nov 15 '22 08:11 Sasha-Bachynskyi

Hi @Sasha-Bachynskyi, the PR that adds this information to the docs should be merged soon: https://github.com/NVIDIA/NeMo/pull/5366

You can infer the label using the EncDecSpeakerLabelModel class: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/api.html#nemo.collections.asr.models.EncDecSpeakerLabelModel

For inference on a single audio file, use the get_label method. For inference on multiple files, use batch_inference instead.

nithinraok avatar Nov 15 '22 18:11 nithinraok

Hi @nithinraok, I'm sorry for bothering you. I want to identify the spoken language in a single file.

I followed the instructions above. Below is my code:

import nemo.collections.asr as nemo_asr

langid_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name="langid_ambernet")

lang = langid_model.get_label('audio.wav')

But I get an error:

Traceback (most recent call last):
  File "/home/denis/test_lang/test-lang.py", line 5, in <module>
    lang = vad_model.get_label('audio.wav')
  File "/home/denis/anaconda3/envs/nemo2/lib/python3.9/site-packages/nemo/collections/asr/models/label_models.py", line 455, in get_label
    _, logits = self.infer_file(path2audio_file=path2audio_file)
  File "/home/denis/anaconda3/envs/nemo2/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/denis/anaconda3/envs/nemo2/lib/python3.9/site-packages/nemo/collections/asr/models/label_models.py", line 427, in infer_file
    audio = librosa.core.resample(audio, sr, target_sr)
TypeError: resample() takes 1 positional argument but 3 were given

It seems that something is wrong with the librosa call.

System info: GPU: NVIDIA A40; NeMo: main branch, installed on February 22, 2023; librosa: 0.10.0

What could be causing this? I'd appreciate any help.

Sasha-Bachynskyi avatar Feb 22 '23 12:02 Sasha-Bachynskyi

It looks like the newest librosa version requires these arguments to be passed by keyword. Downgrade librosa or apply the fix from https://github.com/NVIDIA/NeMo/pull/6086
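To illustrate the cause, here is a minimal sketch that reproduces the same TypeError without librosa installed. The resample function below is a hypothetical stand-in that only mirrors the keyword-only shape of the resample() call in the traceback; its naive linear interpolation is an assumption for illustration and is not librosa's algorithm. With librosa 0.10 itself, the equivalent keyword call should be librosa.resample(audio, orig_sr=sr, target_sr=target_sr).

```python
def resample(y, *, orig_sr, target_sr):
    """Hypothetical stand-in with a keyword-only signature.

    Everything after `*` must be passed by name, which is what makes
    a positional call fail. Uses naive linear interpolation.
    """
    n_out = int(len(y) * target_sr / orig_sr)
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr      # position in the input signal
        lo = int(pos)
        hi = min(lo + 1, len(y) - 1)
        frac = pos - lo
        out.append(y[lo] * (1 - frac) + y[hi] * frac)
    return out


audio = [0.0, 1.0, 2.0, 3.0]
sr, target_sr = 4, 2

# Old positional style: fails against a keyword-only signature.
try:
    resample(audio, sr, target_sr)
    positional_error = None
except TypeError as exc:
    positional_error = str(exc)

# Keyword style: works.
resampled = resample(audio, orig_sr=sr, target_sr=target_sr)
```

The positional call raises the same kind of message as the traceback above, while the keyword call succeeds, so switching the call site to keyword arguments should be enough if downgrading is not an option.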

nithinraok avatar Feb 22 '23 23:02 nithinraok