Nithin Rao
The VAD and speaker embedding extractor models you used are outdated. Which NeMo version are you using? Please use [vad_telephony_marblenet](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/vad_telephony_marblenet) for VAD and [titanet_large](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/titanet_large) for the speaker embedding extractor.
Hi @alamnasim, we train speaker embedding extractors as speaker identification tasks. In these tasks, in order to evaluate your model on the dev set, the dev set needs to...
Did you change the learning rate? Try 0.001 as the learning rate with wd=0.0002.
Please share your config file if possible.
I would start experimenting by increasing batch_size to 128 and lowering lr to 0.0001, since you have only one GPU. Though adamw doesn't show changes with weight_decay, I would still...
I was training with the Adam optimizer and the loss seems fine to me. How many samples are in your dataset?
Thanks. Can you send a PR? I will review the changes.
You can use this function to get batch embeddings: https://github.com/NVIDIA/NeMo/blob/4cd9b3449cbfedc671348fbabbe8e3a55fbd659d/nemo/collections/asr/models/label_models.py#L420 Once you have the embeddings, you can compare them using the cosine similarity score. For example, you can view this script...
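As a rough illustration of the comparison step, here is a minimal sketch of a cosine similarity score in plain Python. It does not call NeMo; the function name `cosine_similarity` and the 3-element toy vectors are assumptions made for the example, standing in for the embeddings you would get from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for speaker embeddings (hypothetical values, not model output).
emb_a = [1.0, 0.0, 2.0]
emb_b = [2.0, 0.0, 4.0]  # same direction as emb_a, different magnitude
emb_c = [0.0, 3.0, 0.0]  # orthogonal to emb_a

same_direction = cosine_similarity(emb_a, emb_b)
orthogonal = cosine_similarity(emb_a, emb_c)
```

A score close to 1 means the two embeddings point in nearly the same direction; the threshold for deciding "same speaker" depends on your data and is not fixed here.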
It depends on your use case. You could try averaging the cosine scores, or averaging the embeddings of each utterance per speaker (if you have many samples per speaker). There is no...
For the above example, both are 192-dimensional vectors, so you can average them element-wise along this dimension. You would get a 192-dimensional embedding. There is no constraint on the duration of the file,...
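To make the two options concrete, the sketch below contrasts averaging the embeddings of a speaker (then scoring once) with averaging the cosine scores against each utterance. It is plain Python rather than NeMo code; the helper names `mean_embedding` and `mean_score` are hypothetical, and 4-element toy vectors stand in for the 192-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_embedding(embeddings):
    """Element-wise mean; the result has the same dimension as each input."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

def mean_score(test_emb, enrolled):
    """Average of the cosine scores against each enrolled utterance."""
    return sum(cosine_similarity(test_emb, e) for e in enrolled) / len(enrolled)

# Toy stand-ins for per-utterance embeddings of one speaker (hypothetical values).
enrolled = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
test_emb = [1.0, 1.0, 0.0, 0.0]

centroid = mean_embedding(enrolled)                 # option 1: average embeddings
score_from_centroid = cosine_similarity(test_emb, centroid)
score_from_average = mean_score(test_emb, enrolled)  # option 2: average scores
```

The two options generally give different numbers, as they do for these toy vectors, so it is worth checking both on held-out data for your use case.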