
[Question] Is this training log normal or not?

Open hungnvk54 opened this issue 10 months ago • 10 comments

Hi,

First of all, I want to thank you for your contribution.

Today I used your example to retrain voxceleb/ResNet34. The dataset is the default vox1 and vox2 (downloaded via your default utils scripts) plus my private 800 speaker IDs (I mixed my private data with the vox2 data).

After training 1 epoch, the training log output is:

[Image: training log screenshot]

The Acc is only 0.015 - 0.04 in the first epoch. Is this normal or not?

(I ask because the training time is very long (about one week), and waiting until training finishes before checking is not a good idea.)

Thank you in advance.

hungnvk54 avatar Jun 23 '25 08:06 hungnvk54

@hungnvk54 I think it is normal. The picture shows a very early stage of training: only 100 batches have passed, and not even one complete epoch has finished.

cdliang11 avatar Jun 23 '25 09:06 cdliang11
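(For context: at the very start of training, a softmax classifier over thousands of speakers is essentially guessing, so top-1 accuracy near chance level is expected. A minimal sketch of that chance level, assuming the training set is roughly vox2 dev (5994 speakers) plus the 800 private IDs and that Acc in the log is printed as a percentage:)

```python
# Back-of-envelope chance-level accuracy for a randomly initialized
# speaker classifier (numbers are assumptions, not taken from the log).
num_speakers = 5994 + 800          # vox2 dev speakers + private IDs (assumed)
chance_top1 = 1.0 / num_speakers   # probability of guessing the right speaker
print(f"chance top-1 accuracy: {chance_top1:.4%}")  # ~0.0147%
```

Under those assumptions, an Acc around 0.015 after only ~100 batches is consistent with a model that has not yet learned much, which is expected this early.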

I'm using the shard data type. So is the batch size here counted in number of audio files or number of shards?

hungnvk54 avatar Jun 23 '25 09:06 hungnvk54

Number of audio files.

cdliang11 avatar Jun 23 '25 09:06 cdliang11

@cdliang11 Thanks for your response.

If it is counted in audio files, that means we need about 1.5 days/epoch, which is too long.

I'm training on a server with two RTX 2080 Ti GPUs. Is that normal?

And which server was the pretrained model trained on?

hungnvk54 avatar Jun 23 '25 09:06 hungnvk54

  1. It's abnormal.
  2. Pretrained models: https://github.com/wenet-e2e/wespeaker/blob/master/docs/pretrained.md

cdliang11 avatar Jun 23 '25 10:06 cdliang11

Thanks @cdliang11. We've fixed the issue; the RAM was too small. Now we train on two RTX 3090s and it takes about 3 minutes per 100 batches. Is this normal?

[Image: training log screenshot]

hungnvk54 avatar Jun 24 '25 08:06 hungnvk54
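(For reference, a rough way to turn a per-batch timing like "3 minutes per 100 batches" into an epoch-time estimate; the same arithmetic can sanity-check a figure like the earlier 1.5 days/epoch. The utterance count and batch size below are assumptions for illustration, not values from this issue:)

```python
# Back-of-envelope epoch-time estimate from per-batch timing.
# Concrete numbers here are assumptions for illustration only.
num_utts = 1_092_009                 # roughly VoxCeleb2 dev utterances (assumed)
batch_size = 256                     # per-GPU batch size (assumed)
num_gpus = 2
secs_per_batch = 3 * 60 / 100        # "3 minutes per 100 batches" -> 1.8 s/batch

steps_per_epoch = num_utts / (batch_size * num_gpus)
epoch_hours = steps_per_epoch * secs_per_batch / 3600
print(f"~{steps_per_epoch:.0f} steps/epoch, ~{epoch_hours:.1f} hours/epoch")
```

With these assumed values this comes out to roughly 2,100 steps and about one hour per epoch, so per-batch time is what dominates the wall-clock cost.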

The speed bottleneck is probably related to CPU and I/O; you can try increasing num_workers.

cdliang11 avatar Jun 24 '25 16:06 cdliang11
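(In WeSpeaker the loader settings live in the recipe's YAML config rather than in user code; the snippet below is only a plain PyTorch sketch of the knobs that usually matter when the data pipeline, not the GPU, is the bottleneck. The dataset and parameter values are assumptions for illustration:)

```python
from torch.utils.data import DataLoader, Dataset

class DummyUtts(Dataset):
    """Stand-in dataset; in WeSpeaker the shard dataset plays this role."""
    def __len__(self):
        return 10_000
    def __getitem__(self, idx):
        return idx

# More workers overlap audio decoding/augmentation with GPU compute;
# pin_memory and prefetch_factor help hide host-to-device copies and I/O latency.
loader = DataLoader(
    DummyUtts(),
    batch_size=256,          # assumed value
    num_workers=8,           # raise this if CPUs sit idle and the GPU is starved
    pin_memory=True,
    prefetch_factor=4,       # batches prefetched per worker
    persistent_workers=True, # avoid re-spawning workers every epoch
)
```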

Hi @cdliang11, can you share the config for training ResNet221? In voxceleb/v2 there is no example config for ResNet221. I found the ResNet221_LM model on Hugging Face, but its config seems to be for LM (large-margin fine-tuning), i.e. the post-processing step after training ResNet221, right?

hungnvk54 avatar Jun 25 '25 01:06 hungnvk54

I checked voxceleb_resnet221_LM.yaml on Hugging Face, and I found that it should really be voxceleb_resnet221.yaml; sorry, the file name is probably a typo. You can use it directly: https://huggingface.co/Wespeaker/wespeaker-voxceleb-resnet221-LM/blob/main/voxceleb_resnet221_LM.yaml

cdliang11 avatar Jun 25 '25 04:06 cdliang11

> @hungnvk54 I think it is normal. The picture shows a very early stage of training: only 100 batches have passed, and not even one complete epoch has finished.

[Image: training log screenshot]

Hi @cdliang11 I have a question about my custom model training. I'm in the first epoch and while the loss is decreasing, the accuracy is still 0. Is this normal behavior for the beginning of training, or could this indicate a problem? Thanks for your help!

MM-WW55 avatar Jul 29 '25 07:07 MM-WW55