NS2VC

Is training working in the v4 branch?

Open lpscr opened this issue 2 years ago • 17 comments

Hi! Thank you very much for your work and this amazing repo.

I tried training the v4 branch, and something seems very wrong: after about 3 hours of training nothing changes, and I get only noise at every step. These are the commands I use:

1. `python preprocess.py`
2. `python model1.py`

[image: 29000 steps, v4 branch]

In the v3 or main branch, after some steps I get this:

[image: 5000 steps, v3/main branch]

As you can see, in v4 I get only noise. Am I doing something wrong?

Can you please tell me whether training works in v4, or what I am doing wrong?

Thank you for your time.

lpscr avatar Dec 15 '23 17:12 lpscr

You haven't done anything wrong. Because the v4 model has over 200 million parameters, training is very slow. I am currently experimenting with features such as offset noise, normalization, and classifier-free guidance (CFG) to make training more stable. Your results look quite normal; theoretically, the convergence time of the v4 model is close to that of SD 1.5. The previous three versions used smaller noise and predicted x0, resulting in faster training, whereas v4 employs the classic approach of predicting the noise as the target.
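The difference between the two prediction targets can be sketched numerically. This is a minimal numpy illustration of why a noise-predicting (epsilon) model looks noisy for far longer than an x0-predicting one; the variable names and the single fixed timestep are illustrative assumptions, not the repo's actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

def q_sample(x0, alpha_bar, noise):
    """Forward diffusion: mix clean data with Gaussian noise."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

x0 = rng.normal(size=1000)      # stand-in for clean audio features
noise = rng.normal(size=1000)
alpha_bar = 0.05                # a very noisy timestep

xt = q_sample(x0, alpha_bar, noise)

# v1-v3 style: the network predicts x0 directly, so a small prediction
# error stays a small error in the reconstructed signal.
# v4 style: the network predicts the noise, and x0 is recovered from it:
eps_pred = noise + 0.01 * rng.normal(size=1000)   # small epsilon error
x0_from_eps = (xt - np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)

# At high noise levels the epsilon error is scaled by
# sqrt(1 - alpha_bar) / sqrt(alpha_bar) when recovering x0, so early
# samples from an epsilon-prediction model stay noisy for many more steps.
err = np.abs(x0_from_eps - x0).mean()
```

Here the 0.01 epsilon error is amplified by roughly `sqrt(0.95 / 0.05) ≈ 4.4` in the recovered x0, which matches the observation that v4 needs many more steps before the output stops looking like noise.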

adelacvg avatar Dec 16 '23 03:12 adelacvg

This is so cool! I understand now. I'm going to retrain and see. Thank you very much for the explanation and the quick reply.

lpscr avatar Dec 16 '23 23:12 lpscr

@lpscr were you able to get the model to converge?

rishikksh20 avatar Dec 27 '23 06:12 rishikksh20

@adelacvg I see you updated the model architecture on v4. Is the implementation complete, and does the new model converge faster? I have collected a lot of audio data and am now waiting for GPU availability to start training.

rishikksh20 avatar Jan 08 '24 10:01 rishikksh20

Yes, the previous training process was slow to converge due to issues with the UNet. Additionally, there were semantic problems caused by a bug in the diffusion training architecture taken from ControlNet. The current diffusion training framework is now based on Tortoise, eliminating the semantic faults. Furthermore, the architecture employs transformer blocks without up/down-sampling, leading to much faster convergence.
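The point about constant-resolution transformer blocks can be sketched as follows. This is a toy numpy illustration (the shapes, weight scales, and block layout are assumptions for illustration, not the actual v4 architecture): unlike a UNet, stacking these blocks never changes the temporal resolution of the sequence.

```python
import numpy as np

def attention(x):
    """Single-head self-attention; keeps the (T, D) shape unchanged."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def transformer_block(x, w1, w2):
    """Toy block: self-attention + ReLU MLP, both with residuals."""
    x = x + attention(x)
    x = x + np.maximum(x @ w1, 0.0) @ w2
    return x

rng = np.random.default_rng(0)
T, D = 128, 32                        # sequence length, feature dim
x = rng.normal(size=(T, D))
w1 = rng.normal(size=(D, 4 * D)) * 0.01
w2 = rng.normal(size=(4 * D, D)) * 0.01

# No down/up-sampling path: the sequence stays (T, D) through every block,
# so there is no resolution bottleneck for the denoiser to fight.
for _ in range(4):
    x = transformer_block(x, w1, w2)
```

In a UNet, by contrast, each stage halves the temporal resolution on the way down and must reconstruct it on the way up, which is one place where convergence can stall.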

adelacvg avatar Jan 09 '24 17:01 adelacvg

Thanks :) Are you using HuBERT only for the content vector? Since my use case is a non-English language, I thought of using Whisper layer-24 features rather than HuBERT.

rishikksh20 avatar Jan 09 '24 18:01 rishikksh20

Regarding ContentVec, I chose it primarily to prevent timbre leakage. HuBERT and Whisper both have noticeable timbre-leakage issues when trained with self-supervision. I have trained a model, and although there is some loss in audio quality in zero-shot scenarios, it performs better than the previous model at the same data scale.

adelacvg avatar Jan 17 '24 14:01 adelacvg

Hi @adelacvg, is it possible to also transfer some prosody and style with the NS2VC architecture, not just voice? For simple voice conversion it works well; the voice doesn't match exactly, but it is still fine.

rishikksh20 avatar Feb 28 '24 11:02 rishikksh20

Certainly, but I believe that prosody and speed are better suited for GPT or an acoustic model. The diffusion part, working as a good decoder, should suffice.

adelacvg avatar Feb 28 '24 11:02 adelacvg

Just one more question: do semantic tokens like HuBERT, wav2vec, and ContentVec carry prosody information?

rishikksh20 avatar Feb 28 '24 11:02 rishikksh20

Of course. Prosody encompasses fundamental frequency, pause duration, intonation, and other essential information, and semantic tokens inherently carry duration and intonation information.

adelacvg avatar Feb 28 '24 12:02 adelacvg

Yes, I have the same intuition because pronunciation is an integral part of linguistics.

rishikksh20 avatar Feb 28 '24 12:02 rishikksh20

Hi @adelacvg, have you checked YODAS (https://huggingface.co/datasets/espnet/yodas), a 370k-hour dataset? The data quality is mixed, as some samples contain music or are empty, but it is still good data for VC pretraining. If you are not GPU-poor :cry:, you could pretrain on YODAS :sweat_smile:.

rishikksh20 avatar Feb 29 '24 08:02 rishikksh20

@rishikksh20 Thank you very much for the suggestion. However, I'm currently short on GPU resources, and all GPUs are being used for experiments with the GPT-based AR TTS model. I may train the pre-trained model when GPUs become available.

adelacvg avatar Mar 01 '24 05:03 adelacvg

@adelacvg Everyone is GPU-poor; I am also waiting for my GPU to free up. By the way, how is TTTS training progressing? Do you have any samples to share? I have tested HierSpeech++'s non-autoregressive text-to-vector module together with NS2VC, which together act as an end-to-end TTS, and it performs well. The GPT-based text-to-vector approach I tested before showed a lot of hallucination.

rishikksh20 avatar Mar 01 '24 06:03 rishikksh20

@rishikksh20 The model in the master branch of TTTS is based on Tortoise, and the results are comparable to Tortoise. I have provided a Colab link for testing the pre-trained model. For the v2 version, I would like to use a training method similar to VALL-E's, while still using diffusion as the decoder, in the hope of achieving better zero-shot results.

adelacvg avatar Mar 01 '24 07:03 adelacvg

For v4 I am planning to train on EnCodec features for better speaker generalization, as commented here: https://github.com/adelacvg/NS2VC/issues/16#issuecomment-2084663655 . Has anyone tried this before, or would anyone like to give me a heads-up?

rishikksh20 avatar Apr 30 '24 08:04 rishikksh20