
New non-English voice model from scratch, need good help, please

TheStigh opened this issue 1 year ago · 9 comments

Hi @synesthesiam and other skilled Piper users,

I would like to train a high-quality voice model for Norwegian. Today, only a medium-quality synthesis exists, which is terrible to listen to for us Norwegians. Earlier, I've trained/cloned English-speaking voices using English-voice checkpoints, with good results.

Now I have about 13 hours (approx. 10,000 wav files, 2-10 seconds each) of a studio-quality Norwegian voice, all prepared in LJSpeech format, ready to go. But there is so little information available online about training from scratch, so I have some questions:

  • Is the procedure for training from scratch the same as for fine-tuning, just switched to "from scratch", running the same steps?
  • For a completely new voice, how many epochs would be a good starting point for achieving a high-quality onnx?
  • When I finally have a new voice (female), can I later use this voice as a checkpoint to train a new male voice (Norwegian)?
  • For training a brand-new Norwegian voice, is training from scratch the best option, or is fine-tuning from a non-Norwegian checkpoint still a good alternative? I want the best option.

I have access to multiple GPUs (A4000s) for the training process.

Thanks in advance for any help and support on this!

TheStigh avatar Nov 28 '24 22:11 TheStigh

Hi @TheStigh,

Late answer, so you might have figured it out by now. I have been experimenting with a new Swedish model from scratch, both single- and multi-speaker. These are my experiences.

  • Yes, it is the same procedure to train from scratch: follow the training guide and skip the --resume_from_checkpoint or --resume_from_single_speaker_checkpoint flags (see the sketch below this list)
  • From my experience it starts to be understandable after a couple hundred epochs. For Swedish it has problems with a leading "h" in, for example, "hej" up until epoch 500 or so; from 700 that seems to be pretty much gone. At 700 you still get strange pauses, and not all pronunciations are spot on. According to the training guide you need about 2,000 epochs, but that depends on the size of your dataset as well.
  • Yes, you can add a voice later on. The documentation for PyTorch Lightning states that the ckpt file should contain everything you need to continue, but in "Adding new speaker to the existing model" (#716) it is stated that you need the dataset for the first voice. I can't confirm or deny that; some clarification would be appreciated.
  • I haven't tried, so I can't tell in this case.
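
For reference, a minimal from-scratch invocation might look like the sketch below. The flags follow Piper's training guide; the paths, the language code (nb, assuming espeak-ng's Norwegian Bokmål voice), and the batch size are placeholders you would adapt to your own setup:

```sh
# Preprocess an LJSpeech-format dataset (paths and language code are placeholders)
python3 -m piper_train.preprocess \
  --language nb \
  --input-dir /path/to/dataset_dir/ \
  --output-dir /path/to/training_dir/ \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050

# Train from scratch: same command as fine-tuning, just without
# --resume_from_checkpoint / --resume_from_single_speaker_checkpoint
python3 -m piper_train \
  --dataset-dir /path/to/training_dir/ \
  --accelerator gpu \
  --devices 1 \
  --batch-size 32 \
  --validation-split 0.0 \
  --num-test-examples 0 \
  --max_epochs 2000 \
  --checkpoint-epochs 1 \
  --precision 32
```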

BR

a-n-lundgren avatar Feb 23 '25 01:02 a-n-lundgren

Hi @a-n-lundgren

Thanks so much for the much-needed feedback! Are you on Discord? Could you look me up? "TheStigh". Would love to chat more about this.

TheStigh avatar Feb 23 '25 01:02 TheStigh

@a-n-lundgren do you have any progress worth sharing at this stage? The currently available Swedish model for piper has quite a few quirks to it.

I've followed Kungliga Biblioteket's (KB's) work in this space for quite some time, as they basically seem to have driven new training of Swedish models over the past few years. Judging by a reply on their Hugging Face page, they might have time to train a new piper model later this year.

They did recently release a large dataset of Swedish audio (rixvox). Perhaps it's useful for initial training before training a voice? Slightly off-topic for piper, but KB also released a set of whisper models trained on that dataset. The models perform really well. I put together a PoC to get them to work with the Wyoming protocol. I guess it could be integrated into faster-whisper with additional work.

AlexGustafsson avatar Feb 23 '25 20:02 AlexGustafsson

@TheStigh, sure, I'll contact you on Discord.

@AlexGustafsson, agreed, and that is what got me started. Nothing impressive to share yet, other than multiple ways how NOT to do stuff. ;) Current published progress can be found here: SubZeroAI/piper-swedish-tts-multispeaker

I'm currently working on a new Swedish multi-speaker model based on a combination of NST, Waxholm, Fleurs and Rixvox. I've had some issues getting the training to work; the dataset seems to be too large. The training fails with a segmentation fault at the end of the first epoch, and I'm currently troubleshooting that. I'm getting pretty good results on parts of the dataset and have been using it in Home Assistant for a while.

a-n-lundgren avatar Feb 23 '25 21:02 a-n-lundgren

Speaking as someone who hasn't trained ANY dataset using Piper yet, it sounds to me like you're running out of VRAM?

FrontierDK avatar Feb 24 '25 05:02 FrontierDK

It's a good guess, but I don't think that is the problem. I don't get the standard "torch.cuda.OutOfMemoryError: CUDA out of memory." but instead a plain "Segmentation fault". I don't know yet whether the segfault occurs in RAM or vRAM; I'm trying to find out how to work around it. Currently I'm ramping it up using --limit_train_batches to find where the limit is (see the sketch below).
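
A minimal sketch of that kind of ramp-up, assuming piper_train passes --limit_train_batches through to the PyTorch Lightning Trainer; paths and batch size are placeholders:

```sh
# Increase the fraction of training batches per epoch until the crash appears
for frac in 0.1 0.25 0.5 1.0; do
  echo "Trying limit_train_batches=$frac"
  python3 -m piper_train \
    --dataset-dir /path/to/training_dir/ \
    --accelerator gpu \
    --devices 1 \
    --batch-size 32 \
    --max_epochs 1 \
    --limit_train_batches "$frac" \
    || { echo "Crash at limit_train_batches=$frac"; break; }
done
```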

a-n-lundgren avatar Feb 24 '25 06:02 a-n-lundgren

@a-n-lundgren how much data are you using to train your models? What would be considered a reasonable length of audio recordings, both for training from scratch and for fine-tuning? @TheStigh how were your results?

SMRehan96 avatar Mar 05 '25 08:03 SMRehan96

@SMRehan96 I would say that you get a reasonable result, when training from scratch, with a couple of thousand audio files. In my first try I trained on roughly 8,000 files; I later extended it and am now training on 26,500 files, the full dataset. I'm using --validation-split 0.01 and --num-test-examples 100 so I can verify the progress directly in TensorBoard (see the sketch below).
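
A training invocation along those lines might look like the sketch below; the paths are placeholders, and lightning_logs is where PyTorch Lightning writes its event files by default:

```sh
# Hold out 1% for validation and render 100 test examples for TensorBoard
python3 -m piper_train \
  --dataset-dir /path/to/training_dir/ \
  --accelerator gpu \
  --devices 1 \
  --batch-size 32 \
  --validation-split 0.01 \
  --num-test-examples 100 \
  --max_epochs 2000 \
  --precision 32

# Watch loss curves and listen to the generated test audio
tensorboard --logdir /path/to/training_dir/lightning_logs
```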

The length of an audio file should not exceed 30 seconds for practical reasons; you will need more vRAM for longer audio files and for files with more phonemes. Also, keep the dataset small (<30,000 files), depending on your GPU.

For fine-tuning you don't need as much data, but how much depends on the model you are starting with and how much your new dataset deviates from the model. Sorry, this is a "how long is a piece of string" kind of question.

One epoch is one training cycle over the complete dataset. One step is a cycle over one batch, so the number of steps in one epoch is [number of files in dataset] / [batch size] = [number of steps]. The internal parameters are adjusted at each step (worked example below).
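
As a quick worked example with the numbers above (26,500 files, an assumed batch size of 32):

```sh
# 26,500 files / batch size 32 -> steps per epoch (integer division)
echo $(( 26500 / 32 ))   # 828 steps per epoch
```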

With the above definitions you could reason that a dataset with 200,000 files would give more internal adjustments per epoch, and thus better results per epoch, than a dataset with only 20,000 files. This does not seem to be true in my case; the larger dataset only consumes more vRAM, forcing a smaller batch size in training. Therefore I have divided my larger datasets into smaller sets that fit my amount of vRAM, which is 24 GB at the moment (see the sketch below).
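
A minimal sketch of that kind of split, assuming an LJSpeech-style metadata.csv; the shard size and file names are arbitrary choices:

```sh
# Shuffle, then split the transcript list into ~25,000-line shards
shuf metadata.csv > metadata_shuffled.csv
split -l 25000 -d metadata_shuffled.csv metadata_part_

# Each shard (metadata_part_00, metadata_part_01, ...) can then be
# preprocessed and trained on in turn, resuming from the previous checkpoint.
```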

a-n-lundgren avatar Mar 05 '25 10:03 a-n-lundgren

Oh man, I have an 800,000-file dataset, a mix of hundreds of speakers divided by k-means into male and female (Vietnamese). At epoch 5 it is intelligible, but terrible quality :). It took 2 days (1 GPU, 24 GB) for 5 epochs; geez, would it then take months for thousands of epochs? Will update.

Cheers, Steve

thusinh1969 avatar Apr 03 '25 01:04 thusinh1969