
Some observations after 690k steps

Open skol101 opened this issue 3 years ago • 33 comments

For unseen F to seen M conversion, the resulting pitch is very close to the source speaker's, especially if the source pitch is much higher than the seen M's pitch.

I've used the SR-based data augmentation step.

  1. Unseen F from LibriTTS test-clean 5142_36600_000006_000000.wav
  2. Seen speaker (p326) audio used as a reference audio

https://user-images.githubusercontent.com/53978091/211267486-b8551ae6-5f91-450e-8a4d-758b973b2c17.mp4

  3. Conversion result unseen F to seen M

https://user-images.githubusercontent.com/53978091/211268012-48e6d8ec-d845-49ec-bcda-fd2a01ed3f46.mp4

skol101 avatar Jan 09 '23 08:01 skol101

I tried some large-pitch-gap conversions, including the 5142-to-p326 conversion you mention, and some others like a high-pitched kid's voice to a low-pitched man's voice. I put the results here. So it seems that your issue is not caused by the large pitch gap 🤔.

OlaWod avatar Jan 09 '23 13:01 OlaWod

I've followed your training instructions to a T, with one slight difference: I didn't create test.txt in prepare_flist, though that shouldn't have any effect on training results.

skol101 avatar Jan 09 '23 15:01 skol101

Hmmm, weird. #21 also reports worse results than mine. Tomorrow I'll check the results of my checkpoint trained with data_utils.py to see if that is the problem. I thought it would give better performance than data_utils_old.py, because it loads more data per batch. Other than that I can't think of any other reason right now.

OlaWod avatar Jan 09 '23 16:01 OlaWod

Also, could the batch size of 64 be the issue? I understand that's to utilise as much of the RTX 3090's memory as possible.

skol101 avatar Jan 10 '23 06:01 skol101

Tested the checkpoint trained with data_utils.py, and it does have this conversion failure problem. Sorry for the bug, I'll update the code later. Though I can't figure out why this happens; it just loads larger amounts of data. Maybe I also messed something up when I cleaned up my code? Please ping me if the failure still exists, and I'll check my old dirty code to see if I missed something.

OlaWod avatar Jan 10 '23 08:01 OlaWod

I see, so it's not as simple as renaming data_utils_old.py to data_utils.py?

skol101 avatar Jan 10 '23 09:01 skol101

Tested again, and it does not have the conversion failure. In my first test I had just passed freevc-s.json to load freevc. But the resulting speech has some 'splitting up' in the voice. data_utils.py does have a problem: the concatenation deteriorates model performance. data_utils_old.py is from the code before my clean-up and has many old-version variable names. I'll update it later, after I make sure nothing is wrong. (Hope I don't mess up this time.)

OlaWod avatar Jan 10 '23 10:01 OlaWod

Oh yes, "splitting voice" is something I'm hearing , too, in the generated audio.

skol101 avatar Jan 10 '23 11:01 skol101

Can this model be fine-tuned on a custom dataset by continuing from a pre-trained checkpoint?

skol101 avatar Jan 10 '23 11:01 skol101

Yes. Btw code updated.

OlaWod avatar Jan 10 '23 11:01 OlaWod

Ok, I'll first try finetuning

skol101 avatar Jan 10 '23 11:01 skol101

I have continued from the pretrained model, but the logs say that the current step is only 130400, not 900k. I've downloaded the generator and discriminator from here: https://onedrive.live.com/?authkey=%21AOOs5nZpsLC4ECE&id=537643E55991EE7B%219178&cid=537643E55991EE7B

But the generator freevc.pth was last updated on 14/09/2022, whilst the discriminator was updated on 30/11/2022. Maybe an old version of the pre-trained generator was uploaded to OneDrive?

INFO:freevc:====> Epoch: 1372
INFO:freevc:Train Epoch: 1373 [63%]
INFO:freevc:[2.479205369949341, 2.2820231914520264, 11.992838859558105, 19.632217407226562, 2.451157331466675, 130400, 0.00016841507612184626]
 

skol101 avatar Jan 10 '23 12:01 skol101

  1. That means you did not use batch size 64 on 1 GPU, as I did. The displayed step depends on the batch size, the number of GPUs, etc. (see the sketch below).
  2. The displayed time is just when I dragged the files from the server to my personal computer; both the generator and the discriminator were created at the same time.
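
Roughly, for a fixed number of epochs the displayed step scales inversely with the effective batch size (batch size × number of GPUs). A back-of-the-envelope sketch; the numbers are illustrative assumptions, not values from the repo:

import math

# How the logged global step relates to batch size and GPU count.
# All numbers below are illustrative assumptions, not values from the FreeVC repo.
num_train_files = 40000   # size of the training file list (assumed)
batch_size = 64           # per-GPU batch size
num_gpus = 2              # DDP world size

# Each optimizer step consumes batch_size * num_gpus utterances, and the global
# step counter increments once per optimizer step, so:
steps_per_epoch = math.ceil(num_train_files / (batch_size * num_gpus))
print(f"{steps_per_epoch} steps per epoch")
# Doubling num_gpus (or batch_size) halves the step count for the same number of epochs.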

OlaWod avatar Jan 10 '23 13:01 OlaWod

I've started fine-tuning with 2 GPUs, but the same batch size. Good to know. I'll observe the tuning results.

skol101 avatar Jan 10 '23 13:01 skol101

I noticed that data_utils_old includes a speaker ID along with the audio data. I don't see any usage of speaker ID in the version of train that got committed; was it used in a previous version? Seems like that could be beneficial for informing conversion.

space-pope avatar Jan 10 '23 18:01 space-pope

I noticed that data_utils_old includes a speaker ID along with the audio data. I don't see any usage of speaker ID in the version of train that got committed; was it used in a previous version? Seems like that could be beneficial for informing conversion.

Nope, it's just an unused variable. As you can see, the model has nowhere to consume a speaker ID (it does not have an embedding table), so it's pointless to pass one. The whole thing goes like this: before I uploaded the code I needed to clean it up, because there were many confusing variable names, badly formatted filenames, useless code, etc. When cleaning up data_utils.py I thought it could be improved and wrote these changes. As it follows different logic from the old code I used, I also uploaded data_utils_old.py, just to declare that "if you get better results than my pretrained checkpoints, it's just because I improved the code". But, unfortunately, this 'improvement' does the opposite. 😓

OlaWod avatar Jan 11 '23 01:01 OlaWod

As can be seen the model has nowhere to consume speaker id (it does not have an embedding table), it's pointless to pass a speaker id.

Yeah, that makes sense. I can see that there's no embedding table in committed versions of the model; just wanted to make sure there wasn't one in a previous version (though I suppose if there were, my finetuning of the pretrained model you provided shouldn't have worked as well as it did).

Thanks for your responsiveness in all these issues; it's nice to see engagement with the code after release.

space-pope avatar Jan 11 '23 16:01 space-pope

I second what @space-pope said. Cheers @OlaWod !

skol101 avatar Jan 11 '23 16:01 skol101

Also thanks to all of you for your interest ^_^

OlaWod avatar Jan 12 '23 07:01 OlaWod

I've fine-tuned using the most recent commit and I can still hear the original voice (in the female-to-male conversion), ESPECIALLY at the beginning of words after silence. I.e., if there's silence before the first phoneme, then the pitch of the resulting phonemes will most likely be closer to the source than to the target.

Could it be that the custom dataset has long pauses in the wavs? I'll try re-training on 5-second chunks split on silence:

from pydub import AudioSegment
from pydub.silence import split_on_silence

sound = AudioSegment.from_file("input.wav")  # placeholder path

dBFS = sound.dBFS
chunks = split_on_silence(
    sound,
    # split on silences longer than 100 ms
    min_silence_len=100,

    # anything quieter than 16 dB below the clip's average loudness counts as silence
    silence_thresh=dBFS - 16,

    # keep 200 ms of leading/trailing silence around each chunk
    keep_silence=200,
)

skol101 avatar Jan 15 '23 09:01 skol101

It could be. The speaker encoder has no special design for the case where there is a lot of silence, so the speaker embedding of an utterance can be "averaged out" by silence. I mean, suppose a reference utterance is 001111000000001100 (0 denotes silence, 1 denotes speech). If you only pass 111111 to the speaker encoder, the resulting speaker embedding properly reflects the speaker's properties; if you only pass 0000000, the resulting embedding reflects the properties of silence; and if you pass 001111000000001100, the resulting embedding will be somewhere in between.
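
One possible workaround is to strip the silent regions from a reference utterance before it reaches the speaker encoder. A rough, untested sketch (librosa-based; get_speaker_embedding, the path, and the top_db threshold are placeholders, not part of the FreeVC code):

import numpy as np
import librosa

def load_speech_only(path, sr=16000, top_db=30):
    # Load the reference and keep only the non-silent intervals.
    wav, _ = librosa.load(path, sr=sr)
    intervals = librosa.effects.split(wav, top_db=top_db)  # (start, end) sample pairs
    if len(intervals) == 0:
        return wav
    return np.concatenate([wav[start:end] for start, end in intervals])

# ref = load_speech_only("p326_reference.wav")   # placeholder path
# emb = get_speaker_embedding(ref)               # placeholder for your speaker encoder call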

OlaWod avatar Jan 15 '23 13:01 OlaWod

@space-pope how are your test results, especially with unseen F to seen M conversion?

skol101 avatar Jan 15 '23 23:01 skol101

The first thing I tried was finetuning the model for a couple thousand steps with some different data. I got reasonable results, but nothing groundbreaking. Challenging cases like the one you mention were still challenging/not great.

Past that, it's not really fair to compare my results, as I've been porting the code to another framework and using it as a starting point to attempt addressing some of the corner cases that seem tough for all current VC models—sources and targets with large pitch differences like you mention, and the fact that target conversions retain the accent of the source speaker, when ideally it'd work the other way around.

I regret to report that I have not yet surpassed the state of the art :).

space-pope avatar Jan 16 '23 16:01 space-pope

fact that target conversions retain the accent of the source speaker, when ideally it'd work the other way around.

This is to be expected, as accents are not part of the voice, and no VC system will ever be able to change that. Accent is part of the content; what you are describing here is content transfer. So if you have an American source speaker and a British target, you would need several content models, one for each accent: when you say "tomato" in American English, the system would have to recognize that, pull the word "tomato" from the "British English" content model, and use that content embedding before performing the VC.

steven850 avatar Jan 18 '23 10:01 steven850

For large pitch differences, it's best to add F0 analysis to the VC. This also helps maintain pitch variance with natural speech vs. "read text".
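
For reference, a minimal sketch of that kind of F0 comparison between a source and a target utterance (librosa-based; the file names and pitch range are placeholders, not FreeVC code):

import numpy as np
import librosa

def median_f0(path, sr=16000):
    # Median F0 over voiced frames only, as a rough measure of a speaker's pitch.
    wav, _ = librosa.load(path, sr=sr)
    f0, voiced, _ = librosa.pyin(wav, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    return float(np.nanmedian(f0[voiced]))

# src_f0 = median_f0("source_F.wav")   # placeholder paths
# tgt_f0 = median_f0("target_M.wav")
# gap_in_semitones = 12 * np.log2(tgt_f0 / src_f0)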

steven850 avatar Jan 18 '23 10:01 steven850

accent is part of content.

I can tell that's the case here, but it strikes me as a strange definition of the word "content". Content should be what is spoken, not how. In other words, the same content spoken by different speakers will sound different due to the shape of their vocal tracts, emotion, idiolect, etc.; but lumping all those attributes into the definition of content feels over-broad.

Speech attributes often get entangled across a model, but, for example, a good multispeaker TTS system will have accent be more dependent on speaker ID than on the actual synthesis input. That's admittedly a bit of a stretch as a comparison since TTS systems receive a more explicit representation of content as input, but disentangling these things is something to shoot for.

space-pope avatar Jan 18 '23 17:01 space-pope

The VC replicates the voice, i.e. the shape of the vocal tract, to create the same formants as the target. Accent has nothing to do with the physical properties of the voice or the speaker; accent or dialect is simply pronunciation, which makes it content. So having separate content models for each accent would work, but the downside is the many models needed, as well as the system then only working with a single language. The way it works currently lets the system work with any language; it can also handle "vocal grunts" and some singing, and none of that would be possible with something that attempts to replicate a target's accent.

You would also need a much longer sample from the target to get something like that to work: a 5-second clip wouldn't cut it, you would need several minutes if not hours of the target speaker to get the correct pronunciations. You also start to lose control of the output the more data you try to pull from the target, so if the system relies on the target for pronunciation and accent, a side effect is that you can no longer control the output emotion, because it comes from the target and not the source. So you can't make the output sad or happy unless you also have sad or happy recordings of the target speaker. Any VC will still require you to do some acting, even if you implemented the system I mentioned and were OK with all of those limits; you still need to mimic the target's speech patterns and cadence. Might as well use a system that is more flexible and mimic the accent as well.

This is why you see other VC systems (commercial ones) also implement a text component, so a script is required to go with the recordings, and those systems are fine-tuned to a single target; they can't do many-to-many. They have a separate model you load for each target, and the system was heavily trained on hours' worth of data from that single target speaker. What you want will never work with a many-to-many model, or with just a short sample from the target.

steven850 avatar Jan 19 '23 05:01 steven850

@space-pope, all, I'm still having the output pitch issues when training a new model or trying to fine-tune the pre-trained one using the current code. Following some of your comments, it is not totally clear to me whether the code was already fixed or whether an older version may work better. Would someone mind clarifying this? I'm still getting a higher pitch range when converting female to male voices. Thanks in advance :-)

fervillamar avatar Dec 09 '23 02:12 fervillamar

If your dataset is sufficiently diverse, I think that kind of pitch issue is inevitable with FreeVC—the pretrained version might seem better because VCTK is less diverse than your data. WavLM simply encodes too much speaker identity for the bottleneck here to remove, and some of it likely ends up leaking through.
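
A quick way to sanity-check that kind of leakage is to compare the converted audio's speaker embedding against the source and the target. A rough sketch using resemblyzer as a stand-in speaker verification model (file names are placeholders):

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()
src = encoder.embed_utterance(preprocess_wav("source_F.wav"))
tgt = encoder.embed_utterance(preprocess_wav("target_M.wav"))
out = encoder.embed_utterance(preprocess_wav("converted.wav"))

# The embeddings are L2-normalized, so the dot product is the cosine similarity.
print("converted vs source:", float(np.dot(out, src)))
print("converted vs target:", float(np.dot(out, tgt)))  # ideally the larger of the two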

space-pope avatar Dec 11 '23 19:12 space-pope

Thanks Josh. I'm using VCTK; I just cannot replicate the pretrained model's performance. I'm using the VCTK data and the training configuration as described in the repository, but I still get this pitch issue that I don't get with the pretrained one. Did you try this?

fervillamar avatar Dec 11 '23 19:12 fervillamar