Unconditional synthesis

Open berkeleymalagon opened this issue 3 years ago • 5 comments

I'm running this command to generate unconditional samples:

python -m diffwave.inference --fast /path/to/model -o output.wav

I've trained for almost 4k epochs on 7k+ sounds. I seem to get the same sound (or a very similar one) regardless of training time.

I have not worked with DiffWave before. Any tips for debugging this?

Thanks

berkeleymalagon avatar Jun 28 '22 08:06 berkeleymalagon

For context, here are the params during inference in case there's anything obviously wrong with them:

model.params: {'batch_size': 16, 'learning_rate': 0.0002, 'max_grad_norm': None, 'sample_rate': 44100, 'n_mels': 80, 'n_fft': 1024, 'hop_samples': 256, 'crop_mel_frames': 62, 'residual_layers': 30, 'residual_channels': 64, 'dilation_cycle_length': 10, 'unconditional': True, 'noise_schedule': [0.0001, 0.0011183673469387756, 0.002136734693877551, 0.0031551020408163264, 0.004173469387755102, 0.005191836734693878, 0.006210204081632653, 0.007228571428571429, 0.008246938775510203, 0.009265306122448979, 0.010283673469387754, 0.01130204081632653, 0.012320408163265305, 0.013338775510204081, 0.014357142857142857, 0.015375510204081632, 0.016393877551020408, 0.017412244897959183, 0.01843061224489796, 0.019448979591836734, 0.02046734693877551, 0.021485714285714285, 0.02250408163265306, 0.023522448979591836, 0.02454081632653061, 0.025559183673469387, 0.026577551020408163, 0.027595918367346938, 0.028614285714285714, 0.02963265306122449, 0.030651020408163265, 0.031669387755102044, 0.03268775510204082, 0.033706122448979595, 0.03472448979591837, 0.035742857142857146, 0.03676122448979592, 0.0377795918367347, 0.03879795918367347, 0.03981632653061225, 0.04083469387755102, 0.0418530612244898, 0.042871428571428574, 0.04388979591836735, 0.044908163265306125, 0.0459265306122449, 0.046944897959183676, 0.04796326530612245, 0.04898163265306123, 0.05], 'inference_noise_schedule': [0.0001, 0.001, 0.01, 0.05, 0.2, 0.5], 'audio_len': 22051}

berkeleymalagon avatar Jun 28 '22 09:06 berkeleymalagon
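A quick way to sanity-check the dump above is to recompute a few derived quantities from it. The sketch below is only illustrative: it assumes the 50-step training schedule is a linear ramp from 1e-4 to 0.05 (the printed values are consistent with that) and that `--fast` selects `inference_noise_schedule`. The helper name `signal_retained` is hypothetical and not part of the diffwave package.

```python
import numpy as np

# Values copied from the model.params dump above.
sample_rate = 44100
audio_len = 22051
inference_noise_schedule = np.array([0.0001, 0.001, 0.01, 0.05, 0.2, 0.5])

# Assumption: the training schedule is a linear ramp, which matches the printed values.
train_noise_schedule = np.linspace(1e-4, 0.05, 50)

# Length of each generated clip in seconds.
clip_seconds = audio_len / sample_rate


def signal_retained(betas):
    """Return alpha-bar at the last step: the cumulative product of (1 - beta)."""
    return float(np.cumprod(1.0 - betas)[-1])


train_retained = signal_retained(train_noise_schedule)
fast_retained = signal_retained(inference_noise_schedule)

print(f"clip length: {clip_seconds:.3f} s")
print(f"alpha-bar after training schedule: {train_retained:.3f}")
print(f"alpha-bar after fast schedule:     {fast_retained:.3f}")
```

If the two final alpha-bar values differ a lot, comparing a run without `--fast` against the fast one may help separate a sampling problem from a training problem.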

I tried using this codebase in the past for SC09 unconditional generation and found that it does not work. An alternative implementation of DiffWave at philsyn/diffwave-unconditional did work. I've released an improved implementation of this at https://github.com/albertfgu/diffwave-sashimi

albertfgu avatar Jul 03 '22 17:07 albertfgu

@Andrechang Hi, using this repo on the SC09 dataset, my model generates silent waveforms. Have you succeeded in getting plausible sounds?

Rongjiehuang avatar Jul 19 '22 07:07 Rongjiehuang

It shouldn't output silent waveforms. When I trained it only briefly, it generated noisy audio.

Andrechang avatar Jul 27 '22 15:07 Andrechang

It seems that the DiffWave paper uses res_channel = 256 for unconditional speech synthesis (but this code uses residual_channels = 64), which would explain why we could not get reasonable sounds.

Rongjiehuang avatar Jul 27 '22 16:07 Rongjiehuang
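To get a feel for how much capacity separates 64 and 256 residual channels, here is a rough parameter count for the residual stack. This is a back-of-the-envelope sketch, not the repo's code: it assumes each unconditional residual layer has a kernel-3 dilated convolution (C to 2C), a linear projection of a 512-dimensional diffusion embedding to C, and a 1x1 output convolution (C to 2C). The function names and the 512 default are assumptions; the layer count of 30 comes from the params dump earlier in the thread.

```python
def residual_layer_params(channels, diffusion_dim=512, kernel_size=3):
    """Approximate weights + biases in one unconditional residual layer (assumed layout)."""
    # Dilated conv: channels -> 2 * channels, with the given kernel size.
    dilated_conv = kernel_size * channels * (2 * channels) + 2 * channels
    # Linear projection of the diffusion-step embedding down to `channels`.
    diffusion_proj = diffusion_dim * channels + channels
    # 1x1 conv producing the residual and skip outputs: channels -> 2 * channels.
    output_proj = channels * (2 * channels) + 2 * channels
    return dilated_conv + diffusion_proj + output_proj


def residual_stack_params(channels, layers=30):
    """Approximate parameter count of the whole residual stack."""
    return layers * residual_layer_params(channels)


small = residual_stack_params(64)
large = residual_stack_params(256)
print(f"64 channels:  ~{small / 1e6:.2f}M parameters")
print(f"256 channels: ~{large / 1e6:.2f}M parameters")
print(f"ratio: ~{large / small:.1f}x")
```

Under these assumptions the wider model is roughly an order of magnitude larger, so the channel width is a plausible thing to change first when comparing against the paper's setup.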