FFTNet About Generated Samples.

Hi, I used TensorFlow to implement FFTNet, with about the same network as yours. I trained it on the CMU ARCTIC dataset with only one speaker, and I found that the samples generated during training are very good.

But when I use synthesis mode, with [128]*receptive_field as the initial input plus the local condition, generating one sample at a time and feeding it back, the predicted sample is always 128 (in mu-law; that is, 0 in amplitude).
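For reference, here is a minimal sketch of why class 128 corresponds to silence. It assumes standard 8-bit mu-law companding with mu = 255 on audio scaled to [-1, 1]; the function names are hypothetical and not taken from either implementation in this thread.

```python
import numpy as np

MU = 255  # assumption: 8-bit mu-law, i.e. 256 classes numbered 0..255


def mulaw_encode(x, mu=MU):
    """Map float audio in [-1, 1] to integer classes in [0, mu]."""
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0)
    # Compress the amplitude logarithmically, keeping the sign.
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # Shift [-1, 1] to [0, mu] and round to the nearest class.
    return np.floor((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64)


def mulaw_decode(q, mu=MU):
    """Map integer classes in [0, mu] back to float audio in [-1, 1]."""
    y = 2.0 * np.asarray(q, dtype=np.float64) / mu - 1.0
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu


silence_class = int(mulaw_encode(0.0))         # zero amplitude
decoded_128 = float(mulaw_decode(128))         # very close to zero
```

So a model that keeps predicting 128 in free-running mode is effectively emitting silence, and an initial buffer of 128s is a silent context.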

However, if I feed the real audio sample as the input at each step and predict the next sample, I get good results.

Have you run into this?

Jul 16 '18 08:07 azraelkuan

Hi, I haven't run into this problem. In my earlier experiments (without conditional sampling), the quality of generated speech on the CMU dataset was similar to the first-column audio on the author's demo page.

How about using 2048 true samples as the initial samples to generate sentences? Does it still give the same predicted values? Did you check the probability of each generated sample? Besides, I guess the speech generated during training doesn't represent model performance; sometimes even an unconditional WaveNet can generate intelligible speech during training.

Jul 16 '18 12:07 syang1993

thanks

Jul 17 '18 01:07 azraelkuan

Now I can generate good wavs, but the generation speed is 130bit/s on a 1080 Ti. I also tested your code, which runs at about 140bit/s on a 1080 Ti. Thanks.

Jul 18 '18 02:07 azraelkuan

Hi, could you share your generated wavs? I found the quality is not as good as WaveNet. Using a buffer could speed up generation.

Jul 18 '18 03:07 syang1993

Right now my generated wavs are also not as good as WaveNet. My model is still training; when it finishes, I will share them.

Jul 18 '18 03:07 azraelkuan

Hi, I trained your model and found that the generated wav sounds good; it took two days to train about 200K steps. I also tried your parameters in my TF experiments, but I cannot generate wavs like yours, maybe because I didn't add the injected noise.

So I tried using mu-law as the input and adding an upsampling conv to the condition. This time I can generate a wav that sounds good at about 100K steps; I will train it to 200K.
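For readers unfamiliar with the term, "injected noise" here refers to adding small random noise to the input waveform during training so the model is less sensitive to its own prediction errors at synthesis time. Below is a rough sketch only: the Gaussian form, the standard deviation of 1/256 (one 8-bit quantization step), and the function name are all assumptions, not settings confirmed by anyone in this thread.

```python
import numpy as np


def inject_noise(wave, sigma=1.0 / 256, rng=None):
    """Add zero-mean Gaussian noise to float audio in [-1, 1].

    sigma is an assumed value (about one 8-bit quantization step).
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = wave + rng.normal(0.0, sigma, size=wave.shape)
    # Keep the result in the valid range before any mu-law encoding.
    return np.clip(noisy, -1.0, 1.0)


# Usage: perturb the network input only; the training target stays clean.
clean = np.sin(np.linspace(0.0, 20.0, 1000)) * 0.5
noisy_input = inject_noise(clean, rng=np.random.default_rng(0))
```

The idea is that the training target remains the clean next sample while only the input context is perturbed.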

fftnet.zip

Jul 20 '18 02:07 azraelkuan

I will also try the slt speaker; I think the clb speaker is not as good as slt.

Jul 20 '18 02:07 azraelkuan

@azraelkuan Thanks for sharing! I listened to the attached samples and found that the audio from your model has fewer artifacts but is a little noisy. Do you use sampling to pick each sample point? In my model, using sampling instead of argmax also makes the output noisy. I also inserted zero-padding into each FFT layer; the performance is better, but some noise remains. How about your experiments?

Jul 20 '18 06:07 syang1993

@syang1993 In contrast to your setting, when I use argmax, most of the predicted samples are 0, so I use random sampling, which works well. Also, I only add zero-padding at the first FFT layer. Maybe you can try using mu-law as the input instead of raw samples. This afternoon I added the injected noise to another experiment; I hope it reduces the noise.
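To make the argmax versus random sampling difference concrete, here is a small sketch. The logits are a made-up toy distribution over 256 mu-law classes with a slight peak at the silence class 128, and the helper names are hypothetical; neither comes from the models discussed here.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def pick_argmax(logits):
    """Always return the single most likely class."""
    return int(np.argmax(logits))


def pick_sample(logits, rng):
    """Draw a class at random according to the predicted distribution."""
    p = softmax(logits)
    return int(rng.choice(len(p), p=p))


# Toy output: nearly flat, with a slight preference for class 128.
logits = np.zeros(256)
logits[128] = 1.0

rng = np.random.default_rng(0)
argmax_picks = [pick_argmax(logits) for _ in range(200)]
sampled_picks = [pick_sample(logits, rng) for _ in range(200)]
```

With a distribution like this, argmax returns 128 every time, while sampling spreads over many classes. That matches the trade-off described in this thread: argmax can collapse to silence, and sampling avoids that at the cost of some noise.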

Jul 20 '18 06:07 azraelkuan

@azraelkuan You can try random sampling with the model trained by this repo; the quality is similar to yours.

The author said zeros are padded in each layer, and I found that works better. You can try it like:

I will try using mu-law input after I go back to school. Hope it works. : )

Jul 20 '18 06:07 syang1993

thanks, i will try it

Jul 20 '18 06:07 azraelkuan

Hi @azraelkuan, I've checked out your implementation of FFTNet and tried to train it on my own dataset. Training is super fast, but I ran into some problems when synthesizing: the speed is low and the output quality is not good. I am now trying to inject noise into my data and train the model again to see if it works better.

Since you have already gotten a pretty good result, can you tell me how fast it is when synthesizing wavs, and how does your model with noise injected perform?

Thanks and regards. max

Sep 29 '18 03:09 Maxxiey