Gary Wang
Note also that the paper's dataset is the Blizzard dataset, which has a lot of prosody variation in the reading, where the reader performs different character voices. This is why...
My apologies, I didn't see that you've already uploaded your alternative model. Thanks for including the architecture diagram; I'll give it a spin.
@fatchord Absolutely, your model converges much faster and is easier to train than all the WaveNet implementations; I can get pretty fast training even on a measly GTX 1060. I...
@fatchord I'm not sure if you've already done ablation studies, but I think your idea of providing the 1-D ResNet really helps the model converge quickly. Great work!
@fatchord I have some compute, so I'll run some higher-bit runs to see whether further training/tuning helps.
@fatchord Given that the alternative model predicts the bits directly (without splitting them into coarse/fine), I'm not sure whether a softmax over 4096 classes (for 12-bit audio)...
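To make the tradeoff concrete, here is a rough parameter-count comparison between the two output heads. This is only a back-of-the-envelope sketch; the hidden size of 512 is a hypothetical value, not taken from either repo.

```python
# Hypothetical hidden size of the layer feeding the output projection.
hidden = 512

# Coarse/fine split (as in the WaveRNN paper): two softmaxes,
# each over 2**8 = 256 classes, for 16-bit audio split into two bytes.
coarse_fine_params = 2 * (hidden * 256 + 256)

# Single softmax over all 12 bits at once: 2**12 = 4096 classes.
single_params = hidden * 4096 + 4096

print(coarse_fine_params)  # parameters in the split heads
print(single_params)       # parameters in the single large head
```

The single 4096-way head costs roughly 8x the output-layer parameters of the split heads, which is one reason the coarse/fine factorization is attractive at higher bit depths.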
@geneing did you play around with seq_len to see what effect it has? From my experience, longer seq_len degraded model performance. However, training with seq_len...
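For context, seq_len here controls the length of the randomly cropped training window. A minimal sketch of that cropping, assuming a hypothetical `hop_length` relating mel frames to wav samples (the function and names are illustrative, not from either repo):

```python
import random

def sample_training_window(wav, mel, seq_len, hop_length):
    """Randomly crop a fixed-length training window from a (wav, mel) pair.

    seq_len is the window length in wav samples; each mel frame is assumed
    to cover hop_length samples, so the mel window has seq_len // hop_length
    frames.
    """
    mel_frames = seq_len // hop_length
    max_start = len(mel) - mel_frames - 1          # leave room for the window
    mel_start = random.randint(0, max_start)
    mel_win = mel[mel_start : mel_start + mel_frames]
    wav_start = mel_start * hop_length
    wav_win = wav[wav_start : wav_start + seq_len]
    return wav_win, mel_win
```

A longer seq_len means fewer, longer windows per batch, which changes the effective gradient noise; that may be part of why performance shifts with it.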
@fatchord I also made a separate repo to refactor the code as well as add a few things. https://github.com/G-Wang/WaveRNN-Pytorch I made attempts at training a single beta distribution (similar...
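For readers unfamiliar with the beta-distribution output head: instead of a softmax over discrete levels, the network emits two unconstrained values that parameterize a Beta distribution, and the sample is rescaled to the audio range. A minimal sketch of the sampling step, assuming a softplus mapping to keep the parameters positive (the exact parameterization in the repo may differ):

```python
import math
import random

def sample_from_beta(raw_alpha, raw_beta):
    """Map two unconstrained network outputs to Beta parameters and sample.

    Softplus (plus a small epsilon) keeps alpha and beta strictly positive;
    the Beta sample in [0, 1] is then rescaled to the audio range [-1, 1].
    """
    alpha = math.log1p(math.exp(raw_alpha)) + 1e-4
    beta = math.log1p(math.exp(raw_beta)) + 1e-4
    x = random.betavariate(alpha, beta)   # sample in [0, 1]
    return 2.0 * x - 1.0                  # rescale to [-1, 1]
```

The appeal is a continuous, bounded output with only two parameters per sample, avoiding a large softmax entirely.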
Yes, I was thinking of splitting the mel with overlaps, so we give the model room to generate the right wav.
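The overlap-splitting idea can be sketched as follows; the chunking function here is illustrative, assuming the mel is a sequence of frames and the overlap gives each chunk extra left/right context:

```python
def split_with_overlap(mel_frames, chunk, overlap):
    """Split a sequence of mel frames into chunks of length `chunk`,
    where consecutive chunks share `overlap` frames of context."""
    hop = chunk - overlap
    chunks = []
    for start in range(0, max(len(mel_frames) - overlap, 1), hop):
        chunks.append(mel_frames[start:start + chunk])
    return chunks
```

Each chunk can then be synthesized independently (and in parallel as a batch), with the overlapping frames giving the model room at the boundaries.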
@fatchord @geneing Here are some samples for you. https://soundcloud.com/gary-wang-23/sets/wavernn-batch I've included samples for batch synthesis, where the speed is faster than real time (around 2 seconds to generate a 6...
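After batch synthesis, the generated chunks have to be stitched back together; one common approach is a linear crossfade over the overlap region. This is a generic sketch of that idea, not the stitching used in the repo:

```python
def crossfade_join(chunks, overlap):
    """Stitch generated wav chunks, linearly crossfading over `overlap`
    samples so chunk boundaries don't click."""
    out = list(chunks[0])
    for ch in chunks[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)                      # fade-in weight
            out[-overlap + i] = out[-overlap + i] * (1 - w) + ch[i] * w
        out.extend(ch[overlap:])
    return out
```

Because the chunks are generated in one batched forward pass, wall-clock synthesis time stays roughly constant as the utterance gets longer, which is what makes faster-than-real-time generation possible.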