Tacotron2 Issues with Inference and using a Custom Dataset

Open conceptofmind opened this issue 4 years ago • 0 comments

I believe I am running into an issue when training Tacotron2, both from scratch and from the pre-trained model.

I have collected 14 to 17 hours of pre-processed wav files of Obama speaking. Each file was initially normalized with ffmpeg-normalize and then resampled to the recommended 22050Hz.

I have ensured that:

The sampling rate of each wav file is 22050Hz
There is only a single speaker: Obama
The speech contains a variety of phonemes
Each audio file is split into segments of 10 seconds
None of the audio segments has silence at the beginning or end of the file
None of the audio segments contains long silences
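
A checklist like the one above can be verified programmatically rather than by spot-checking. The following is a minimal sketch, assuming the segments are uncompressed PCM wav files readable by Python's standard `wave` module. The function names `check_wav` and `check_directory` and the directory name `wavs` are hypothetical and not part of any Tacotron2 code. It covers only the sampling-rate and segment-length items, not the silence checks.

```python
import wave
from pathlib import Path

EXPECTED_RATE = 22050  # Hz, the sampling rate stated in the post
MAX_SECONDS = 10.0     # segment length stated in the post


def check_wav(path, expected_rate=EXPECTED_RATE, max_seconds=MAX_SECONDS):
    """Return a list of problems found in one PCM wav file (empty if none)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate  # seconds
    if rate != expected_rate:
        problems.append(f"sample rate {rate} != {expected_rate}")
    if duration > max_seconds:
        problems.append(f"duration {duration:.2f}s exceeds {max_seconds}s")
    return problems


def check_directory(wav_dir):
    """Map file name -> problems for every .wav in wav_dir that has any."""
    report = {}
    for path in sorted(Path(wav_dir).glob("*.wav")):
        problems = check_wav(path)
        if problems:
            report[path.name] = problems
    return report


if __name__ == "__main__":
    # "wavs" is a placeholder directory name (assumption).
    for name, problems in check_directory("wavs").items():
        print(name, "->", "; ".join(problems))
```

An empty report would confirm the first and fourth items; leading, trailing, and long internal silences would need a separate energy-based check.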

Here is a link to a drive containing the wav files for inspection:

https://drive.google.com/drive/folders/17RoPoNhcU6ovW0BBkONt3WEXf6ZvuUwF?usp=download

Here is a link to both of the formatted .txt files (train and val):

Train .txt file: https://drive.google.com/file/d/1dxTkagpAT43jP06QAeODWS92GmuqdPqz/view?usp=sharing

Validation .txt file: https://drive.google.com/file/d/1dtaHPWTFdXLM1QdOVb2V9H2a_VMKVWRg/view?usp=sharing

I formatted the .txt files in the same way as the LJSpeech dataset. I used wav2vec2.0 for transcriptions. I made sure that any spaces at the start and end of each transcription were removed, and that a period was added to the end of each transcript. Each entry is on its own line.
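
The transcript clean-up described above can be sketched as follows. This assumes a pipe-delimited `wav_path|transcript` line, in the spirit of LJSpeech's pipe-separated metadata; the exact column layout expected by a given Tacotron2 implementation may differ. The function names `clean_transcript` and `format_filelist_line` are hypothetical.

```python
def clean_transcript(text):
    """Strip surrounding whitespace and make sure the text ends with a period."""
    text = text.strip()
    if text and not text.endswith("."):
        text += "."
    return text


def format_filelist_line(wav_path, transcript):
    """Build one 'wav_path|transcript' line (assumed layout, see lead-in)."""
    return f"{wav_path}|{clean_transcript(transcript)}"


def build_filelist(pairs):
    """Join (wav_path, transcript) pairs into file content, one entry per line."""
    return "\n".join(format_filelist_line(p, t) for p, t in pairs) + "\n"


if __name__ == "__main__":
    # Made-up example entries, not taken from the actual dataset.
    example = [
        ("wavs/segment_0001.wav", "  thank you very much "),
        ("wavs/segment_0002.wav", "good evening everybody."),
    ]
    print(build_filelist(example), end="")
```

Note that this version only appends a period when one is missing, so running it twice on the same file does not produce doubled periods.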

The train.py script will run. The directory paths and naming conventions are correct.

This is what a graph of the training inference looks like at epochs 0, 50, and 100:

Epoch 0:

[Image: graph at epoch 0]