Please help me!
When I run this command: python train.py -c configs/ljs_base.json -m ljs_base, an error occurs like this:
[INFO] {'train': {'log_interval': 200, 'eval_interval': 1000, 'seed': 1234, 'epochs': 1, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 16, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': '/home/zhang/compatibility_analysis/vits/filelists/ljs_audio_text_train_filelist.txt.cleaned', 'validation_files': '/home/zhang/compatibility_analysis/vits/filelists/ljs_audio_text_val_filelist.txt.cleaned', 'text_cleaners': ['english_cleaners2'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 0, 'cleaned_text': True}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False}, 'model_dir': './logs/ljs_base'}
./logs/ljs_base/G_5.pth
[INFO] Loaded checkpoint './logs/ljs_base/G_5.pth' (iteration 1)
./logs/ljs_base/D_5.pth
[INFO] Loaded checkpoint './logs/ljs_base/D_5.pth' (iteration 1)
enumerate(train_loader)= <enumerate object at 0x7f6583451e10>
[INFO] Train Epoch: 1 [0%]
[INFO] [3.5478122234344482, 0.7687594294548035, 0.36579176783561707, 91.01992797851562, 1.5905554294586182, 147.99365234375, 0, 0.0002]
[INFO] Saving model and optimizer state at iteration 1 to ./logs/ljs_base/G_0.pth
[INFO] Saving model and optimizer state at iteration 1 to ./logs/ljs_base/D_0.pth
Traceback (most recent call last):
File "train.py", line 291, in
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/zhang/compatibility_analysis/vits/train.py", line 117, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "/home/zhang/compatibility_analysis/vits/train.py", line 138, in train_and_evaluate
for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths) in enumerate(train_loader):
File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in next
data = self._next_data()
File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 971, in _next_data
return self._process_data(data)
File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
data.reraise()
File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
EOFError: Caught EOFError in DataLoader worker process 1.
Original Traceback (most recent call last):
File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
data = fetcher.fetch(index)
File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/zhang/anaconda3/envs/vits/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in
I only modified "epochs": 1 in ljs_base.json and reduced the number of .wav files listed in ljs_audio_text_train_filelist.txt.cleaned, because if I run the train command on the entire dataset I get a CUDA out of memory error. Can you help me fix this problem?
Removing wavs won't help you with CUDA OOM (the mels are loaded into RAM first); you should reduce the batch size instead (it controls how many mels are loaded into GPU VRAM at once).
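For reference, the batch size is the "batch_size" value under "train" in the config you pass with -c. A minimal sketch of the edit, assuming the stock configs/ljs_base.json shown in the log above (editing the file by hand works just as well; 8 is only an example value, pick whatever fits your GPU):

```python
import json

CONFIG = "configs/ljs_base.json"  # the file passed to train.py with -c

# Load the existing hyperparameters, lower the batch size, and write them back.
with open(CONFIG) as f:
    hps = json.load(f)

hps["train"]["batch_size"] = 8  # default is 16; smaller = less GPU VRAM per step

with open(CONFIG, "w") as f:
    json.dump(hps, f, indent=2)
```

The point is that VRAM usage scales with batch_size, not with how many files are listed in the filelist.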
Update: it seems the "Ran out of input" error comes from RAM overload, caused by too many workers replicating the loaded audio in RAM. https://github.com/jaywalnut310/vits/pull/118/commits/1c6cd68b1287fad7782eec6d88012ea5ce09d614
Decreasing "num_workers" to 8 (config/model.json) worked for me; I was using 16. Thanks @nikich340
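For anyone wondering where to change this in the upstream repo: train.py does not read num_workers from the config, it passes the value directly to the training DataLoader inside run(). A rough sketch of that call with the worker count lowered (the exact argument list and value may differ in your copy; 4 is just an example):

```python
from torch.utils.data import DataLoader

# train_dataset, collate_fn and train_sampler are the objects already built in run();
# only num_workers changes here. Fewer workers means fewer processes holding a copy
# of the loaded audio in RAM, which is what leads to the EOFError / "Ran out of input".
train_loader = DataLoader(
    train_dataset,
    num_workers=4,       # lowered from 8 (or 16 if you raised it)
    shuffle=False,
    pin_memory=True,
    collate_fn=collate_fn,
    batch_sampler=train_sampler,
)
```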