Retrieval-based-Voice-Conversion-WebUI

CUDA error: out of memory with a batch size of 1 on an RTX 3090

Setmaster opened this issue 2 years ago · 4 comments

I'm trying to train with the Beta version, but I get the out-of-memory error no matter which settings I pick.

```
INFO:me-test:{'train': {'log_interval': 200, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0001, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 1, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 12800, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'max_wav_value': 32768.0, 'sampling_rate': 40000, 'filter_length': 2048, 'hop_length': 400, 'win_length': 2048, 'n_mel_channels': 125, 'mel_fmin': 0.0, 'mel_fmax': None, 'training_files': './logs/me-test/filelist.txt'}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [10, 10, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'use_spectral_norm': False, 'gin_channels': 256, 'spk_embed_dim': 109}, 'model_dir': './logs/me-test', 'experiment_dir': './logs/me-test', 'save_every_epoch': 5, 'name': 'me-test', 'total_epoch': 20, 'pretrainG': 'pretrained_v2/f0G40k.pth', 'pretrainD': 'pretrained_v2/f0D40k.pth', 'version': 'v2', 'gpus': '0', 'sample_rate': '40k', 'if_f0': 1, 'if_latest': 1, 'save_every_weights': '1', 'if_cache_data_in_gpu': 0}
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
gin_channels: 256 self.spk_embed_dim: 109
INFO:me-test:loaded pretrained pretrained_v2/f0G40k.pth pretrained_v2/f0D40k.pth
<All keys matched successfully>
<All keys matched successfully>
/mnt/c/users/john/documents/RVC-beta-v2-0528/venv/lib/python3.10/site-packages/torch/functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:862.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
/mnt/c/users/john/documents/RVC-beta-v2-0528/venv/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [64, 1, 4], strides() = [4, 1, 1] bucket_view.sizes() = [64, 1, 4], strides() = [4, 4, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
INFO:me-test:Train Epoch: 1 [0%]
INFO:me-test:[0, 0.0001]
INFO:me-test:loss_disc=4.307, loss_gen=3.831, loss_fm=14.001, loss_mel=20.521, loss_kl=5.731
DEBUG:matplotlib:matplotlib data path: /mnt/c/users/john/documents/RVC-beta-v2-0528/venv/lib/python3.10/site-packages/matplotlib/mpl-data
DEBUG:matplotlib:CONFIGDIR=/home/se7dev/.config/matplotlib
DEBUG:matplotlib:interactive is False
DEBUG:matplotlib:platform is linux
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
INFO:me-test:====> Epoch: 1 [2023-06-05 01:57:52] | (0:00:49.921940)
Process Process-1:
Traceback (most recent call last):
  File "/home/se7dev/miniconda3/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/se7dev/miniconda3/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/c/users/john/documents/RVC-beta-v2-0528/train_nsf_sim_cache_sid_load_pretrain.py", line 218, in run
    train_and_evaluate(
  File "/mnt/c/users/john/documents/RVC-beta-v2-0528/train_nsf_sim_cache_sid_load_pretrain.py", line 446, in train_and_evaluate
    scaler.scale(loss_gen_all).backward()
  File "/mnt/c/users/john/documents/RVC-beta-v2-0528/venv/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/mnt/c/users/john/documents/RVC-beta-v2-0528/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
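A generic PyTorch check, not part of the original report, that can help tell whether the 3090 actually has free memory before the backward pass runs. `torch.cuda.mem_get_info` and the allocator counters are standard PyTorch calls; the device index 0 is an assumption.

```python
# Generic diagnostic (not from the RVC codebase): print how much memory the GPU
# really has free, versus how much this process has allocated and reserved.
# Helps distinguish a driver/WSL memory limit from genuine exhaustion.
import torch

assert torch.cuda.is_available()
free_bytes, total_bytes = torch.cuda.mem_get_info(0)  # raw free/total bytes on GPU 0
print(f"free on device:   {free_bytes / 1e9:.2f} GB")
print(f"total on device:  {total_bytes / 1e9:.2f} GB")
print(f"allocated (this process): {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"reserved by allocator:    {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
```

Running this right before training (or after loading the pretrained weights) shows whether most of the 24 GB is already gone before the first backward pass.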

Setmaster · Jun 05 '23

V2 is just broken and barely works. Remove the current NVIDIA driver with DDU in safe mode, then install the latest NVIDIA driver and try again.

kazuviking2 · Jun 05 '23

Same here - I actually have 2 x 3090s.

sfingali · Jun 05 '23

> V2 is just broken and barely works. Remove the current NVIDIA driver with DDU in safe mode, then install the latest NVIDIA driver and try again.

Is there any easier way to fix it? I use Google Colab with Docker, and to launch it as a local runtime I'm basically forced to update the drivers every time I set up the local runtime.

413x1nkp · Jun 23 '23

You might want to set pin_memory to False if you still have OOM issues. https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/fad31f24f58fbbbe100dc7bcfc60f3d4e4f8a6bb/train_nsf_sim_cache_sid_load_pretrain.py#L130
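For context, the linked line is where the training DataLoader is built. Below is a minimal sketch of the suggested change using a standard torch.utils.data.DataLoader; the dataset and the other arguments are stand-ins so the example runs on its own, not the exact code from train_nsf_sim_cache_sid_load_pretrain.py.

```python
# Sketch of the pin_memory change, NOT the exact code at the linked line:
# RVC builds its loader from its own dataset/sampler/collate objects, so a
# dummy dataset is used here purely to make the example self-contained.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.randn(16, 4))  # stand-in for the RVC training dataset

train_loader = DataLoader(
    train_dataset,
    batch_size=1,
    shuffle=False,
    num_workers=0,
    pin_memory=False,  # the suggestion: flip this from True to False if OOM persists
    drop_last=True,
)

for (batch,) in train_loader:
    pass  # the training step would go here
```

With pin_memory=False the loader stops staging batches in page-locked host memory, which trades a bit of host-to-device transfer speed for lower memory pressure.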

tuapuikia · Jun 26 '23

Actually, it runs into OOM simply because this allocation asks for the last piece of available memory; the GPU is already essentially full by that point.
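If the failure really is just the last sliver of memory on an almost-full card, one generic mitigation (an assumption worth trying, not something suggested in this thread) is to cap how the PyTorch caching allocator splits blocks, which can reduce fragmentation. PYTORCH_CUDA_ALLOC_CONF and max_split_size_mb are documented PyTorch settings; putting the line at the top of the training script is just one way to apply it.

```python
# Assumption, not from this thread: limit allocator block splitting to reduce
# fragmentation-related OOM when the GPU is already almost full. The variable
# must be set before the process makes its first CUDA call.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after setting the variable so the allocator picks it up
```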

KyleCe · Aug 30 '23

This issue was closed because it has been inactive for 15 days since being marked as stale.

github-actions[bot] · Apr 28 '24