
Could you kindly provide a list of the environment configurations?

Open · leozjr opened this issue 1 year ago · 1 comment

Such as an environment.yaml file for Conda, if possible. It seems that there are some issues with my environment, preventing me from starting the training properly.

My environment (2080 Ti × 8):

python                    3.8.16
cudatoolkit               11.8.0
cudnn                     8.4.1.50
pytorch                   1.12.1
pytorch-gpu               1.12.1
torchlight                0.0.1
torchlights               0.4.0
torchvision               0.14.1
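For reference, a minimal Conda `environment.yml` consistent with the list above might look like the sketch below. This is a hypothetical file, not the authors' configuration; the channel choices are assumptions. Note two inconsistencies in the reported versions that can themselves cause CUDA-level failures: torchvision 0.14.1 is built against torch 1.13.1 (torch 1.12.1 pairs with torchvision 0.13.1), and the official pytorch 1.12.1 conda builds target CUDA 10.2/11.3/11.6, not cudatoolkit 11.8.

```yaml
# Hypothetical environment.yml sketch -- NOT the authors' file.
# Versions follow the issue report, except torchvision and cudatoolkit,
# which are pinned to the releases that actually match pytorch 1.12.1.
name: hsir
channels:
  - pytorch
  - conda-forge
dependencies:
  - python=3.8
  - pytorch=1.12.1
  - torchvision=0.13.1      # the release paired with torch 1.12.1
  - cudatoolkit=11.6        # 1.12.1 conda builds ship cudatoolkit <= 11.6
  - pip
  - pip:
      - torchlights==0.4.0  # as listed in the report above
```

Recreating the environment from a single consistent pin set like this would at least rule out a torch/torchvision/CUDA mismatch as the cause.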

I can train my other codebases normally, and the testing process here runs fine, but training hits what look like bugs:

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 111, in forward
    out = self.encoder(out, xs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 56, in forward
    x = self.layers[i](x)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 244, in forward
    r, _ = self.attn(inputs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 85, in forward
    q = torch.matmul(attn, v)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`

or

packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 246, in forward
    r = self.ffn(r)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 183, in forward
    x2 = x * torch.sigmoid(w)
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

or

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 111, in forward
    out = self.encoder(out, xs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/arch.py", line 56, in forward
    x = self.layers[i](x)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 246, in forward
    r = self.ffn(r)
  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/leo/Code/Denoiser/HSIR/hsir/model/hsdt/attention.py", line 178, in forward
    x = F.gelu(x)
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.

or

  File "/home/leo/.conda/envs/leo_Denoiser/lib/python3.8/site-packages/torch/nn/functional.py", line 2438, in batch_norm
    return torch.batch_norm(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

This should be unrelated to memory or batch size, since I still hit the issue even with the smallest model.
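Because CUDA kernel errors surface asynchronously, the tracebacks above may not point at the op that actually failed. A generic debugging step (not part of HSIR) is to force synchronous kernel launches before torch is imported, so the Python traceback lands on the real faulting line:

```python
# Debugging sketch: make CUDA kernel launches synchronous so the Python
# traceback points at the op that actually failed. These variables must
# be set before `import torch` (or exported on the shell command line).
# Only use them while debugging -- synchronous launches are slow.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# Optional: include C++ stack frames in PyTorch error messages.
os.environ.setdefault("TORCH_SHOW_CPP_STACKTRACES", "1")

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Equivalently, prefix the usual launch command: `CUDA_LAUNCH_BLOCKING=1 python <your training script>`.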

Could you kindly share your exact environment configuration? The problem might also be related to the torch and cuDNN versions.
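To check whether the failures are tied to one particular GPU or to the cuBLAS/cuDNN build rather than to HSIR itself, a small probe like the one below can help. It is a generic sketch that assumes only a working PyTorch-with-CUDA install; it runs the two kinds of ops that appear in the tracebacks (a batched matmul for cuBLAS and GELU for a pointwise kernel) on every visible device:

```python
# Per-GPU smoke test: run a batched matmul and a GELU on each visible
# device and report which devices fail. A single failing device points
# at hardware/driver trouble; all devices failing points at the
# torch/CUDA build.

def devices_to_probe(n_gpus):
    """Device strings for GPUs 0..n_gpus-1."""
    return [f"cuda:{i}" for i in range(n_gpus)]

def probe(device):
    import torch
    a = torch.randn(8, 64, 64, device=device)
    b = torch.randn(8, 64, 64, device=device)
    # .item() forces a device sync, so errors surface here, not later.
    torch.matmul(a, b).sum().item()           # strided-batched GEMM (cuBLAS)
    torch.nn.functional.gelu(a).sum().item()  # pointwise kernel

if __name__ == "__main__":
    try:
        import torch
        n = torch.cuda.device_count() if torch.cuda.is_available() else 0
    except ImportError:
        n = 0
    for dev in devices_to_probe(n):
        try:
            probe(dev)
            print(dev, "ok")
        except RuntimeError as exc:
            print(dev, "FAILED:", exc)
```

If every device passes in isolation but multi-GPU training still crashes, that narrows the suspects to the parallel path (DataParallel scatter/gather, sync_batchnorm) rather than the kernels themselves.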

Or it may stem from the use of sync_batchnorm.

leozjr · Mar 29 '24 11:03

Well, I found that with batch_size 4 on 4 GPUs it can run, but CUDA memory usage is only about 1/6 of capacity; if I increase batch_size to 8 or more, the error comes back.
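That memory pattern is consistent with how `nn.DataParallel` scatters the batch: each replica gets roughly `batch_size / n_gpus` samples, and when the batch is smaller than the GPU count some replicas receive no data at all. A tiny sketch of the split (approximating `torch.Tensor.chunk` semantics, which DataParallel's scatter uses; the helper name is made up) makes this concrete:

```python
import math

def dataparallel_chunks(batch_size, n_gpus):
    """Approximate per-replica batch sizes under nn.DataParallel.

    Mirrors torch.Tensor.chunk: the chunk size is ceil(batch/n_gpus),
    and fewer than n_gpus chunks are produced when the batch is small.
    """
    if batch_size == 0 or n_gpus == 0:
        return []
    size = math.ceil(batch_size / n_gpus)
    n_chunks = math.ceil(batch_size / size)
    return [min(size, batch_size - i * size) for i in range(n_chunks)]
```

With batch_size 4 on 8 GPUs, `dataparallel_chunks(4, 8)` gives `[1, 1, 1, 1]`: only four replicas receive any data, which matches the low memory use observed. With batch_size 8 all eight replicas run one sample each, so the crash at that point may be triggered only when every GPU (and sync_batchnorm replica) participates.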

leozjr · Mar 29 '24 12:03