
MIOpen Error: /MIOpen/src/tensor.cpp:67: Invalid length. Length must be greater than 0

Open · singagan opened this issue 2 years ago • 7 comments

Hi, while running an LSTM-based model with ROCm I get the following error, while with CUDA on an NVIDIA GPU it works fine. I checked that the tensor size is not 0 (it is torch.Size([2000, 1536, 128])). Do we not have support for this in ROCm at the moment?

MIOpen Error: /MIOpen/src/tensor.cpp:67: Invalid length. Length must be greater than 0.
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File " ~/tools/ont/bonito/multiprocessing.py", line 110, in run
    for item in self.iterator:
  File " ~/tools/ont/bonito/crf/basecall.py", line 69, in
    (read, compute_scores(model, batch, reverse=reverse)) for read, batch in batches
  File " ~/tools/ont/bonito/crf/basecall.py", line 34, in compute_scores
    scores = model(batch.to(dtype).to(device))
  File "/home/gagandee/env/rubicon/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File " ~/tools/ont/bonito/crf/model.py", line 179, in forward
    return self.encoder(x)
  File "/home/gagandee/env/rubicon/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/gagandee/env/rubicon/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/gagandee/env/rubicon/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File " ~/tools/ont/bonito/nn.py", line 247, in forward
    y, h = self.rnn(x)
  File "/home/gagandee/env/rubicon/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/gagandee/env/rubicon/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 774, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: miopenStatusUnknownError
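For reference, a minimal standalone reproducer might look roughly like the sketch below. This is only an assumption based on the shapes and call stack above; the hidden size, dtype, and bidirectional flag are guesses rather than the real bonito configuration.

import torch
import torch.nn as nn

# Hypothetical minimal repro; the input shape follows the report: (seq_len=2000, batch=1536, features=128).
# hidden_size=128, fp16 and bidirectional=True are assumptions, not the actual model settings.
device = torch.device("cuda")  # ROCm GPUs are exposed through the "cuda" device type in PyTorch
rnn = nn.LSTM(input_size=128, hidden_size=128, bidirectional=True).to(device).half()
x = torch.randn(2000, 1536, 128, device=device, dtype=torch.float16)  # (seq, batch, feature)
y, h = rnn(x)  # on ROCm this call raises RuntimeError: miopenStatusUnknownError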

singagan · Aug 27 '23 23:08

@singagan Please provide clear reproduction instructions for your problem: ROCm version, base OS version, GPU type, which software needs to be installed, etc. The optimal way is to provide a Docker image so our engineers can quickly reproduce the issue without affecting their test systems.

Without the above, we will be unable to help you.

/CC @junliume @JehandadKhan @shurale-nkn

atamazov · Aug 28 '23 14:08

@junliume https://github.com/ROCmSoftwarePlatform/MIOpen/labels/ON_HOLD

atamazov · Aug 28 '23 14:08

@atamazov Thanks for the quick reply. I will try to share a Docker image. Meanwhile, I am attaching the log I generated with MIOPEN_ENABLE_LOGGING enabled: run.log

If I reduce the batch size from 1536 to 768, I no longer see this error. There seems to be some issue with the buffer sizing.

HIP version: 5.4.22804-474e8620
AMD clang version: 15.0.0
OS: Ubuntu 20.04.3 LTS (Focal Fossa)
GPU: MI210

singagan · Aug 29 '23 09:08

@shurale-nkn could you also take a look at the log above?

junliume · Sep 26 '23 06:09

@singagan Thanks for providing the log and the additional information about the batch size, but I am afraid this is not enough for us to understand what is happening. If you can provide logs with more info, we can try again (but a Docker image + repro instructions would be better, of course). Recommended env settings for generating the log:

export MIOPEN_ENABLE_LOGGING=1 ;\
export MIOPEN_ENABLE_LOGGING_CMD=1 ;\
export MIOPEN_LOG_LEVEL=7 ;\
export MIOPEN_ENABLE_LOGGING_MPMT=1

Note: the log will be huge.
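For example, a run that captures the full log to a file could look like this (the script name is a placeholder for the actual command; MIOpen writes its log to stderr, so redirecting stderr should capture it):

export MIOPEN_ENABLE_LOGGING=1
export MIOPEN_ENABLE_LOGGING_CMD=1
export MIOPEN_LOG_LEVEL=7
export MIOPEN_ENABLE_LOGGING_MPMT=1
# "your_script.py" is a placeholder; redirect stderr to keep the (very large) log
python your_script.py 2> miopen_full.log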

atamazov · Sep 26 '23 13:09

> @shurale-nkn could you also take a look at the log above?

A month ago I discussed this problem with singagan in chat. After the requested logs with MIOPEN_ENABLE_LOGGING=1 were passed to me, I offered two options for how Gagandeep can quickly work around this problem: decrease batch_size or seq_length, or split the work and run the operation twice. The reason is that in this configuration MIOpen has to work with a buffer whose size is larger than the INT type can represent, and our implementation cannot handle that.
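As a rough back-of-the-envelope illustration (not MIOpen's actual workspace formula): an RNN workspace that scales with seq_len × batch × hidden × gates is already 2000 × 1536 × 128 × 4 ≈ 1.57e9 elements, i.e. about 3.1 GB in fp16, which no longer fits in a signed 32-bit byte count, while halving the batch to 768 brings it back under that limit. A hedged sketch of the "run it twice" option, splitting along the batch dimension in PyTorch (the helper name is hypothetical):

import torch

# Hypothetical helper: split the oversized batch, run the LSTM on each chunk,
# and concatenate the outputs. Batch items are independent, so the result matches
# a single full-batch call up to floating-point nondeterminism.
def lstm_in_chunks(rnn, x, chunks=2):
    # x has shape (seq_len, batch, features); dim=1 is the batch dimension
    outputs = [rnn(part)[0] for part in x.chunk(chunks, dim=1)]
    return torch.cat(outputs, dim=1)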

shurale-nkn · Sep 26 '23 14:09

@singagan Has this issue been resolved for you? If so, please close the ticket. Thanks!

ppanchad-amd · Apr 23 '24 17:04