
MIOpen Error: /MIOpen/src/tensor.cpp:67: Invalid length. Length must be greater than 0

Open · singagan opened this issue 2 years ago • 7 comments

Hi, while running an LSTM-based model with ROCm I get the following error, while with CUDA on an NVIDIA GPU it works fine. I checked that the tensor size is not 0 (it is torch.Size([2000, 1536, 128])). Do we not have support for this in ROCm at the moment?

MIOpen Error: /MIOpen/src/tensor.cpp:67: Invalid length. Length must be greater than 0.
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File " ~/tools/ont/bonito/multiprocessing.py", line 110, in run
    for item in self.iterator:
  File " ~/tools/ont/bonito/crf/basecall.py", line 69, in
    (read, compute_scores(model, batch, reverse=reverse)) for read, batch in batches
  File " ~/tools/ont/bonito/crf/basecall.py", line 34, in compute_scores
    scores = model(batch.to(dtype).to(device))
  File "/home/gagandee/env/rubicon/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File " ~/tools/ont/bonito/crf/model.py", line 179, in forward
    return self.encoder(x)
  File "/home/gagandee/env/rubicon/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/gagandee/env/rubicon/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/gagandee/env/rubicon/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File " ~/tools/ont/bonito/nn.py", line 247, in forward
    y, h = self.rnn(x)
  File "/home/gagandee/env/rubicon/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/gagandee/env/rubicon/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 774, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: miopenStatusUnknownError
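For reference, a minimal standalone reproducer might look roughly like the sketch below. This is only an assumption based on the shapes and call stack above; the hidden size, dtype, and bidirectional flag are guesses rather than the real bonito configuration.

import torch
import torch.nn as nn

# Hypothetical minimal repro; the input shape follows the report: (seq_len=2000, batch=1536, features=128).
# hidden_size=128, fp16 and bidirectional=True are assumptions, not the actual model settings.
device = torch.device("cuda")  # ROCm GPUs are exposed through the "cuda" device type in PyTorch
rnn = nn.LSTM(input_size=128, hidden_size=128, bidirectional=True).to(device).half()
x = torch.randn(2000, 1536, 128, device=device, dtype=torch.float16)  # (seq, batch, feature)
y, h = rnn(x)  # on ROCm this call raises RuntimeError: miopenStatusUnknownError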

singagan · Aug 27 '23 23:08

@singagan Please provide clear reproduction instructions for your problem: ROCm version, base OS version, GPU type, which software needs to be installed, etc. The optimal way is to provide a Docker image so our engineers can quickly reproduce the issue without affecting their test systems.

Without the above, we will be unable to help you.

/CC @junliume @JehandadKhan @shurale-nkn

atamazov · Aug 28 '23 14:08

@junliume https://github.com/ROCmSoftwarePlatform/MIOpen/labels/ON_HOLD

atamazov · Aug 28 '23 14:08

@atamazov Thanks for the quick reply. I will try to share a Docker image. Meanwhile, I am attaching the log I generated with MIOPEN_ENABLE_LOGGING enabled: run.log

If I reduce the batch size from 1536 to 768, I no longer see this error. There seems to be some issue with the buffer sizing.

HIP version: 5.4.22804-474e8620
AMD clang version: 15.0.0
OS: Ubuntu 20.04.3 LTS (Focal Fossa)
GPU: MI210

singagan · Aug 29 '23 09:08

@shurale-nkn could you also take a look at the log above?

junliume · Sep 26 '23 06:09

@singagan Thanks for providing the log and the additional information about the batch size, but I am afraid this is not enough for us to understand what is happening. If you can provide logs with more info, we can try again (but a Docker image + repro instructions would be better, of course). Recommended env settings for generating the log:

export MIOPEN_ENABLE_LOGGING=1 ;\
export MIOPEN_ENABLE_LOGGING_CMD=1 ;\
export MIOPEN_LOG_LEVEL=7 ;\
export MIOPEN_ENABLE_LOGGING_MPMT=1

Note: the log will be huge.
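For example, a run that captures the full log to a file could look like this (the script name is a placeholder for the actual command; MIOpen writes its log to stderr, so redirecting stderr should capture it):

export MIOPEN_ENABLE_LOGGING=1
export MIOPEN_ENABLE_LOGGING_CMD=1
export MIOPEN_LOG_LEVEL=7
export MIOPEN_ENABLE_LOGGING_MPMT=1
# "your_script.py" is a placeholder; redirect stderr to keep the (very large) log
python your_script.py 2> miopen_full.log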

atamazov · Sep 26 '23 13:09

> @shurale-nkn could you also take a look at the log above?

A month ago I discussed this problem with singagan in chat. After the requested logs with MIOPEN_ENABLE_LOGGING=1 were passed to me, I offered two options for how Gagandeep can quickly work around this problem: decrease batch_size or seq_length, or split the work and run the operation twice. The reason is that in this configuration MIOpen has to work with a buffer whose size is larger than the INT type can represent, and our implementation cannot handle that.
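As a rough back-of-the-envelope illustration (not MIOpen's actual workspace formula): an RNN workspace that scales with seq_len × batch × hidden × gates is already 2000 × 1536 × 128 × 4 ≈ 1.57e9 elements, i.e. about 3.1 GB in fp16, which no longer fits in a signed 32-bit byte count, while halving the batch to 768 brings it back under that limit. A hedged sketch of the "run it twice" option, splitting along the batch dimension in PyTorch (the helper name is hypothetical):

import torch

# Hypothetical helper: split the oversized batch, run the LSTM on each chunk,
# and concatenate the outputs. Batch items are independent, so the result matches
# a single full-batch call up to floating-point nondeterminism.
def lstm_in_chunks(rnn, x, chunks=2):
    # x has shape (seq_len, batch, features); dim=1 is the batch dimension
    outputs = [rnn(part)[0] for part in x.chunk(chunks, dim=1)]
    return torch.cat(outputs, dim=1)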

shurale-nkn · Sep 26 '23 14:09

@singagan Has this issue been resolved for you? If so, please close the ticket. Thanks!

ppanchad-amd · Apr 23 '24 17:04