torch.unique crashes on GPU
🐛 Bug
Calling torch.unique on a GPU tensor with any argument beyond the tensor itself (for example dim or return_counts) fails:
In [5]: torch.unique(torch.arange(10).view(2,5).cuda(), dim=1)
[1] 17003 abort (core dumped) ipython
In [2]: torch.unique(torch.arange(10).view(2,5).cuda(), dim=1, return_counts=True)
...
RuntimeError: unique_by_key failed on 2nd step: hipErrorInvalidDeviceFunction
In [28]: torch.unique( preds['dep'][0][:,1:],dim=-1)
Memory access fault by GPU node-1 (Agent handle: 0x564cdae2ed90) on address 0x7fb309004000. Reason: Page not present or supervisor privilege.
[1] 15313 abort (core dumped)
With only the tensor argument it works:
In [1]: torch.unique( preds['dep'][0][:,1:].float())
Out[1]:
tensor([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13.,
14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26., 27.,
28., 30., 31., 32., 33., 34., 35., 36.], device='cuda:0')
To Reproduce
Steps to reproduce the behavior:
import torch
torch.unique(torch.arange(10).view(2,5).cuda(), dim=1)
Expected behavior
No crash: the call should return the unique slices along the given dimension.
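For reference, the result expected from the reproduction call can be sketched without a GPU or PyTorch. The helper below is a hypothetical pure-Python stand-in (the name unique_columns is not part of any library) that mimics what dim=1 is meant to do on a 2-D input: keep each distinct column once, in sorted order.

```python
def unique_columns(rows):
    """Return the distinct columns of a 2-D list, sorted, in row-major layout."""
    # zip(*rows) transposes the row lists into column tuples.
    columns = sorted(set(zip(*rows)))
    # Transpose back so the result has the same layout as the input.
    return [list(row) for row in zip(*columns)]


# Same values as torch.arange(10).view(2, 5)
x = [list(range(0, 5)), list(range(5, 10))]
print(unique_columns(x))  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```

Every column of the reproduction tensor is already distinct and sorted, so a working torch.unique(..., dim=1) should hand the input back unchanged.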
Environment
PyTorch version: 1.6.0a0+2a460c0
Is debug build: No
CUDA used to build PyTorch: Could not collect
OS: Linux Mint 19.1 Tessa
GCC version: (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
CMake version: version 3.17.2
Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.18.4
[pip3] numpydoc==0.9.2
[pip3] pytorch-lamb==1.0.0
[pip3] pytorch-lightning==0.8.0
[pip3] pytorch-pretrained-bert==0.6.2
[pip3] pytorch-transformers==1.1.0
[pip3] torch==1.6.0a0+2a460c0
[pip3] torchvision==0.6.0
[conda] Could not collect
Additional context
GPU: Radeon VII
ROCm version: 3.5.1
Hi @twuebi, could you try it in the ROCm 3.7 Docker container? https://hub.docker.com/r/rocm/pytorch/tags If the issue remains, we can take it from there.
$ docker pull rocm/pytorch:rocm3.7_ubuntu18.04_py3.6_pytorch
$ alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME:/data'
$ drun rocm/pytorch:rocm3.7_ubuntu18.04_py3.6_pytorch
$ python
>>> import torch
>>> torch.unique(torch.arange(10).view(2,5).cuda(), dim=1)
:0:rocdevice.cpp :2159: 9831975733 us: Device::callbackQueue aborting with status: 0x29
Aborted (core dumped)
Hi @twuebi, thank you for reporting this issue! We've been able to reproduce it locally and will update when we have a solution.
The fix should be available when ROCm 4.1 is released. If you need it sooner, you would have to build ROCm PyTorch from source with the latest rocPRIM develop branch.
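Until a fixed build is installed, one possible stopgap is to run the call on a CPU copy and move the results back. Nobody in this thread proposed it and it has not been tested on the affected ROCm setup; the helper name unique_via_cpu is hypothetical, and the approach assumes the CPU implementation of torch.unique is unaffected by the bug.

```python
import torch


def unique_via_cpu(x, **kwargs):
    """Run torch.unique on a CPU copy of x and move every result back to x.device.

    Hypothetical workaround sketch, not a PyTorch API.
    """
    out = torch.unique(x.cpu(), **kwargs)
    # torch.unique returns a tuple when return_inverse or return_counts is set.
    if isinstance(out, tuple):
        return tuple(t.to(x.device) for t in out)
    return out.to(x.device)


# Falls back to the CPU when no GPU is visible, so the sketch runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.arange(10).view(2, 5).to(device)
values, counts = unique_via_cpu(x, dim=1, return_counts=True)
print(values)
print(counts)
```

The round trip costs a device-to-host copy and a copy back, so it is only reasonable for small tensors or code paths that are not performance-critical.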