
torch.unique crashes on GPU

Open twuebi opened this issue 5 years ago • 4 comments

🐛 Bug

Calling torch.unique on a GPU tensor with any additional argument, such as dim, fails:

In [5]: torch.unique(torch.arange(10).view(2,5).cuda(), dim=1)
[1]    17003 abort (core dumped)  ipython

In [2]: torch.unique(torch.arange(10).view(2,5).cuda(), dim=1, return_counts=True)
...
RuntimeError: unique_by_key failed on 2nd step: hipErrorInvalidDeviceFunction

In [28]: torch.unique(preds['dep'][0][:,1:], dim=-1)
Memory access fault by GPU node-1 (Agent handle: 0x564cdae2ed90) on address 0x7fb309004000. Reason: Page not present or supervisor privilege.
[1]    15313 abort (core dumped)

With a single argument it works:

In [1]: torch.unique( preds['dep'][0][:,1:].float())                                                                                                          
Out[1]: 
tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13.,
        14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26., 27.,
        28., 30., 31., 32., 33., 34., 35., 36.], device='cuda:0')

To Reproduce

Steps to reproduce the behavior:

torch.unique(torch.arange(10).view(2,5).cuda(), dim=1)

Expected behavior

No crashes.
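For reference, the result the reproduction call should produce can be checked on the CPU. The snippet below is only a sketch and is not taken from this thread: `unique_via_cpu` is a hypothetical helper name, and it assumes that running `torch.unique` on a CPU copy and moving the result back to the original device is acceptable for the use case.

```python
import torch


def unique_via_cpu(x, dim=None, return_counts=False):
    """Hypothetical helper: run torch.unique on a CPU copy, then move results back."""
    out = torch.unique(x.cpu(), dim=dim, return_counts=return_counts)
    if return_counts:
        values, counts = out
        return values.to(x.device), counts.to(x.device)
    return out.to(x.device)


# Same input as the reproduction step, kept on the CPU here.
# All five columns are distinct, so the unique columns equal the input.
x = torch.arange(10).view(2, 5)
values, counts = unique_via_cpu(x, dim=1, return_counts=True)
print(values)
print(counts)
```

On an affected setup, passing a `.cuda()` tensor to this helper would avoid the GPU `unique` kernel at the cost of a device-to-host copy.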

Environment

PyTorch version: 1.6.0a0+2a460c0
Is debug build: No
CUDA used to build PyTorch: Could not collect

OS: Linux Mint 19.1 Tessa
GCC version: (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
CMake version: version 3.17.2

Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.18.4
[pip3] numpydoc==0.9.2
[pip3] pytorch-lamb==1.0.0
[pip3] pytorch-lightning==0.8.0
[pip3] pytorch-pretrained-bert==0.6.2
[pip3] pytorch-transformers==1.1.0
[pip3] torch==1.6.0a0+2a460c0
[pip3] torchvision==0.6.0
[conda] Could not collect

Additional context

GPU: Radeon VII
ROCm version: 3.5.1

twuebi avatar Jul 12 '20 11:07 twuebi

Hi @twuebi, could you try it in the ROCm 3.7 Docker container? https://hub.docker.com/r/rocm/pytorch/tags If the issue remains, we can take it from there.

sunway513 avatar Aug 25 '20 21:08 sunway513

$ docker pull rocm/pytorch:rocm3.7_ubuntu18.04_py3.6_pytorch
$ alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME:/data'

$ drun  rocm/pytorch:rocm3.7_ubuntu18.04_py3.6_pytorch
$ python
>>> import torch
>>> torch.unique(torch.arange(10).view(2,5).cuda(), dim=1)    
:0:rocdevice.cpp            :2159: 9831975733 us: Device::callbackQueue aborting with status: 0x29
Aborted (core dumped)

twuebi avatar Aug 31 '20 13:08 twuebi

Hi @twuebi, thank you for reporting this issue! We've been able to reproduce it locally and will update when we have a solution.

sunway513 avatar Sep 01 '20 15:09 sunway513

A fix should be available when ROCm 4.1 is released. If you need a fix sooner than that, you would have to build ROCm PyTorch from source with the latest rocPRIM develop branch.

jeffdaily avatar Jan 22 '21 18:01 jeffdaily