RuntimeError: CUDA error: an illegal instruction was encountered when running test.py

Open MekkCyber opened this issue 1 year ago • 2 comments

Hello, when running python test.py I get the following errors:

=====================================
ERROR: test_groups (__main__.Test)

Traceback (most recent call last):
  File "/fsx/mohamed/dev/marlin/test.py", line 155, in test_groups
    self.run_problem(m, n, k, *thread_shape, groupsize)
  File "/fsx/mohamed/dev/marlin/test.py", line 66, in run_problem
    torch.cuda.synchronize()
  File "/admin/home/mohamed_mekkouri/miniconda3/envs/exp/lib/python3.10/site-packages/torch/cuda/__init__.py", line 792, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal instruction was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

=======================================
ERROR: test_k_stages_divisibility (__main__.Test)

Traceback (most recent call last):
  File "/fsx/mohamed/dev/marlin/test.py", line 80, in test_k_stages_divisibility
    self.run_problem(16, 2 * 256, k, 64, 256)
  File "/fsx/mohamed/dev/marlin/test.py", line 60, in run_problem
    A = torch.randn((m, k), dtype=torch.half, device=DEV)
RuntimeError: CUDA error: an illegal instruction was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

========================================
ERROR: test_tiles (__main__.Test)

Traceback (most recent call last):
  File "/fsx/mohamed/dev/marlin/test.py", line 75, in test_tiles
    self.run_problem(m, 2 * 256, 1024, thread_k, thread_n)
  File "/fsx/mohamed/dev/marlin/test.py", line 60, in run_problem
    A = torch.randn((m, k), dtype=torch.half, device=DEV)
RuntimeError: CUDA error: an illegal instruction was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

===========================================
ERROR: test_very_few_stages (__main__.Test)

Traceback (most recent call last):
  File "/fsx/mohamed/dev/marlin/test.py", line 85, in test_very_few_stages
    self.run_problem(16, 2 * 256, k, 64, 256)
  File "/fsx/mohamed/dev/marlin/test.py", line 60, in run_problem
    A = torch.randn((m, k), dtype=torch.half, device=DEV)
RuntimeError: CUDA error: an illegal instruction was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.


Ran 6 tests in 0.794s

FAILED (errors=4)

The stack I am using: Python 3.10.14, torch 2.3.1, CUDA 12.1 (cuda_12.1.r12.1), compute_cap 9.0.

MekkCyber, Jul 02 '24 14:07

It looks like you are on Hopper, given compute_cap 9.0. There is a known issue with Marlin on Hopper GPUs.

mgoin, Jul 16 '24 13:07

Yes, it's Hopper, thank you!

MekkCyber, Jul 18 '24 16:07
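
For anyone landing here with the same traceback: the compute-capability check discussed in this thread can be sketched in a few lines of Python. This is only an illustrative sketch, not part of Marlin. The helper names (`describe_capability`, `is_hopper`, `current_capability`) are made up for this example, and the architecture table lists just a few well-known values. The only PyTorch calls used are `torch.cuda.is_available()` and `torch.cuda.get_device_capability()`.

```python
from typing import Optional, Tuple

# Partial, illustrative mapping from (major, minor) compute capability
# to NVIDIA architecture name. Not exhaustive.
KNOWN_ARCHITECTURES = {
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada Lovelace",
    (9, 0): "Hopper",
}


def describe_capability(capability: Tuple[int, int]) -> str:
    """Return an architecture name, or an sm_XY string if it is not in the table."""
    major, minor = capability
    return KNOWN_ARCHITECTURES.get((major, minor), f"sm_{major}{minor}")


def is_hopper(capability: Tuple[int, int]) -> bool:
    """Hypothetical helper: True for compute capability 9.0, as reported in this issue."""
    return tuple(capability) == (9, 0)


def current_capability() -> Optional[Tuple[int, int]]:
    """Return the capability of the current CUDA device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    cap = current_capability()
    if cap is None:
        print("No CUDA device (or PyTorch) available.")
    else:
        print(f"compute capability {cap[0]}.{cap[1]} ({describe_capability(cap)})")
        if is_hopper(cap):
            print("Hopper GPU detected: see the known Marlin issue mentioned above.")
```

Running it on the machine that fails the tests shows whether the device reports 9.0 before spending time on the traceback itself.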