AMDGPU.jl

ROCm/HIP not downloading (?) when ]added

Open cgeoga opened this issue 3 years ago • 2 comments

Apologies in advance if I have missed an issue or some documentation about this.

When I run using AMDGPU and then x = ROCArray(rand(10)), I get a crash with the error "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!" (I can supply the full stacktrace if useful). My GPU is an RX 6600, which is not officially supported by ROCm. Is that the reason for the error? Considering how quickly ]build AMDGPU runs for me, I can't help but wonder if some artifact/jll is not getting downloaded. I would appreciate any advice on this.

As more background information, ]add and ]build run more or less instantly with no output, and using gives the output WARNING: using rocRAND_jll.librocrand in module AMDGPU conflicts with an existing identifier. When I ]test AMDGPU, it hangs forever with no output as well, presumably because this test for throwing an error is somehow failing (?).

I don't expect anybody to help me fully debug this, and I know that my GPU isn't even really supported by ROCm, but if anybody has any quick thoughts on things to try I would definitely be curious. It has been disappointingly hard to write literally any code in any language that uses this GPU, and this package seems like by far the easiest way in. Which is pretty cool.

cgeoga avatar May 06 '22 15:05 cgeoga

This hipErrorNoBinaryForGpu: Unable to find code object for all current devices! error usually means that you need a newer version of HIP. If you have ROCm installed on your system and it's newer than 4.2 (which is what AMDGPU.jl currently provides as artifacts), you can set the environment variable JULIA_AMDGPU_DISABLE_ARTIFACTS=1 and re-build AMDGPU.
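
A minimal sketch of that workaround from a shell, assuming julia is on the PATH, a system ROCm newer than 4.2 is installed, and AMDGPU is already added to the default Julia environment (the command -v guard and the echo messages are illustrative, not part of AMDGPU.jl):

```shell
# Tell AMDGPU.jl to skip its bundled ROCm artifacts and use the system install.
export JULIA_AMDGPU_DISABLE_ARTIFACTS=1

# Re-run the package build step so it picks up the system libraries.
# Assumes AMDGPU is already added to the default Julia environment.
if command -v julia >/dev/null 2>&1; then
    julia -e 'using Pkg; Pkg.build("AMDGPU")' || echo "AMDGPU build failed" >&2
else
    echo "julia not found on PATH; skipping rebuild" >&2
fi
```

The variable has to be exported before Julia starts, since the build step only sees the environment it inherits from the shell.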

> using gives the output WARNING: using rocRAND_jll.librocrand in module AMDGPU conflicts with an existing identifier

This is a known bug that I haven't gotten around to fixing, but it should be mostly harmless if you're not using AMDGPU.rand et al.

> When I ]test AMDGPU, it hangs forever with no output as well,

That's slightly concerning; I'd be interested to see if getting HIP working fixes that.

jpsamaroo avatar May 12 '22 13:05 jpsamaroo

Thanks so much for the response! I think this was my misunderstanding then. From the documentation I did have the impression I had to bring my own ROCm and HIP, but from talking to some other people I was told that this package downloads and installs those things for me. I'm on Fedora Linux and actually have not been able to install ROCm and HIP (even though, confusingly, there are packages in the official repo called rocm-runtime, for example), and I don't really like any of the distribution options that seem to be available, like using AMD's own RHEL repo, which seems pretty janky. So for the moment I don't think I'll try to resolve this by polluting my system with RHEL or random copr repos.

I won't close it in case you'd rather leave it open, but it sounds like this is a me problem and not an AMDGPU.jl problem, so to me it seems plenty sensible to not mark it as a bug and close it.

cgeoga avatar May 12 '22 13:05 cgeoga

We now have better support for gfx103x GPUs other than gfx1030, although we now mostly require a system-wide installation of ROCm for full support.

pxl-th avatar Dec 02 '23 18:12 pxl-th