Support for ROCM 6
ROCm 5.6 seems to kind of work, but it takes too much back and forth to get everything working. The new Fedora 40 brings official ROCm support, but only starting with ROCm 6.
I am using this config from https://github.com/elixir-nx/xla/issues/63:
Mix.install(
  [
    {:web_driver_client, "~> 0.2.0"},
    {:kino, "~> 0.12.3"},
    {:req, "~> 0.4.14"},
    {:erlexec, "~> 2.0"},
    {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
    {:exla, github: "elixir-nx/nx", sparse: "exla", override: true}
  ],
  system_env: %{
    "XLA_ARCHIVE_URL" =>
      "https://static.jonatanklosko.com/builds/0.6.0/xla_extension-x86_64-linux-gnu-rocm.tar.gz",
    "ROCM_PATH" => "/usr/lib64/rocm/"
  },
  config: [nx: [default_backend: {EXLA.Backend, client: :host}]]
)
I managed to find every package it was asking for (this took a while of back and forth) until I reached this:
18:36:37.767 [warning] The on_load function for module Elixir.EXLA.NIF returned:
{:error,
{:load_failed,
~c"Failed to load NIF library /home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/f3927a87654a1bf097d7e31b6277a9f8/_build/dev/lib/exla/priv/libexla: 'librocblas.so.3: cannot open shared object file: No such file or directory'"}}
My guess is that xla_extension needs to be built for ROCm 6 (librocblas.so.4). I tried to build it myself, but the requirements are too far off from the current system (gcc versions and so on).
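A quick way to check this guess is to list which shared libraries the compiled NIF cannot resolve. The following is a minimal sketch, assuming GNU ldd output; the missing_libs helper and the LIBEXLA variable are hypothetical names, and LIBEXLA has to be set to the libexla path printed in the error above.

```shell
#!/bin/sh
# Hypothetical helper: read `ldd` output on stdin and print the names of
# the libraries the dynamic linker could not resolve.
missing_libs() {
  awk '/not found/ { print $1 }'
}

# Assumption: LIBEXLA points at the libexla file named in the load error.
LIBEXLA="${LIBEXLA:-}"
if [ -n "$LIBEXLA" ] && [ -f "$LIBEXLA" ]; then
  ldd "$LIBEXLA" | missing_libs
fi
```

If librocblas.so.3 is listed while the system only ships librocblas.so.4, that would suggest the archive was linked against a different ROCm major version than the one installed.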
It would be great if there were official XLA binaries for different ROCm versions, as there are for CUDA.
I understand ROCm support is low priority, but it is really nice for getting started in AI, as it works nicely on Linux.
I am also trying to reproduce the build by using the provided dockerfiles, but I always get errors:
[3,765 / 6,478] Compiling mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp; 22s local ... (16 actions, 15 running)
ERROR: /app/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/service/gpu/BUILD:1158:23: Compiling xla/service/gpu/cub_sort_kernel.cu.cc failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command (from target //xla/service/gpu:cub_sort_kernel_u32) external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer ... (remaining 100 arguments skipped)
clang++: warning: argument unused during compilation: '-fcuda-flush-denormals-to-zero' [-Wunused-command-line-argument]
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr41 = V_MOV_B32_dpp undef $vgpr41(tied-def 0), $vgpr4, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr4 = V_MOV_B32_dpp undef $vgpr4(tied-def 0), killed $vgpr3, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr3 = V_MOV_B32_dpp undef $vgpr3(tied-def 0), $vgpr2, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr101 = V_MOV_B32_dpp undef $vgpr101(tied-def 0), $vgpr99, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr98 = V_MOV_B32_dpp undef $vgpr98(tied-def 0), $vgpr96, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr101 = V_MOV_B32_dpp undef $vgpr101(tied-def 0), $vgpr99, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr98 = V_MOV_B32_dpp undef $vgpr98(tied-def 0), $vgpr96, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr42 = V_MOV_B32_dpp undef $vgpr42(tied-def 0), $vgpr8, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr101 = V_MOV_B32_dpp undef $vgpr101(tied-def 0), $vgpr99, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr98 = V_MOV_B32_dpp undef $vgpr98(tied-def 0), $vgpr96, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr101 = V_MOV_B32_dpp undef $vgpr101(tied-def 0), $vgpr99, 322, 15, 15, 0, implicit $exec
error: Illegal instruction detected: Invalid dpp_ctrl value: broadcasts are not supported on GFX10+
renamable $vgpr98 = V_MOV_B32_dpp undef $vgpr98(tied-def 0), $vgpr96, 322, 15, 15, 0, implicit $exec
12 errors generated when compiling for gfx1036.
Target //xla/extension:xla_extension failed to build
Did you try building by setting the XLA revision as in https://github.com/elixir-nx/xla/issues/63#issuecomment-1844195261?
Setting up the right environment for building was an issue before, which is why we have the Dockerfile. I don't know about ROCm 6; my best bet is that updating to a newer XLA could fix the build, but that usually involves changes to EXLA too. I think it would be a good idea to update sometime soon anyway, but no guarantees.
You could perhaps use Docker with 5.6 for computations/experimentation altogether, though I get it's not very convenient.
@jalberto I updated to the latest XLA revision and EXLA main already uses that. I tried building with ROCm 5.7, but there were errors indicating that XLA already assumes 6.0 (using symbols defined in 6.0+). So I updated the Docker image and managed to successfully build with ROCm 6.0.
Please try XLA_ARCHIVE_URL=https://static.jonatanklosko.com/builds/0.7.0/xla_extension-x86_64-linux-gnu-rocm.tar.gz and nx/exla main. If it doesn't work, you can also try building locally.
thanks, @jonatanklosko will test and report back
@jonatanklosko sorry for the delay, now I have a different error:
: CommandLine Error: Option 'x86-disable-avoid-SFB' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
@jalberto is it when loading the precompiled binary or during build?
That is what happens when I try to rebuild without cache, and the LLVM error is in the console when I start the livebook server
@jonatanklosko in case it helps:
* Getting nx (https://github.com/elixir-nx/nx.git - origin/main)
remote: Enumerating objects: 22709, done.
remote: Counting objects: 100% (4025/4025), done.
remote: Compressing objects: 100% (780/780), done.
remote: Total 22709 (delta 3456), reused 3661 (delta 3202), pack-reused 18684
* Getting exla (https://github.com/elixir-nx/nx.git - origin/main)
remote: Enumerating objects: 22709, done.
remote: Counting objects: 100% (4047/4047), done.
remote: Compressing objects: 100% (776/776), done.
remote: Total 22709 (delta 3480), reused 3687 (delta 3228), pack-reused 18662
Resolving Hex dependencies...
Resolution completed in 0.126s
New:
castore 1.0.7
certifi 2.12.0
complex 0.5.0
elixir_make 0.8.4
erlexec 2.0.6
finch 0.18.0
fss 0.1.1
hackney 1.20.1
hpax 0.2.0
idna 6.1.1
jason 1.4.1
kino 0.12.3
metrics 1.0.1
mime 2.0.5
mimerl 1.3.0
mint 1.6.0
nimble_options 1.1.1
nimble_ownership 0.3.1
nimble_pool 1.1.0
parse_trans 3.4.1
req 0.4.14
ssl_verify_fun 1.1.7
table 0.1.2
telemetry 1.2.1
tesla 1.9.0
unicode_util_compat 0.7.0
web_driver_client 0.2.0
xla 0.7.0
* Getting web_driver_client (Hex package)
* Getting kino (Hex package)
* Getting req (Hex package)
* Getting erlexec (Hex package)
* Getting telemetry (Hex package)
* Getting xla (Hex package)
* Getting elixir_make (Hex package)
* Getting nimble_pool (Hex package)
* Getting complex (Hex package)
* Getting finch (Hex package)
* Getting jason (Hex package)
* Getting mime (Hex package)
* Getting nimble_ownership (Hex package)
* Getting castore (Hex package)
* Getting mint (Hex package)
* Getting nimble_options (Hex package)
* Getting hpax (Hex package)
* Getting fss (Hex package)
* Getting table (Hex package)
* Getting hackney (Hex package)
* Getting tesla (Hex package)
* Getting certifi (Hex package)
* Getting idna (Hex package)
* Getting metrics (Hex package)
* Getting mimerl (Hex package)
* Getting parse_trans (Hex package)
* Getting ssl_verify_fun (Hex package)
* Getting unicode_util_compat (Hex package)
==> table
Compiling 5 files (.ex)
Generated table app
==> mime
Compiling 1 file (.ex)
Generated mime app
==> nimble_options
Compiling 3 files (.ex)
Generated nimble_options app
===> Analyzing applications...
===> Compiling unicode_util_compat
===> Analyzing applications...
===> Compiling idna
===> Analyzing applications...
===> Compiling telemetry
==> jason
Compiling 10 files (.ex)
Generated jason app
==> hpax
Compiling 4 files (.ex)
Generated hpax app
===> Analyzing applications...
===> Compiling mimerl
==> ssl_verify_fun
Compiling 7 files (.erl)
Generated ssl_verify_fun app
==> fss
Compiling 4 files (.ex)
Generated fss app
==> complex
Compiling 2 files (.ex)
Generated complex app
==> nx
Compiling 35 files (.ex)
Generated nx app
==> kino
Compiling 47 files (.ex)
Generated kino app
===> Analyzing applications...
===> Compiling certifi
===> Analyzing applications...
===> Compiling parse_trans
==> nimble_pool
Compiling 2 files (.ex)
Generated nimble_pool app
===> Fetching rebar3_hex v7.0.7
===> Fetching hex_core v0.8.4
===> Fetching verl v1.1.1
===> Analyzing applications...
===> Compiling hex_core
===> Compiling verl
===> Compiling rebar3_hex
===> Fetching rebar3_ex_doc v0.2.22
===> Analyzing applications...
===> Compiling rebar3_ex_doc
make: Entering directory '/home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/c_src'
g++ -g -std=c++11 -finline-functions -Wall -DHAVE_PTRACE -MMD -DUSE_POLL=1 -O3 -DNDEBUG -DHAVE_SETRESUID -DHAVE_PIPE2 -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -I/home/ja/.local/share/mise/installs/erlang/26.2.5/lib/erl_interface-5.5.1/include -c -o ei++.o ei++.cpp
g++ -g -std=c++11 -finline-functions -Wall -DHAVE_PTRACE -MMD -DUSE_POLL=1 -O3 -DNDEBUG -DHAVE_SETRESUID -DHAVE_PIPE2 -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -I/home/ja/.local/share/mise/installs/erlang/26.2.5/lib/erl_interface-5.5.1/include -c -o exec.o exec.cpp
g++ -g -std=c++11 -finline-functions -Wall -DHAVE_PTRACE -MMD -DUSE_POLL=1 -O3 -DNDEBUG -DHAVE_SETRESUID -DHAVE_PIPE2 -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -I/home/ja/.local/share/mise/installs/erlang/26.2.5/lib/erl_interface-5.5.1/include -c -o exec_impl.o exec_impl.cpp
mkdir -p /home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/priv/x86_64-redhat-linux/
mkdir -p "/home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/priv/x86_64-redhat-linux/"
g++ ei++.o exec.o exec_impl.o -L/home/ja/.local/share/mise/installs/erlang/26.2.5/lib/erl_interface-5.5.1/lib -lei -o /home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/priv/x86_64-redhat-linux/exec-port
make: Leaving directory '/home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/erlexec/c_src'
===> Analyzing applications...
===> Compiling erlexec
===> Analyzing applications...
===> Compiling metrics
===> Analyzing applications...
===> Compiling hackney
==> castore
Compiling 1 file (.ex)
Generated castore app
==> elixir_make
Compiling 8 files (.ex)
Generated elixir_make app
==> xla
Compiling 2 files (.ex)
Generated xla app
==> exla
Unpacking /home/ja/.cache/xla/0.7.0/cache/external/xla_extension-4j534fd5eueir3oelhrj2pvadm.tar.gz into /home/ja/.cache/mix/installs/elixir-1.16.2-erts-14.2.5/946037843196e7227084dde47bdabba6/deps/exla/exla/cache
Using libexla.so from /home/ja/.cache/xla/exla/elixir-1.16.2-erts-14.2.5-xla-0.7.0-exla-0.7.1-4hm2i3sdtzvi2nwhnlfl4jx27u/libexla.so
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla.cc -o cache/objs/exla.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla_mlir.cc -o cache/objs/exla_mlir.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/custom_calls.cc -o cache/objs/custom_calls.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla_client.cc -o cache/objs/exla_client.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla_cuda.cc -o cache/objs/exla_cuda.o
g++ -fPIC -I/home/ja/.local/share/mise/installs/erlang/26.2.5/erts-14.2.5/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -DLLVM_VERSION_STRING= -O3 -c c_src/exla/exla_nif_util.cc -o cache/objs/exla_nif_util.o
g++ cache/objs/exla.o cache/objs/exla_mlir.o cache/objs/custom_calls.o cache/objs/exla_client.o cache/objs/exla_nif_util.o cache/objs/exla_cuda.o -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -shared -Wl,-rpath,'$ORIGIN/xla_extension/lib'
Compiling 23 files (.ex)
As a sanity check, try without XLA_ARCHIVE_URL, which by default should download just the CPU-enabled binary. This way we will know if it is specific to the ROCm binary. Make sure to reinstall without cache.
yes, that worked as expected, no issues
As a side note: I have the same issues building with the new Dockerfile.
I see. I have no idea where this LLVM error is coming from; I didn't find x86-disable-avoid-SFB or X86AvoidStoreForwardingBlocks mentioned explicitly in the openxla/xla source. You can try building yourself with XLA_BUILD=1 just in case, but that's a long shot (and provided that it builds without issues) :<
Not sure if this is completely related, but I'm trying to get ROCm 6 working too and built xla with the Dockerized build.sh script which gave me a tarball.
When I set XLA_ARCHIVE_URL to https://s3.fr-par.scw.cloud/assets.stanko.io/hex/xla/rocm/0.8.0/xla_extension-x86_64-linux-gnu-rocm.tar.gz (which is the tarball I built), Mix.install passes, but when something calls EXLA I get an "unable to find ld.lld in PATH: No such file or directory" error. I tried adding both the ROCm bins and the LLVM bins to my PATH (export PATH="/opt/rocm/llvm/bin:/opt/rocm/bin:$PATH"), but I still get the same error, even though I can invoke ld.lld from my shell and I run livebook server from the same shell.
Just wanted to ask if this is a known problem, or if someone has some pointers for debugging this? Is there another way to use the built tarball except uploading it somewhere and setting XLA_ARCHIVE_URL?
And are there plans to provide pre-built ROCm packages like for CUDA? I know, AMD GFX isn't popular in data centers, but from my experience it's fairly common on desktops and development machines.
@monorkin it may not be related, but the only thing I can think of is to also set export ROCM_PATH="/opt/rocm-6.0" (or whatever the version is).
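The suggestion above can be sketched as a small shell snippet to run before starting livebook server. This is only a sketch under assumptions: first_existing_dir is a hypothetical helper, the candidate directories are the ones mentioned in this thread and may not match your install, and whether XLA then finds ld.lld is not guaranteed.

```shell
#!/bin/sh
# Hypothetical helper: print the first argument that is an existing directory.
first_existing_dir() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# Assumption: ROCm lives under one of these prefixes; adjust to your install.
if rocm_dir=$(first_existing_dir /opt/rocm-6.0 /opt/rocm /usr/lib64/rocm); then
  export ROCM_PATH="$rocm_dir"
  export PATH="$ROCM_PATH/llvm/bin:$ROCM_PATH/bin:$PATH"
fi

# Report whether the linker named in the error is visible from this shell.
command -v ld.lld || echo "ld.lld not found in PATH" >&2
```

Starting the server from the same shell afterwards means it inherits both ROCM_PATH and the extended PATH.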
Is there another way to use the built tarball except uploading it somewhere and setting XLA_ARCHIVE_URL?
I've just added support for XLA_ARCHIVE_PATH (#99), but that's going to be applicable only from the next release, so for now you need to use XLA_ARCHIVE_URL.
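As a rough illustration of the difference between the two variables, the sketch below picks the variable name from the archive location: XLA_ARCHIVE_URL for something that has to be downloaded, XLA_ARCHIVE_PATH for a local file (only on a release that includes #99). The xla_archive_var helper and the tarball location are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the variable name that matches the archive location.
xla_archive_var() {
  case "$1" in
    http://*|https://*) echo "XLA_ARCHIVE_URL" ;;
    *) echo "XLA_ARCHIVE_PATH" ;;
  esac
}

# Assumption: the tarball produced by the Docker build was copied here.
archive="$HOME/xla_extension-x86_64-linux-gnu-rocm.tar.gz"
export "$(xla_archive_var "$archive")=$archive"
```

With the variable exported, a livebook server or mix process started from the same shell inherits it.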
And are there plans to provide pre-built ROCm packages like for CUDA?
Not at the moment. The ROCm support is somewhat experimental, in the sense that we don't have the capacity to test it on every release and maintain possibly multiple precompiled builds. Jax (the Python library using XLA) also considers it experimental. This may change in the future, depending on how the ROCm prominence evolves upstream.
@jonatanklosko that did the trick! Thank you!
Now I have a different problem where after creating a serving the runtime crashes.
I added LIVEBOOK_DEBUG=true before running the server, but the log just stops before the crash.
Is there a way to increase the verbosity? Or another way to check why the runtime crashed?
UPDATE: Seems I've run into an OOM issue with my graphics card similar to this issue
Anybody have luck building this lately? I thought it was just my system that wasn't able to compile XLA, but I'm getting the same errors with the docker build:
$ ./build.sh rocm
[...]
INFO: Analyzed target //xla/extension:xla_extension (283 packages loaded, 39134 targets configured).
ERROR: /root/.cache/bazel/_bazel_root/77031b6b54d069fa14d9031c964d5f8f/external/zlib/BUILD.bazel:5:11: Compiling zutil.c failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing CppCompile command (from target @@zlib//:zlib) external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer ... (remaining 39 arguments skipped)
gcc: error: unrecognized command-line option ‘-Qunused-arguments’
Target //xla/extension:xla_extension failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 38.657s, Critical Path: 0.30s
INFO: 507 processes: 498 internal, 9 local.
ERROR: Build did NOT complete successfully
make: *** [Makefile:24: /build/0.9.0/build/xla_extension-0.9.0-x86_64-linux-gnu-rocm.tar.gz] Error 1
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".
I see the error "-Qunused-arguments". Are you by any chance setting CFLAGS or something like that and made a typo? I think it's either that, or a typo somewhere in our code.
-Qunused-arguments is a clang flag, which we now pass:
https://github.com/elixir-nx/xla/blob/52ea2e8cffcbea0775b4a4b3056bb2e61bb8c727/lib/xla.ex#L349
Given gcc: error:, it looks like it still tries to use gcc, I'm not sure why. The stacktrace says external/zlib/BUILD.bazel:5:11, so maybe that's something specific to that part of the build.
We may need adjustments to the build flags to work with ROCM again :<
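To confirm which compiler rejects the flag, one can probe each compiler directly. The following is a minimal sketch, assuming gcc and/or clang are on PATH; accepts_flag is a hypothetical helper and not part of the xla build.

```shell
#!/bin/sh
# Hypothetical helper: succeed if compiler $1 accepts flag $2 when checking
# a trivial C program read from stdin (syntax check only, no output file).
accepts_flag() {
  echo 'int main(void) { return 0; }' |
    "$1" "$2" -fsyntax-only -x c - >/dev/null 2>&1
}

for cc in gcc clang; do
  if command -v "$cc" >/dev/null 2>&1; then
    if accepts_flag "$cc" -Qunused-arguments; then
      echo "$cc accepts -Qunused-arguments"
    else
      echo "$cc rejects -Qunused-arguments"
    fi
  fi
done
```

If gcc is reported as rejecting the flag, the failure in the log points at the wrapper invoking gcc rather than at the flag itself.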
TIL 😂 I honestly thought it was supposed to be -Wunused-arguments. Might it be the case that setting CC=clang and CXX=clang++ would fix things? Maybe it's something in their path defaulting to gcc.
We already set those env vars:
https://github.com/elixir-nx/xla/blob/52ea2e8cffcbea0775b4a4b3056bb2e61bb8c727/lib/xla.ex#L343-L344
The snippet mentions $ ./build.sh rocm, which builds inside Docker. We should fix the Docker build sooner or later, but unfortunately it's not a priority for now.
I see the error "-Qunused-arguments". Are you by any chance setting CFLAGS or something like that and made a typo? I think it's either that, or a typo somewhere in our code.
Nope, nothing like that.
I get the same error when trying to build locally, so I don't think it's specific to the Docker setup, at least:
Mix.install(
  [
    {:axon, "~> 0.5"},
    {:nx, "~> 0.5"},
    {:exla, "~> 0.5"},
    {:stb_image, "~> 0.6"},
    {:kino, "~> 0.8"}
  ],
  system_env: %{
    "XLA_TARGET" => "rocm",
    "XLA_BUILD" => "true"
  }
)
ERROR: /home/pikdum/.cache/bazel/_bazel_pikdum/7f24167c0c1f492a25bcd1adc366e6a8/external/com_google_absl/absl/base/BUILD.bazel:53:11: Compiling absl/base/log_severity.cc failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing CppCompile command (from target @@com_google_absl//absl/base:log_severity) external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer ... (remaining 50 arguments skipped)
gcc: error: unrecognized command-line option ‘-Qunused-arguments’
Target //xla/extension:xla_extension failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 0.387s, Critical Path: 0.08s
INFO: 17 processes: 17 internal.
ERROR: Build did NOT complete successfully
make: *** [Makefile:24: /home/pikdum/.cache/xla/0.9.0/build/xla_extension-0.9.0-x86_64-linux-gnu-rocm.tar.gz] Error 1
could not compile dependency :xla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile xla --force", update it with "mix deps.update xla" or clean it with "mix deps.clean xla"
** (Mix.Error) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".
(mix 1.18.4) lib/mix.ex:618: Mix.raise/2
(elixir_make 0.9.0) lib/elixir_make/compiler.ex:53: ElixirMake.Compiler.compile/1
(mix 1.18.4) lib/mix/task.ex:495: anonymous fn/3 in Mix.Task.run_task/5
(mix 1.18.4) lib/mix/tasks/compile.all.ex:117: Mix.Tasks.Compile.All.run_compiler/2
(mix 1.18.4) lib/mix/tasks/compile.all.ex:97: Mix.Tasks.Compile.All.compile/4
(mix 1.18.4) lib/mix/tasks/compile.all.ex:71: Mix.Tasks.Compile.All.do_run/2
(mix 1.18.4) lib/mix/task.ex:495: anonymous fn/3 in Mix.Task.run_task/5
(mix 1.18.4) lib/mix/tasks/compile.ex:142: Mix.Tasks.Compile.run/1
Same error whether running in a livebook, exs file, or the docker build.sh script.
while I couldn't get the docker build to work I was able to get it to build in my env.
i tried many many things, so i'm not 100% sure what got it to go.
sudo apt-get install clang-18
asdf plugin add bazel
asdf install bazel 7.4.1
asdf set -u bazel 7.4.1
export PATH="/opt/rocm-6.3.4/lib/llvm/bin:$PATH"
CC=/opt/rocm-6.3.4/lib/llvm/bin/clang CXX=/opt/rocm-6.3.4/lib/llvm/bin/clang++ XLA_TARGET=rocm XLA_BUILD=true mix compile
the battle between gcc and clang and that wrapper from bazel is epic. i also noticed that the wrapper /root/.cache/bazel/_bazel_root/77031b6b54d069fa14d9031c964d5f8f/external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc has USE_CLANG hard-coded to false via some wonky "True" == "False" comparison, which is very odd.
now i have to figure out how to use it....
i'm on WSL btw.
I was able to run mix compile, but then what? My nx install doesn't seem to be able to talk to rocm; it only seems to have the host platform available. Is there a mix task to create the tarball, or is there a command I should find in the docker scripts somewhere?
iex(1)> EXLA.Client.get_supported_platforms
%{host: 32}
my config is
# Configure Nx backend - will use EXLA if available, otherwise BinaryBackend
#config :nx, :default_backend, {Nx.BinaryBackend, []}
# Configure EXLA with ROCm GPU acceleration
#config :nx, :default_backend, {EXLA.Backend, client: :rocm}
config :nx, :default_backend, EXLA.Backend
#config :nx, :default_backend, {EXLA.Backend, client: :rocm, device_id: 1}
#config :nx, :default_defn_options, [client: :rocm, compiler: EXLA, device_id: 0]
config :nx, :default_defn_options, [ compiler: EXLA ]
config :exla, :preferred_clients, [:rocm,:host ]
config :exla, :clients,
#cuda: [platform: :cuda],
rocm: [platform: :rocm,default_device_id: 0],
#tpu: [platform: :tpu],
host: [platform: :host]
...
It would be helpful to see the raw compilation logs, especially to check what XLA archive is being used. The config itself seems correct
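The relevant lines can be pulled out of a captured build log. This is a minimal sketch, assuming the EXLA compilation output was saved to a file named build.log (a hypothetical name) and that it contains the "Unpacking ..." and "Using libexla.so from ..." lines seen in the logs earlier in this thread; archive_lines is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: keep only the log lines that say which XLA archive
# was unpacked and which cached libexla.so was reused.
archive_lines() {
  grep -E '^(Unpacking |Using libexla\.so from )'
}

# Assumption: the compilation output was captured into build.log beforehand.
if [ -f build.log ]; then
  archive_lines < build.log || echo "no archive lines found in build.log"
fi
```

An archive name ending in -rocm.tar.gz on the "Unpacking" line would indicate the ROCm build is the one being used.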
I noticed rocm has an xla fork, is that what is being used?
It would be helpful to see the raw compilation logs, especially to check what XLA archive is being used. The config itself seems correct
looks like my build.log got overwritten.
is there some way to test the xla that I compiled vs the exla that my project actually seems to be using (the one that doesn't seem to use my GPU)?
the one I compiled:
iex(1)> XLA.__info__(:functions)
[
archive_filename_with_target: 0,
archive_path!: 0,
build_archive_dir: 0,
make_env: 0,
precompiled_files: 0,
version: 0,
write_checksums!: 1
]
i cleaned exla and recompiled it, and I can now see rocm as available. here is the build output. odd that it used gcc, since the xla stuff seemed to break with gcc. I'm not really sure how i got this to work.
==> exla
Unpacking /home/schoch/.cache/xla/0.9.1/build/xla_extension-0.9.1-x86_64-linux-gnu-rocm.tar.gz into /mnt/c/Files/downloads/oracle-database_23/deps/exla/cache
Using libexla.so from /home/schoch/.cache/xla/exla/elixir-1.18.4-erts-14.2.5.10-xla-0.9.1-exla-0.10.0-ovopfk7dyws724rrrlopc4ywny/libexla.so
EXLA_CPU_ONLY is not set, checking for nvcc availability
CUDA is not available.
make: Warning: File 'cache/libexla.so' has modification time 1.1 s in the future
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/exla.cc -o cache/0.10.0/objs/exla.o
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/exla_client.cc -o cache/0.10.0/objs/exla_client.o
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/exla_mlir.cc -o cache/0.10.0/objs/exla_mlir.o
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/ipc.cc -o cache/0.10.0/objs/ipc.o
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/custom_calls/eigh_f32.cc -o cache/0.10.0/objs/custom_calls/eigh_f32.o
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/custom_calls/eigh_f64.cc -o cache/0.10.0/objs/custom_calls/eigh_f64.o
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/custom_calls/lu_bf16.cc -o cache/0.10.0/objs/custom_calls/lu_bf16.o
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/custom_calls/lu_f16.cc -o cache/0.10.0/objs/custom_calls/lu_f16.o
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/custom_calls/lu_f32.cc -o cache/0.10.0/objs/custom_calls/lu_f32.o
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/custom_calls/lu_f64.cc -o cache/0.10.0/objs/custom_calls/lu_f64.o
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/custom_calls/qr_bf16.cc -o cache/0.10.0/objs/custom_calls/qr_bf16.o
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/custom_calls/qr_f16.cc -o cache/0.10.0/objs/custom_calls/qr_f16.o
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/custom_calls/qr_f32.cc -o cache/0.10.0/objs/custom_calls/qr_f32.o
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/exla_cuda.cc -o cache/0.10.0/objs/exla_cuda.o
g++ -fPIC -I/home/schoch/.asdf/installs/erlang/26.2.5.13/erts-14.2.5.10/include -I/mnt/c/Files/downloads/oracle-database_23/deps/fine/include -Icache/xla_extension/include -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -std=c++17 -w -O3 -c c_src/exla/custom_calls/qr_f64.cc -o cache/0.10.0/objs/custom_calls/qr_f64.o
g++ cache/0.10.0/objs/exla.o cache/0.10.0/objs/exla_client.o cache/0.10.0/objs/exla_mlir.o cache/0.10.0/objs/ipc.o cache/0.10.0/objs/custom_calls/eigh_f32.o cache/0.10.0/objs/custom_calls/eigh_f64.o cache/0.10.0/objs/custom_calls/lu_bf16.o cache/0.10.0/objs/custom_calls/lu_f16.o cache/0.10.0/objs/custom_calls/lu_f32.o cache/0.10.0/objs/custom_calls/lu_f64.o cache/0.10.0/objs/custom_calls/qr_bf16.o cache/0.10.0/objs/custom_calls/qr_f16.o cache/0.10.0/objs/custom_calls/qr_f32.o cache/0.10.0/objs/custom_calls/qr_f64.o cache/0.10.0/objs/exla_cuda.o -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -shared -fvisibility=hidden -Wl,-rpath,'$ORIGIN/xla_extension/lib'
make: warning: Clock skew detected. Your build may be incomplete.
Compiling 23 files (.ex)
Generated exla app```
The logs do seem to point to a ROCm build, assuming you let it build to the standard :xla cache path. Try setting EXLA_TARGET=rocm as well.
Edit: the flag seems to have no effect other than setting the default client, so I don't think it will make a difference.
I've set EXLA_TARGET=rocm and XLA_TARGET=rocm in my mix.env. I haven't played with the device_id yet, but it does seem to hit the GPU properly:
iex -S mix
Erlang/OTP 26 [erts-14.2.5.10] [source] [64-bit] [smp:32:32] [ds:32:32:10] [async-threads:1] [jit:ns]
Generated oracle_knowledge_graph app
Interactive Elixir (1.18.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> EXLA.Client.get_supported_platforms
%{host: 32, rocm: 1}
Makes sense, XLA_TARGET=rocm is what you need!
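For anyone landing here later, a minimal sketch of pinning Nx to the ROCm client once the platform shows up as in the session above. It mirrors the `default_backend: {EXLA.Backend, client: :host}` setting from the original post, swapping in `:rocm`. The file path and the idea that you want this as a project-wide default are assumptions, not something stated in this thread.

```elixir
# config/config.exs (assumed location; adapt to your project)
import Config

# Assumes XLA_TARGET=rocm was set in the environment when the deps were compiled,
# and that EXLA.Client.get_supported_platforms/0 lists :rocm as shown above.
config :nx, default_backend: {EXLA.Backend, client: :rocm}
```

If `:rocm` is missing from the supported platforms, this setting alone will not help; the xla_extension build itself has to target ROCm first.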
Sadly, I'm getting some very odd behavior:
```
E0000 00:00:1753492043.414392 86341 buffer_comparator.cc:147] Difference at 33: 0.1062, expected 53.5564
E0000 00:00:1753492043.414398 86341 buffer_comparator.cc:147] Difference at 34: 0.463619, expected 53.5787
E0000 00:00:1753492043.414401 86341 buffer_comparator.cc:147] Difference at 35: 0.570177, expected 53.4332
E0000 00:00:1753492043.414403 86341 buffer_comparator.cc:147] Difference at 36: 0.462466, expected 55.7302
E0000 00:00:1753492043.414409 86341 buffer_comparator.cc:147] Difference at 37: -0.272141, expected 54.0394
E0000 00:00:1753492043.414415 86341 buffer_comparator.cc:147] Difference at 38: -0.300299, expected 54.1949
E0000 00:00:1753492043.414419 86341 buffer_comparator.cc:147] Difference at 39: -0.090997, expected 53.0227
E0000 00:00:1753492043.414423 86341 buffer_comparator.cc:147] Difference at 40: -0.252856, expected 50.3297
E0000 00:00:1753492043.414427 86341 buffer_comparator.cc:147] Difference at 41: -0.280166, expected 52.8371
18:07:23.414 [error] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0000 00:00:1753492043.461071 86341 buffer_comparator.cc:147] Difference at 0: 69.7469, expected 52.3189
E0000 00:00:1753492043.461157 86341 buffer_comparator.cc:147] Difference at 1: 70.6092, expected 52.2482
E0000 00:00:1753492043.461165 86341 buffer_comparator.cc:147] Difference at 2: 67.8949, expected 56.1704
E0000 00:00:1753492043.461169 86341 buffer_comparator.cc:147] Difference at 3: 69.749, expected 54.5992
E0000 00:00:1753492043.461173 86341 buffer_comparator.cc:147] Difference at 4: 82.1547, expected 57.0171
E0000 00:00:1753492043.461177 86341 buffer_comparator.cc:147] Difference at 5: 76.9398, expected 53.0067
E0000 00:00:1753492043.461181 86341 buffer_comparator.cc:147] Difference at 6: 81.6174, expected 57.8034
E0000 00:00:1753492043.461185 86341 buffer_comparator.cc:147] Difference at 7: 71.3357, expected 55.9839
E0000 00:00:1753492043.461189 86341 buffer_comparator.cc:147] Difference at 8: 78.9393, expected 53.2206
E0000 00:00:1753492043.461192 86341 buffer_comparator.cc:147] Difference at 9: 71.1563, expected 54.8678
18:07:23.465 [error] Results do not match the reference. This is likely a bug/unexpected loss of precision.
E0000 00:00:1753492044.716026 86341 buffer_comparator.cc:147] Difference at 32: 0.10948, expected 13.7716
E0000 00:00:1753492044.716110 86341 buffer_comparator.cc:147] Difference at 33: -0.000892339, expected 16.091
E0000 00:00:1753492044.716115 86341 buffer_comparator.cc:147] Difference at 34: 0.0728494, expected 14.0829
E0000 00:00:1753492044.716118 86341 buffer_comparator.cc:147] Difference at 35: -0.473043, expected 13.8128
E0000 00:00:1753492044.716127 86341 buffer_comparator.cc:147] Difference at 36: -0.262443, expected 14.2182
E0000 00:00:1753492044.716130 86341 buffer_comparator.cc:147] Difference at 37: -0.0827416, expected 15.0923
E0000 00:00:1753492044.716132 86341 buffer_comparator.cc:147] Difference at 38: 0.270168, expected 14.415
E0000 00:00:1753492044.716135 86341 buffer_comparator.cc:147] Difference at 39: -0.019004, expected 14.7978
E0000 00:00:1753492044.716139 86341 buffer_comparator.cc:147] Difference at 40: -0.0446301, expected 14.2067
E0000 00:00:1753492044.716157 86341 buffer_comparator.cc:147] Difference at 41: 0.115758, expected 14.0595
18:07:24.716 [error] Results do not match the reference. This is likely a bug/unexpected loss of precision.
Segmentation fault (core dumped)
```
@jschoch unfortunately this looks like an upstream bug (in XLA). The best chance of getting it fixed is to narrow down the computation that causes the issue, then find a reproduction in JAX and open an issue on the JAX repo. It may be trickier with ROCm, since there are no official JAX builds, but AMD maintains Docker images with such builds.
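As a rough illustration of the narrowing-down step, here is a sketch in plain Elixir that runs a list of candidate computations through a trusted runner and a suspect runner, then reports which candidates disagree. The `Bisect` module, its function names, and the tolerance are all hypothetical, and the "suspect" runner below is a stand-in that deliberately corrupts results so the example runs without a GPU.

```elixir
defmodule Bisect do
  # Hypothetical helper, not part of Nx or EXLA.
  # `reference` and `suspect` each take a candidate function and return
  # a flat list of numbers. Returns the names of candidates whose outputs
  # differ by more than `tol` at any position (or differ in length).
  def diverging(candidates, reference, suspect, tol \\ 1.0e-3) do
    for {name, fun} <- candidates,
        mismatch?(suspect.(fun), reference.(fun), tol),
        do: name
  end

  defp mismatch?(actual, expected, tol) do
    length(actual) != length(expected) or
      Enum.any?(Enum.zip(actual, expected), fn {a, e} -> abs(a - e) > tol end)
  end
end

input = [1.0, 2.0, 3.0]

candidates = [
  sum: fn xs -> [Enum.sum(xs)] end,
  double: fn xs -> Enum.map(xs, &(&1 * 2)) end
]

# Trusted runner: just applies the function.
reference = fn fun -> fun.(input) end

# Stand-in for a misbehaving device: corrupts any multi-element result.
suspect = fn fun ->
  case fun.(input) do
    [single] -> [single]
    many -> Enum.map(many, &(&1 + 10.0))
  end
end

result = Bisect.diverging(candidates, reference, suspect)
IO.inspect(result)
```

In a real session the two runners would presumably execute the same function on the `:host` and `:rocm` clients and flatten the resulting tensors (for example with `Nx.to_flat_list/1`). That wiring is an assumption here and is left out. Once a single small computation is isolated this way, it is much easier to port to JAX for the upstream report.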