
TensorFlow on RTX 5090

Open maludwig opened this issue 10 months ago • 64 comments

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

binary

TensorFlow version

2.20.0.dev20250314

Custom code

No

OS platform and distribution

Windows 11 - WSL2 - Ubuntu 22.04.5 LTS

Mobile device

No response

Python version

3.10.12

Bazel version

7.4.1

GCC/compiler version

gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

CUDA/cuDNN version

CUDA Version: 12.8

GPU model and memory

RTX 5090 32GB

Current behavior?

I had hoped that TensorFlow would work on the RTX 5090 at all. Sadly, it does not. I tried building from source, but that didn't work. I tried running the environment script, but that didn't work either. At least bash is my primary programming language, so I was able to tidy that script up here:

https://github.com/tensorflow/tensorflow/pull/89271

But I wasn't able to get TensorFlow running. I had a similar issue with PyTorch, which needed CUDA 12.8.* to work on the Blackwell cards, but no dice with the nightly build of TensorFlow. Below is my test and its output, and under that is the tf_env.txt from my patched script.

It may be helpful to know that NVIDIA themselves seem to have it running here:

https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/rel-25-02.html

But I get the same errors that this other guy does when I try it out:

https://www.reddit.com/r/tensorflow/comments/1iutjoj/tensorflow_2501_cuda_128_rtx_5090_on_wsl2_cuda/

This conversation was another one I found that may be helpful. According to the people there, you need to support CUDA 12.8.1 in order to support Blackwell (aka the RTX 50## series cards):

https://discuss.ai.google.dev/t/building-tensorflow-from-source-for-rtx5000-gpu-series/65171/15


(tfnightie) mitch@win11ml:~/stable_diff
$ cat tfnightie/test_2.py
import tensorflow as tf
import time

# Check if TensorFlow sees the GPU
print("TensorFlow version:", tf.__version__)
print("Available GPUs:", tf.config.experimental.list_physical_devices('GPU'))

# Matrix multiplication test
shape = (5000, 5000)
a = tf.random.normal(shape)
b = tf.random.normal(shape)

# Time execution on GPU
with tf.device('/GPU:0'):
    print("Running on GPU...")
    start_time = time.time()
    c = tf.matmul(a, b)
    tf.print("Matrix multiplication (GPU) done.")
    print("Execution time (GPU):", time.time() - start_time, "seconds")

# Time execution on CPU for comparison
with tf.device('/CPU:0'):
    print("Running on CPU...")
    start_time = time.time()
    c = tf.matmul(a, b)
    tf.print("Matrix multiplication (CPU) done.")
    print("Execution time (CPU):", time.time() - start_time, "seconds")




(tfnightie) mitch@win11ml:~/stable_diff
$ python tfnightie/test_2.py
2025-03-14 21:35:33.400099: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
TensorFlow version: 2.20.0-dev20250314
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1742009735.413544  326199 gpu_device.cc:2429] TensorFlow was not built with CUDA kernel binaries compatible with compute capability 12.0. CUDA kernels will be jit-compiled from PTX, which could take 30 minutes or longer.
Available GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
W0000 00:00:1742009735.417720  326199 gpu_device.cc:2429] TensorFlow was not built with CUDA kernel binaries compatible with compute capability 12.0. CUDA kernels will be jit-compiled from PTX, which could take 30 minutes or longer.
I0000 00:00:1742009735.572153  326199 gpu_device.cc:2018] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29043 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 5090, pci bus id: 0000:09:00.0, compute capability: 12.0
2025-03-14 21:35:36.969440: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_INVALID_PTX'

2025-03-14 21:35:36.969480: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleGetFunction(&function, module, kernel_name)' failed with 'CUDA_ERROR_INVALID_HANDLE'

2025-03-14 21:35:36.969505: W tensorflow/core/framework/op_kernel.cc:1843] INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
2025-03-14 21:35:36.969533: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
Traceback (most recent call last):
  File "/home/mitch/stable_diff/tfnightie/test_2.py", line 10, in <module>
    a = tf.random.normal(shape)
  File "/home/mitch/.virtualenvs/tfnightie/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/mitch/.virtualenvs/tfnightie/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 6027, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InternalError: {{function_node __wrapped__Mul_device_/job:localhost/replica:0/task:0/device:GPU:0}} 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE' [Op:Mul] name:

Also, while NVIDIA's site says that the compute capability of the RTX 5090 is "10.0", the card itself reports "12.0". I'm not sure that info will be helpful, but it threw me for a loop:


$ cat <<EOF > card_details.cu
#include <cuda_runtime.h>
#include <iostream>

int main() {
    cudaDeviceProp prop;
    int device;

    cudaGetDevice(&device); // Get the current device ID
    cudaGetDeviceProperties(&prop, device); // Get device properties

    size_t free_mem, total_mem;
    cudaMemGetInfo(&free_mem, &total_mem); // Get VRAM usage

    std::cout << "GPU Name: " << prop.name << std::endl;
    std::cout << "Compute Capability: " << prop.major << "." << prop.minor << std::endl;
    std::cout << "VRAM Usage: " << (total_mem - free_mem) / (1024 * 1024) << " MB / " << total_mem / (1024 * 1024) << " MB" << std::endl;

    return 0;
}
EOF



$ nvcc card_details.cu -o card_details && ./card_details
nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
GPU Name: NVIDIA GeForce RTX 5090
Compute Capability: 12.0
VRAM Usage: 1763 MB / 32606 MB
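
If you don't want to compile anything, a recent nvidia-smi can report the same value directly (this assumes your driver supports the compute_cap query field; the 570/572 drivers shown in tf_env.txt below do):

# Ask the driver for the compute capability directly
nvidia-smi --query-gpu=name,compute_cap --format=csv
# Expected on this card, matching the CUDA program above:
# name, compute_cap
# NVIDIA GeForce RTX 5090, 12.0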

tf_env.txt


== check python ====================================================
python version: 3.10.12
python branch:
python build version: ('main', 'Feb  4 2025 14:57:36')
python compiler version: GCC 11.4.0
python implementation: CPython


== check os platform ===============================================
os: Linux
os kernel version: #1 SMP Tue Nov 5 00:21:55 UTC 2024
os release version: 5.15.167.4-microsoft-standard-WSL2
os platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
freedesktop os release: {'NAME': 'Ubuntu', 'ID': 'ubuntu', 'PRETTY_NAME': 'Ubuntu 22.04.5 LTS', 'VERSION_ID': '22.04', 'VERSION': '22.04.5 LTS (Jammy Jellyfish)', 'VERSION_CODENAME': 'jammy', 'ID_LIKE': 'debian', 'HOME_URL': 'https://www.ubuntu.com/', 'SUPPORT_URL': 'https://help.ubuntu.com/', 'BUG_REPORT_URL': 'https://bugs.launchpad.net/ubuntu/', 'PRIVACY_POLICY_URL': 'https://www.ubuntu.com/legal/terms-and-policies/privacy-policy', 'UBUNTU_CODENAME': 'jammy'}
mac version: ('', ('', '', ''), '')
uname: uname_result(system='Linux', node='win11ml', release='5.15.167.4-microsoft-standard-WSL2', version='#1 SMP Tue Nov 5 00:21:55 UTC 2024', machine='x86_64')
architecture: ('64bit', 'ELF')
machine: x86_64

== are we in docker ================================================
No

== c++ compiler ====================================================
/usr/bin/c++
c++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


== check pips ======================================================
numpy                   2.1.3
protobuf                5.29.3
tf_nightly              2.20.0.dev20250314

== check for virtualenv ============================================
Running inside a virtual environment.

== tensorflow import ===============================================
2025-03-14 21:02:48.002965: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1742007769.198398  317963 gpu_device.cc:2429] TensorFlow was not built with CUDA kernel binaries compatible with compute capability 12.0. CUDA kernels will be jit-compiled from PTX, which could take 30 minutes or longer.
W0000 00:00:1742007769.202246  317963 gpu_device.cc:2429] TensorFlow was not built with CUDA kernel binaries compatible with compute capability 12.0. CUDA kernels will be jit-compiled from PTX, which could take 30 minutes or longer.
I0000 00:00:1742007769.355021  317963 gpu_device.cc:2018] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29043 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 5090, pci bus id: 0000:09:00.0, compute capability: 12.0

tf.version.VERSION = 2.20.0-dev20250314
tf.version.GIT_VERSION = v1.12.1-123444-g07ff428d432
tf.version.COMPILER_VERSION = Ubuntu Clang 18.1.8 (++20240731024944+3b5b5c1ec4a3-1~exp1~20240731145000.144)

Sanity check: <tf.Tensor: shape=(1,), dtype=int32, numpy=array([1], dtype=int32)>
libcudnn not found

== env =============================================================
LD_LIBRARY_PATH /usr/local/cuda-12.8/lib64:
DYLD_LIBRARY_PATH is unset

== nvidia-smi ======================================================
Fri Mar 14 21:02:52 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 572.70         CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:09:00.0 Off |                  N/A |
|  0%   43C    P1             78W /  600W |    2115MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A              31      G   /Xwayland                             N/A      |
|    0   N/A  N/A              35      G   /Xwayland                             N/A      |
+-----------------------------------------------------------------------------------------+

== cuda libs =======================================================
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudart.so.11.8.89
/usr/local/cuda-12.8/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-12.8/targets/x86_64-linux/lib/libcudart.so.12.8.90

== tensorflow installation =========================================
tensorflow not found

== tf_nightly installation =========================================
Name: tf_nightly
Version: 2.20.0.dev20250314
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author-email: [email protected]
License: Apache 2.0
Location: /home/mitch/.virtualenvs/tfnightie/lib/python3.10/site-packages
Required-by:

== python version ==================================================
(major, minor, micro, releaselevel, serial)
(3, 10, 12, 'final', 0)

== bazel version ===================================================
Bazelisk version: v1.25.0
Build label: 7.4.1
Build time: Mon Nov 11 21:24:53 2024 (1731360293)
Build timestamp: 1731360293
Build timestamp as int: 1731360293

Standalone code to reproduce the issue

Try running anything with an RTX 5090. My test script is above.
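
For a quicker repro than the full script above, a one-liner like this (a sketch, assuming the same tf_nightly venv) should hit the same CUDA_ERROR_INVALID_PTX / CUDA_ERROR_INVALID_HANDLE path, since any GPU kernel launch fails:

# Minimal smoke test: generating a random tensor on the GPU is enough to trigger the failure
python -c "import tensorflow as tf; print(tf.random.normal((64, 64)))"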

Relevant log output


maludwig avatar Mar 15 '25 03:03 maludwig

same problem

huiyijiangling avatar Mar 15 '25 07:03 huiyijiangling

I should mention that I'm a Senior AI Developer by trade and I'm more than happy to invest my personal time in helping to fix this; I'm just not sure where to start.

maludwig avatar Mar 17 '25 02:03 maludwig

I should also mention that the latest clang release here supports building for compute_100/sm_100+

https://github.com/llvm/llvm-project/releases/tag/llvmorg-20.1.0

It's not supported in LLVM 18, but LLVM 20 compiles this for my GPU just fine (extra logs attached below in case they help someone else).


mitch@win11ml:~/stable_diff/build_tf/hello/hello_nvcc
$ clang++ --version
clang version 20.1.0 (https://github.com/llvm/llvm-project 24a30daaa559829ad079f2ff7f73eb4e18095f88)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/mitch/stable_diff/fix_tf/llvm/LLVM-20.1.0-Linux-X64/bin




mitch@win11ml:~/stable_diff/build_tf/hello/hello_nvcc
$ cat card_details.cu
#include <cuda_runtime.h>
#include <iostream>

int main() {
    cudaDeviceProp prop;
    int device;

    cudaGetDevice(&device); // Get the current device ID
    cudaGetDeviceProperties(&prop, device); // Get device properties

    size_t free_mem, total_mem;
    cudaMemGetInfo(&free_mem, &total_mem); // Get VRAM usage

    std::cout << "GPU Name: " << prop.name << std::endl;
    std::cout << "Compute Capability: " << prop.major << "." << prop.minor << std::endl;
    std::cout << "VRAM Usage: " << (total_mem - free_mem) / (1024 * 1024) << " MB / " << total_mem / (1024 * 1024) << " MB" << std::endl;

    return 0;
}




mitch@win11ml:~/stable_diff/build_tf/hello/hello_nvcc
$ clang++ -std=c++17 --cuda-gpu-arch=sm_120 -x cuda --cuda-path="$CUDA_HOME" -I"$CUDA_HOME/include" -L"$CUDA_HOME/lib64"  -lcudart card_details.cu -o card_details
clang++: warning: CUDA version 12.8 is only partially supported [-Wunknown-cuda-version]




mitch@win11ml:~/stable_diff/build_tf/hello/hello_nvcc
$ ./card_details
GPU Name: NVIDIA GeForce RTX 5090
Compute Capability: 12.0
VRAM Usage: 1763 MB / 32606 MB




mitch@win11ml:~/stable_diff/build_tf/hello/hello_nvcc
$ echo "$CUDA_HOME"
/usr/local/cuda-12.8




mitch@win11ml:~/stable_diff/build_tf/hello/hello_nvcc
$ ls "$CUDA_HOME"
DOCS  EULA.txt  README  bin  compute-sanitizer  doc  extras  gds  include  lib64  libnvvp  nsightee_plugins  nvml  nvvm  share  src  targets  tools  version.json




mitch@win11ml:~/stable_diff/build_tf/hello/hello_nvcc
$ cat /usr/local/cuda-12.8/version.json | head -n5
{
   "cuda" : {
      "name" : "CUDA SDK",
      "version" : "12.8.1"
   },

maludwig avatar Mar 17 '25 04:03 maludwig

I'm going to keep writing up my attempts to get things working here. I've cut a branch on my fork; still no luck, but here are some half-discoveries. More and more of the project builds as I continue, and I have zero idea how far I am from victory. Here's the branch I'm on, compared with the base:

https://github.com/maludwig/tensorflow/compare/ml/fixing_tf_env...maludwig:tensorflow:ml/attempting_build_rtx5090?expand=1

A few findings:

  • CUDA 12.8.1 adds support for the RTX 5090 (and the other Blackwells), so we need that (quick check below, after this list)
  • There's a bug in cutlass, which was forked for TensorFlow for a reason I don't know; the bug was fixed upstream here: https://github.com/NVIDIA/cutlass/pull/1784/files
  • The old fork, done by @chsigg, was certainly done for a reason, and I have no idea what I'm breaking by going back to the NVIDIA main branch here. Not sure how to message people on GitHub, but maybe they'll get notified on this?
  • I updated NCCL to the latest 2.26.2 wheel
  • The build is still failing, but it takes WAY longer to fail now, which is possibly a good sign.
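
Quick sanity check for the first bullet, a sketch assuming the CUDA 12.8.x nvcc is on your PATH:

# The toolkit should report release 12.8 (12.8.1 per version.json above)
nvcc --version | grep release
# compute_120 / sm_120 should show up in the supported architecture lists
nvcc --list-gpu-arch | grep 120
nvcc --list-gpu-code | grep 120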

maludwig avatar Mar 17 '25 06:03 maludwig

Yep, I'm stopping for the night. It's currently stuck on what seem to be duplicate logging macros; it looks like two very, very similar logging libraries are somehow being included at the same time. But instead of taking 30 seconds before it fails, the build now takes 17 minutes to fail, which I define as progress!


external/com_google_absl/absl/log/check.h:122:9: warning: 'CHECK_LT' macro redefined [-Wmacro-redefined]
  122 | #define CHECK_LT(val1, val2) \
      |         ^
external/local_xla/xla/tsl/platform/default/logging.h:498:9: note: previous definition is here
  498 | #define CHECK_LT(val1, val2) CHECK_OP(Check_LT, <, val1, val2)
      |         ^
In file included from tensorflow/core/kernels/fill_empty_rows_functor_gpu.cu.cc:21:
In file included from ./tensorflow/core/common_runtime/gpu/gpu_event_mgr.h:21:
In file included from ./tensorflow/core/common_runtime/device/device_event_mgr.h:30:
In file included from ./tensorflow/core/platform/stream_executor.h:21:
In file included from external/local_xla/xla/stream_executor/dnn.h:47:
In file included from external/local_xla/xla/stream_executor/scratch_allocator.h:26:
In file included from external/local_xla/xla/stream_executor/device_memory_allocator.h:22:
external/com_google_absl/absl/log/check.h:124:9: warning: 'CHECK_GE' macro redefined [-Wmacro-redefined]
  124 | #define CHECK_GE(val1, val2) \
      |         ^
external/local_xla/xla/tsl/platform/default/logging.h:499:9: note: previous definition is here
  499 | #define CHECK_GE(val1, val2) CHECK_OP(Check_GE, >=, val1, val2)
      |         ^
In file included from tensorflow/core/kernels/fill_empty_rows_functor_gpu.cu.cc:21:
In file included from ./tensorflow/core/common_runtime/gpu/gpu_event_mgr.h:21:
In file included from ./tensorflow/core/common_runtime/device/device_event_mgr.h:30:
In file included from ./tensorflow/core/platform/stream_executor.h:21:
In file included from external/local_xla/xla/stream_executor/dnn.h:47:
In file included from external/local_xla/xla/stream_executor/scratch_allocator.h:26:
In file included from external/local_xla/xla/stream_executor/device_memory_allocator.h:22:

maludwig avatar Mar 17 '25 08:03 maludwig

VICTORY

Ok I didn't stop for the night. Instead, I just ignored all manner of warnings that shouldn't be ignored:

bazel build //tensorflow/tools/pip_package:wheel --repo_env=WHEEL_NAME=tensorflow --config=cuda --config=cuda_wheel  --copt=-Wno-gnu-offsetof-extensions --copt=-Wno-error --copt=-Wno-c23-extensions --verbose_failures --copt=-Wno-macro-redefined

And bam!

INFO: Found 1 target...
Target //tensorflow/tools/pip_package:wheel up-to-date:
  bazel-bin/tensorflow/tools/pip_package/wheel_house/tensorflow-2.20.0.dev0+selfbuilt-cp310-cp310-linux_x86_64.whl
INFO: Elapsed time: 87.690s, Critical Path: 86.67s
INFO: 2 processes: 1 internal, 1 local.
INFO: Build completed successfully, 2 total actions

No idea if it'll work, but it did build! I've pushed the latest code changes to my branch.

https://github.com/maludwig/tensorflow/compare/ml/fixing_tf_env...maludwig:tensorflow:ml/attempting_build_rtx5090?expand=1

maludwig avatar Mar 17 '25 10:03 maludwig

It passed one test!


(tfnightie) mitch@win11ml:~/stable_diff/fix_tf/tensorflow
$ bazel test --repo_env=WHEEL_NAME=tensorflow --config=cuda --config=cuda_wheel  --copt=-Wno-gnu-offsetof-extensions --copt=-Wno-error --copt=-Wno-c23-extensions --verbose_failures --copt=-Wno-macro-redefined tensorflow/python/kernel_tests/nn_ops:softmax_op_test
WARNING: The following configs were expanded more than once: [cuda_clang, cuda]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
INFO: Reading 'startup' options from /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --windows_enable_symlinks
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=243
INFO: Reading rc options for 'test' from /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc:
  Inherited 'common' options: --announce_rc --experimental_cc_shared_library --experimental_link_static_libraries_once=false --incompatible_enforce_config_setting_visibility --noenable_bzlmod --noincompatible_enable_cc_toolchain_resolution --noincompatible_enable_android_toolchain_resolution --experimental_repo_remote_exec --java_runtime_version=remotejdk_21
INFO: Reading rc options for 'test' from /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc:
  Inherited 'build' options: --repo_env=ML_WHEEL_TYPE=snapshot --repo_env=ML_WHEEL_BUILD_DATE= --repo_env=ML_WHEEL_VERSION_SUFFIX= --define framework_shared_object=true --define tsl_protobuf_header_only=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --features=-force_no_whole_archive --host_features=-force_no_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2
INFO: Reading rc options for 'test' from /home/mitch/stable_diff/fix_tf/tensorflow/.tf_configure.bazelrc:
  Inherited 'build' options: --action_env PYTHON_BIN_PATH=/home/mitch/.virtualenvs/tfnightie/bin/python --action_env PYTHON_LIB_PATH=/home/mitch/.virtualenvs/tfnightie/lib/python3.10/site-packages --python_path=/home/mitch/.virtualenvs/tfnightie/bin/python --action_env LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:/home/mitch/stable_diff/fix_tf/libs/cudnn-linux-x86_64-9.8.0.87_cuda12-archive/lib: --config=cuda_clang --action_env CLANG_CUDA_COMPILER_PATH=/home/mitch/stable_diff/fix_tf/llvm/LLVM-20.1.0-Linux-X64/bin/clang-20 --config=cuda_clang
INFO: Reading rc options for 'test' from /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc:
  'test' options: --test_env=GTEST_INSTALL_FAILURE_SIGNAL_HANDLER=1
INFO: Reading rc options for 'test' from /home/mitch/stable_diff/fix_tf/tensorflow/.tf_configure.bazelrc:
  'test' options: --test_size_filters=small,medium --test_env=LD_LIBRARY_PATH
INFO: Found applicable config definition build:short_logs in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition test:v2 in file /home/mitch/stable_diff/fix_tf/tensorflow/.tf_configure.bazelrc: --test_tag_filters=-benchmark-test,-no_oss,-oss_excluded,-no_gpu,-oss_serial,-v1only --build_tag_filters=-benchmark-test,-no_oss,-oss_excluded,-no_gpu,-v1only
INFO: Found applicable config definition build:cuda_clang in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --config=cuda --@local_config_cuda//:cuda_compiler=clang --copt=-Qunused-arguments --repo_env=HERMETIC_CUDA_COMPUTE_CAPABILITIES=sm_60,sm_70,sm_80,sm_89,compute_90 --copt=-Wno-unknown-cuda-version --host_linkopt=-fuse-ld=lld --host_linkopt=-lm --linkopt=-fuse-ld=lld --linkopt=-lm
INFO: Found applicable config definition build:cuda in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda --repo_env=HERMETIC_CUDA_VERSION=12.5.1 --repo_env=HERMETIC_CUDNN_VERSION=9.3.0 --@local_config_cuda//cuda:include_cuda_libs=true
INFO: Found applicable config definition build:cuda in file /home/mitch/stable_diff/fix_tf/tensorflow/.tf_configure.bazelrc: --repo_env HERMETIC_CUDA_VERSION=12.8.1 --repo_env HERMETIC_CUDNN_VERSION=9.8.0 --repo_env HERMETIC_CUDA_COMPUTE_CAPABILITIES=compute_120
INFO: Found applicable config definition build:cuda_clang in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --config=cuda --@local_config_cuda//:cuda_compiler=clang --copt=-Qunused-arguments --repo_env=HERMETIC_CUDA_COMPUTE_CAPABILITIES=sm_60,sm_70,sm_80,sm_89,compute_90 --copt=-Wno-unknown-cuda-version --host_linkopt=-fuse-ld=lld --host_linkopt=-lm --linkopt=-fuse-ld=lld --linkopt=-lm
INFO: Found applicable config definition build:cuda in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda --repo_env=HERMETIC_CUDA_VERSION=12.5.1 --repo_env=HERMETIC_CUDNN_VERSION=9.3.0 --@local_config_cuda//cuda:include_cuda_libs=true
INFO: Found applicable config definition build:cuda in file /home/mitch/stable_diff/fix_tf/tensorflow/.tf_configure.bazelrc: --repo_env HERMETIC_CUDA_VERSION=12.8.1 --repo_env HERMETIC_CUDNN_VERSION=9.8.0 --repo_env HERMETIC_CUDA_COMPUTE_CAPABILITIES=compute_120
INFO: Found applicable config definition build:cuda in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda --repo_env=HERMETIC_CUDA_VERSION=12.5.1 --repo_env=HERMETIC_CUDNN_VERSION=9.3.0 --@local_config_cuda//cuda:include_cuda_libs=true
INFO: Found applicable config definition build:cuda in file /home/mitch/stable_diff/fix_tf/tensorflow/.tf_configure.bazelrc: --repo_env HERMETIC_CUDA_VERSION=12.8.1 --repo_env HERMETIC_CUDNN_VERSION=9.8.0 --repo_env HERMETIC_CUDA_COMPUTE_CAPABILITIES=compute_120
INFO: Found applicable config definition build:cuda_wheel in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --@local_config_cuda//cuda:include_cuda_libs=false
INFO: Found applicable config definition build:linux in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --host_copt=-w --copt=-Wno-all --copt=-Wno-extra --copt=-Wno-deprecated --copt=-Wno-deprecated-declarations --copt=-Wno-ignored-attributes --copt=-Wno-array-bounds --copt=-Wunused-result --copt=-Werror=unused-result --copt=-Wswitch --copt=-Werror=switch --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
DEBUG: /home/mitch/.cache/bazel/_bazel_mitch/98f54844abcf3e1cdc99e9d96b271d9e/external/local_xla/third_party/py/python_repo.bzl:154:14:
HERMETIC_PYTHON_VERSION variable was not set correctly, using default version.
Python 3.10 will be used.
To select Python version, either set HERMETIC_PYTHON_VERSION env variable in
your shell:
  export HERMETIC_PYTHON_VERSION=3.12
OR pass it as an argument to bazel command directly or inside your .bazelrc
file:
  --repo_env=HERMETIC_PYTHON_VERSION=3.12
DEBUG: /home/mitch/.cache/bazel/_bazel_mitch/98f54844abcf3e1cdc99e9d96b271d9e/external/local_xla/third_party/py/python_repo.bzl:87:10:
=============================
Hermetic Python configuration:
Version: "3.10"
Kind: ""
Interpreter: "default" (provided by rules_python)
Requirements_lock label: "@python_version_repo//:requirements_lock_3_10.txt"
=====================================
WARNING: The following configs were expanded more than once: [cuda_clang, cuda]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
WARNING: Build options --@@local_config_cuda//cuda:include_cuda_libs, --copt, --cxxopt, and 2 more have changed, discarding analysis cache (this can be expensive, see https://bazel.build/advanced/performance/iteration-speed).
INFO: Analyzed 2 targets (749 packages loaded, 56015 targets configured).
INFO: Found 2 test targets...
INFO: Elapsed time: 270.116s, Critical Path: 245.71s
INFO: 2560 processes: 378 internal, 2182 local.
INFO: Build completed successfully, 2560 total actions
//tensorflow/python/kernel_tests/nn_ops:softmax_op_test_cpu              PASSED in 217.4s
//tensorflow/python/kernel_tests/nn_ops:softmax_op_test_gpu              PASSED in 218.4s

Executed 2 out of 2 tests: 2 tests pass.

I also installed the wheel generated in the last step into a new Python venv, and it worked!





(test5090build) mitch@win11ml:~/stable_diff/fix_tf/test5090build
$ python -c "import tensorflow as tf; print(tf.__version__)"
2025-03-17 04:37:51.455319: I external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1742207871.466384  646442 cuda_dnn.cc:8670] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
I0000 00:00:1742207871.469996  646442 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1742207871.479137  646442 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207871.479166  646442 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207871.479169  646442 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207871.479172  646442 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-03-17 04:37:51.481701: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

2.20.0-dev0+selfbuilt



(test5090build) mitch@win11ml:~/stable_diff/fix_tf/test5090build
$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2025-03-17 04:38:02.348770: I external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1742207882.360431  646471 cuda_dnn.cc:8670] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
I0000 00:00:1742207882.364089  646471 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1742207882.373383  646471 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207882.373422  646471 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207882.373426  646471 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207882.373437  646471 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-03-17 04:38:02.376028: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]



(test5090build) mitch@win11ml:~/stable_diff/fix_tf/test5090build
$ cat test_gpu.py
import tensorflow as tf

# Check if GPU is available
gpus = tf.config.list_physical_devices('GPU')
if not gpus:
    print("🚫 No GPU found!")
else:
    print(f"✅ Found GPU(s): {[gpu.name for gpu in gpus]}")

# Place operations on GPU
with tf.device('/GPU:0'):
    # Create two tensors
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

    # Add tensors
    add_result = tf.add(a, b)
    print("\nAddition result:")
    print(add_result)

    # Matrix multiplication
    matmul_result = tf.matmul(a, b)
    print("\nMatrix multiplication result:")
    print(matmul_result)

# Print device placement info (optional, debug)
print("\nDevice placement log:")
tf.debugging.set_log_device_placement(True)





(test5090build) mitch@win11ml:~/stable_diff/fix_tf/test5090build
$ python test_gpu.py
2025-03-17 04:38:25.409242: I external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1742207905.420314  646517 cuda_dnn.cc:8670] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
I0000 00:00:1742207905.423851  646517 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1742207905.432651  646517 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207905.432680  646517 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207905.432684  646517 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207905.432686  646517 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-03-17 04:38:25.435305: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

✅ Found GPU(s): ['/physical_device:GPU:0']
I0000 00:00:1742207906.790435  646517 gpu_device.cc:2018] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29043 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 5090, pci bus id: 0000:09:00.0, compute capability: 12.0

Addition result:
tf.Tensor(
[[ 6.  8.]
 [10. 12.]], shape=(2, 2), dtype=float32)

Matrix multiplication result:
tf.Tensor(
[[19. 22.]
 [43. 50.]], shape=(2, 2), dtype=float32)

Device placement log:

I...am...going...to...run all the tests overnight? My build process is complete trash and I have no idea what I'm doing. I COULD also PR this code, but, like, that's slightly terrifying. I've ignored probably thousands of warnings that a competent C++ developer could probably actually solve, rather than just ignore...

maludwig avatar Mar 17 '25 10:03 maludwig

Tests didn't pass, but it did build! And it could do basic matrix addition and multiplication in Python! NOW I'm definitely going to bed though.

maludwig avatar Mar 17 '25 11:03 maludwig

It's also able to do the classic "hello world" ML task of learning digits on MNIST, but the warnings are PLENTIFUL and cryptic. I don't know what they mean, but the final model happens to work great!


(test5090build) mitch@win11ml:~/stable_diff/fix_tf/test5090build
$ cat mnist_test.py
#!/usr/bin/env python

import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np

# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Normalize pixel values to [0,1]
x_train = x_train / 255.0
x_test = x_test / 255.0

# Build the model
model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),      # Flatten 28x28 to 784
    layers.Dense(128, activation='relu'),      # Hidden layer
    layers.Dense(10, activation='softmax')     # Output layer
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=5, validation_split=0.1)

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")

# Make predictions
predictions = model.predict(x_test)

# Example: Print prediction for the first image
print(f"First test sample - Predicted: {np.argmax(predictions[0])}, Actual: {y_test[0]}")





(test5090build) mitch@win11ml:~/stable_diff/fix_tf/test5090build
$ ./mnist_test.py
2025-03-17 11:23:11.786039: I external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1742232191.796888  662647 cuda_dnn.cc:8670] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
I0000 00:00:1742232191.800405  662647 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1742232191.809207  662647 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742232191.809234  662647 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742232191.809238  662647 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742232191.809259  662647 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-03-17 11:23:11.811904: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/home/mitch/.virtualenvs/test5090build/lib/python3.10/site-packages/keras/src/layers/reshaping/flatten.py:37: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)
I0000 00:00:1742232193.910442  662647 gpu_device.cc:2018] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29043 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 5090, pci bus id: 0000:09:00.0, compute capability: 12.0
Epoch 1/5
2025-03-17 11:23:15.366974: I external/local_xla/xla/service/service.cc:152] XLA service 0x7f5928008d30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2025-03-17 11:23:15.367006: I external/local_xla/xla/service/service.cc:160]   StreamExecutor device (0): NVIDIA GeForce RTX 5090, Compute Capability 12.0
2025-03-17 11:23:15.376904: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1742232195.434380  662725 cuda_dnn.cc:529] Loaded cuDNN version 90800
2025-03-17 11:23:16.211437: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 80 bytes spill stores, 80 bytes spill loads

2025-03-17 11:23:16.220559: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95_0', 164 bytes spill stores, 164 bytes spill loads

2025-03-17 11:23:16.340804: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 392 bytes spill stores, 392 bytes spill loads

2025-03-17 11:23:16.364181: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 76 bytes spill stores, 76 bytes spill loads

2025-03-17 11:23:16.374280: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_232', 176 bytes spill stores, 176 bytes spill loads

2025-03-17 11:23:16.385374: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 76 bytes spill stores, 76 bytes spill loads

2025-03-17 11:23:16.393417: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 292 bytes spill stores, 292 bytes spill loads

2025-03-17 11:23:16.451825: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 532 bytes spill stores, 532 bytes spill loads

2025-03-17 11:23:16.522600: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_232', 168 bytes spill stores, 168 bytes spill loads

2025-03-17 11:23:16.556430: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 1040 bytes spill stores, 1040 bytes spill loads

2025-03-17 11:23:16.607519: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_232', 112 bytes spill stores, 112 bytes spill loads

2025-03-17 11:23:16.806055: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 4920 bytes spill stores, 4992 bytes spill loads

2025-03-17 11:23:16.867917: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 5084 bytes spill stores, 5028 bytes spill loads

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1742232197.511568  662725 device_compiler.h:196] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
1684/1688 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.8767 - loss: 0.44642025-03-17 11:23:20.774706: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_232', 32 bytes spill stores, 32 bytes spill loads

2025-03-17 11:23:20.804309: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 288 bytes spill stores, 288 bytes spill loads

2025-03-17 11:23:20.807183: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 76 bytes spill stores, 76 bytes spill loads

2025-03-17 11:23:20.828074: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 76 bytes spill stores, 76 bytes spill loads

2025-03-17 11:23:20.895561: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 752 bytes spill stores, 752 bytes spill loads

2025-03-17 11:23:21.076717: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_232', 80 bytes spill stores, 80 bytes spill loads

2025-03-17 11:23:21.096547: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_232', 72 bytes spill stores, 72 bytes spill loads

2025-03-17 11:23:21.177785: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 4920 bytes spill stores, 4992 bytes spill loads

2025-03-17 11:23:21.227351: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 5084 bytes spill stores, 5028 bytes spill loads

1688/1688 ━━━━━━━━━━━━━━━━━━━━ 8s 3ms/step - accuracy: 0.8768 - loss: 0.4459 - val_accuracy: 0.9668 - val_loss: 0.1275
Epoch 2/5
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9598 - loss: 0.1359 - val_accuracy: 0.9710 - val_loss: 0.0985
Epoch 3/5
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9748 - loss: 0.0853 - val_accuracy: 0.9728 - val_loss: 0.0920
Epoch 4/5
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9810 - loss: 0.0640 - val_accuracy: 0.9775 - val_loss: 0.0809
Epoch 5/5
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9855 - loss: 0.0473 - val_accuracy: 0.9782 - val_loss: 0.0797
313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9720 - loss: 0.0911
Test accuracy: 0.9753
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step
First test sample - Predicted: 7, Actual: 7

maludwig avatar Mar 17 '25 17:03 maludwig

@maludwig you should upgrade the XLA commit hash and sha256 in the Bazel file and it should work

johnnynunez avatar Mar 19 '25 08:03 johnnynunez

Sorry that's a bit cryptic for me. I'm normally a Python dev, apologies. Did you mean in my commits on my branch above?

https://github.com/maludwig/tensorflow/compare/ml/fixing_tf_env...maludwig:tensorflow:ml/attempting_build_rtx5090?expand=1

maludwig avatar Mar 19 '25 08:03 maludwig

+1

mnjm avatar Mar 20 '25 02:03 mnjm

Steps to get it running on your RTX 5000 series card

Guide for all platforms

Install LLVM 20.1.0

LLVM 20.1.0 is required to compile code for compute capability 10.0 and 12.0 (RTX 5000 series).

All platforms here:

https://github.com/llvm/llvm-project/releases/tag/llvmorg-20.1.0
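
For Linux x64, a minimal sketch of the install (these are the same download and extract steps used in the WSL script further down; adjust paths to wherever you keep toolchains):

# Grab the prebuilt LLVM 20.1.0 release and put clang on your PATH
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-20.1.0/LLVM-20.1.0-Linux-X64.tar.xz
tar -xvf LLVM-20.1.0-Linux-X64.tar.xz
export PATH="$PWD/LLVM-20.1.0-Linux-X64/bin:$PATH"
clang --version  # should report clang version 20.1.0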

Install CUDA 12.8.1

CUDA 12.8.1 is required to compile code for compute capability 10.0 and 12.0 (RTX 5000 series).

Also install cuDNN 9.8.0 and NCCL 2, for CUDA 12.
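
On Ubuntu 22.04 / WSL2, this is roughly what I used via NVIDIA's apt repo (the same commands appear in the WSL script further down); on other platforms, grab the equivalent installers from NVIDIA:

# NVIDIA apt keyring, then the CUDA 12.8 toolkit, cuDNN 9 for CUDA 12, and NCCL 2
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8 cudnn9-cuda-12
sudo apt-get -y install libnccl2=2.26.2-1+cuda12.8 libnccl-dev=2.26.2-1+cuda12.8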

Install Python 3.10.12

This just happens to be the version I'm using and may be completely unnecessary. I personally love pyenv because it installs Python under your local user, so you don't need to fret about admin/root permissions.

Make a Python venv for tensorflow

This will prevent your system from being polluted by tensorflow dependencies, and will make it much much much easier to clean up if you want to start over.

Install Bazelisk

Bazelisk is a wrapper for Bazel that downloads the correct version of Bazel for the project.

Clone tensorflow

echo "Clone tensorflow"
git clone git@github.com:tensorflow/tensorflow.git
cd tensorflow
echo "Add my remote to the repo"
git remote add maludwig 'git@github.com:maludwig/tensorflow.git'
echo "Fetch my remote"
git fetch --all
echo "Checkout my branch"
git checkout ml/attempting_build_rtx5090
echo "Pull my branch"
git pull maludwig ml/attempting_build_rtx5090

Configure bazel


echo "Configure bazel, these are the settings I used, but I'm not sure if they're correct, or if they just happened to work for me."
export HERMETIC_CUDA_VERSION=12.8.1
export HERMETIC_CUDNN_VERSION=9.8.0
export HERMETIC_CUDA_COMPUTE_CAPABILITIES=compute_120
export LOCAL_CUDA_PATH=/usr/local/cuda-12.8
export LOCAL_NCCL_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2.26.2
export TF_NEED_CUDA=1
export CLANG_CUDA_COMPILER_PATH="$(which clang)"
python configure.py
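
After configure.py runs, it's worth a quick check that the hermetic CUDA settings actually landed in .tf_configure.bazelrc (mine ended up with HERMETIC_CUDA_VERSION=12.8.1, HERMETIC_CUDNN_VERSION=9.8.0, and compute_120, as in the bazel test log earlier in this thread):

# Sanity check: the hermetic CUDA/cuDNN versions and the clang path should show up here
grep -E "HERMETIC_CUDA|HERMETIC_CUDNN|CLANG_CUDA" .tf_configure.bazelrc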

Build tensorflow

echo "Good luck building!"
echo "Note, I have trust issues with bazel now, so I always run 'bazel clean --expunge' before building. This may be a personal psychological issue rather than a requirement."
bazel build //tensorflow/tools/pip_package:wheel --repo_env=WHEEL_NAME=tensorflow --config=cuda --config=cuda_wheel --copt=-Wno-gnu-offsetof-extensions --copt=-Wno-error --copt=-Wno-c23-extensions --verbose_failures --copt=-Wno-macro-redefined
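
If the build finishes, the wheel lands under bazel-bin (mine was named tensorflow-2.20.0.dev0+selfbuilt-cp310-cp310-linux_x86_64.whl); a sketch of how I install and smoke-test it, assuming you're still in the tensorflow checkout with your venv active:

# Install the freshly built wheel and check that the 5090 shows up as a GPU device
pip install bazel-bin/tensorflow/tools/pip_package/wheel_house/tensorflow-2.20.0.dev0+selfbuilt-cp310-cp310-linux_x86_64.whl
python -c "import tensorflow as tf; print(tf.__version__, tf.config.list_physical_devices('GPU'))"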

Script for WSL Ubuntu 22.04

This script should let you compile for RTX 5000 series on WSL Ubuntu 22.04.

Before running this script, be sure to install the latest drivers for your RTX 5000 series card on the Windows side, install WSL2, and use Ubuntu 22.04. Then reboot your PC so that WSL2 will be able to see your GPU.

It probably also works on non-WSL Ubuntu 22.04.

It might maybe work on other Ubuntu versions.

It's not going to work for Windows except in WSL.

It may not work at all. Consider copying it line by line and handling errors manually.

mkdir -p "$HOME/rtx5000"
cd "$HOME/rtx5000"

echo "Installing essential dev tools"
sudo apt-get update
sudo apt-get install -y build-essential wget patchelf

echo "Installing Python 3.10"
sudo apt install -y make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev
curl https://pyenv.run | bash
pyenv install 3.10.12
pyenv global 3.10.12
echo "Restart your shell to use Python 3.10"

echo "After restarting, confirm this says python 3.10.12"
python --version

echo "Make a virtualenv for tensorflow"
python3.10 -m venv ~/rtx5000/venv
echo "Activate the python virtualenv"
source ~/rtx5000/venv/bin/activate

echo "Installing LLVM 20.1.0"
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-20.1.0/LLVM-20.1.0-Linux-X64.tar.xz
tar -xvf LLVM-20.1.0-Linux-X64.tar.xz

echo "Installing NVIDIA packages"
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
echo "Installing NVIDIA CUDA 12.8"
sudo apt-get -y install cuda-toolkit-12-8
echo "Installing NVIDIA cuDNN 9, for CUDA 12"
sudo apt-get -y install cudnn9-cuda-12
echo "Installing NVIDIA NCCL 2"
sudo apt install libnccl2=2.26.2-1+cuda12.8 libnccl-dev=2.26.2-1+cuda12.8

echo "Installing Bazelisk for Bazel"
mkdir -p ~/rtx5000/bin
cd ~/rtx5000/bin
wget 'https://github.com/bazelbuild/bazelisk/releases/download/v1.25.0/bazelisk-linux-amd64'
chmod +x bazelisk-linux-amd64
mv bazelisk-linux-amd64 bazel

Add these lines to your ~/.bashrc or ~/.zshrc file:

export LLVM_HOME="$HOME/rtx5000/LLVM-20.1.0-Linux-X64"
export CUDA_HOME="/usr/local/cuda-12.8"
export PATH="${LLVM_HOME}/bin:${CUDA_HOME}/bin:${HOME}/rtx5000/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
export CPATH="$CUDA_HOME/include:$CPATH"

Restart your terminal.

Test that the LLVM installation worked:

Make this file in ~/rtx5000/card_details.cu:


#include <cuda_runtime.h>
#include <cudnn.h>  // Add cuDNN header
#include <iostream>

int main() {
    cudaDeviceProp prop;
    int device;

    cudaGetDevice(&device); // Get the current device ID
    cudaGetDeviceProperties(&prop, device); // Get device properties

    size_t free_mem, total_mem;
    cudaMemGetInfo(&free_mem, &total_mem); // Get VRAM usage

    std::cout << "> GPU Name: " << prop.name << std::endl;
    std::cout << "> Compute Capability: " << prop.major << "." << prop.minor << std::endl;
    std::cout << "> VRAM Usage: " << (total_mem - free_mem) / (1024 * 1024) << " MB / " << total_mem / (1024 * 1024) << " MB" << std::endl;

    // Print cuDNN version
    std::cout << "> cuDNN Version: "
              << CUDNN_MAJOR << "."
              << CUDNN_MINOR << "."
              << CUDNN_PATCHLEVEL
              << std::endl;

    return 0;
}

Check compilers

echo "This should be LLVM 20.1.0"
which clang
clang --version

echo "This should be CUDA 12.8"
which nvcc
nvcc --version

echo "This might be a recursive symlink, in which case, it should be fixed"
if [[ -L /usr/local/cuda-12.8/lib/lib64 ]]; then
  echo 'RECURSIVE SYMLINK FOUND, REINSTALL CUDA 12.8.1
     You could try:
       sudo rm -r /usr/local/cuda-12.8/lib
       sudo ln -s /usr/local/cuda-12.8/lib64 /usr/local/cuda-12.8/lib
  '
fi

if [[ -f /usr/local/cuda-12.8/lib64/libcudart_static.a ]]; then
  echo Found cudart libs
else
  echo Installing CUDA libs
  sudo apt-get install --reinstall cuda-cudart-dev-12-8
fi

APT_PACKAGES="$(apt --installed list)"
CUDA_PACKAGE_LIST=(
    cuda-cccl-12-8
    cuda-command-line-tools-12-8
    cuda-compiler-12-8
    cuda-crt-12-8
    cuda-cudart-12-8
    cuda-cudart-dev-12-8
    cuda-cuobjdump-12-8
    cuda-cupti-12-8
    cuda-cupti-dev-12-8
    cuda-cuxxfilt-12-8
    cuda-documentation-12-8
    cuda-driver-dev-12-8
    cuda-gdb-12-8
    cuda-libraries-12-8
    cuda-libraries-dev-12-8
    cuda-nsight-12-8
    cuda-nsight-compute-12-8
    cuda-nsight-systems-12-8
    cuda-nvcc-12-8
    cuda-nvdisasm-12-8
    cuda-nvml-dev-12-8
    cuda-nvprof-12-8
    cuda-nvprune-12-8
    cuda-nvrtc-12-8
    cuda-nvrtc-dev-12-8
    cuda-nvtx-12-8
    cuda-nvvm-12-8
    cuda-nvvp-12-8
    cuda-opencl-12-8
    cuda-opencl-dev-12-8
    cuda-profiler-api-12-8
    cuda-sanitizer-12-8
    cuda-toolkit-12-8
    cuda-tools-12-8
    cuda-visual-tools-12-8
    cudnn9-cuda-12-8
)
echo "Make sure you have all the CUDA packages for CUDA 12.8"
for CUDA_PACKAGE in "${CUDA_PACKAGE_LIST[@]}"; do
  if echo "$APT_PACKAGES" | grep "${CUDA_PACKAGE}"; then
    echo "Found: $CUDA_PACKAGE"
  else
    echo "MISSING CUDA PACKAGE: ${CUDA_PACKAGE}"
    break
  fi
done

echo "This should compile the code with nvcc"
cd ~/rtx5000
nvcc -o card_details_nvcc card_details.cu

echo "This should print your card details"
./card_details_nvcc

> GPU Name: NVIDIA GeForce RTX 5090
> Compute Capability: 12.0
> VRAM Usage: 1763 MB / 32606 MB
> cuDNN Version: 9.8.0


echo "This should compile the code with clang++"
clang++ -std=c++17 --cuda-gpu-arch=sm_120 -x cuda --cuda-path="$CUDA_HOME" -I"$CUDA_HOME/include" -L"$CUDA_HOME/lib64"  -lcudart card_details.cu -o card_details_clang

echo "This should print your card details again, just the same as before"
./card_details_clang

> GPU Name: NVIDIA GeForce RTX 5090
> Compute Capability: 12.0
> VRAM Usage: 1763 MB / 32606 MB
> cuDNN Version: 9.8.0

echo "This should be Bazel v8.8.1"
bazel --version

echo "Activate the python virtualenv"
source ~/rtx5000/venv/bin/activate
echo "This should be Python 3.10.12"
python --version

echo "Clone tensorflow"
cd ~/rtx5000
git clone git@github.com:tensorflow/tensorflow.git
cd tensorflow

echo "Add my remote to the repo"
git remote add maludwig 'git@github.com:maludwig/tensorflow.git'
echo "Fetch my remote"
git fetch --all
echo "Checkout my branch"
git checkout ml/attempting_build_rtx5090
echo "Pull my branch"
git pull maludwig ml/attempting_build_rtx5090

echo "Configure bazel, these are the settings I used, but I'm not sure if they're correct, or if they just happened to work for me."
export HERMETIC_CUDA_VERSION=12.8.1
export HERMETIC_CUDNN_VERSION=9.8.0
export HERMETIC_CUDA_COMPUTE_CAPABILITIES=compute_120
export LOCAL_CUDA_PATH=/usr/local/cuda-12.8
export LOCAL_NCCL_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2.26.2
export TF_NEED_CUDA=1
export CLANG_CUDA_COMPILER_PATH="$(which clang)"
python configure.py
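
Before building, a quick sanity check that the hermetic CUDA settings are still exported in this shell (this just re-prints the variables set above, nothing TensorFlow-specific):

env | grep -E 'HERMETIC_|LOCAL_CUDA_PATH|LOCAL_NCCL_PATH|TF_NEED_CUDA|CLANG_CUDA_COMPILER_PATH'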

echo "Good luck building!"
echo "Note, I have trust issues with bazel now, so I always run 'bazel clean --expunge' before building. This may be a personal psychological issue rather than a requirement."
bazel build //tensorflow/tools/pip_package:wheel --repo_env=WHEEL_NAME=tensorflow --config=cuda --config=cuda_wheel  --copt=-Wno-gnu-offsetof-extensions --copt=-Wno-error --copt=-Wno-c23-extensions --verbose_failures --copt=-Wno-macro-redefined
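
If the build finishes, the wheel should land under the wheel_house path that Bazel reports; the directory and filename below are assumptions based on the build output, so adjust the glob if your wheel ends up elsewhere:

# Install the freshly built wheel into the active virtualenv.
pip install --force-reinstall bazel-bin/tensorflow/tools/pip_package/wheel_house/tensorflow-*.whl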

NOTE

You mayyyybe need to get the very latest cuDNN with this, but I don't think so.

cd ~/rtx5000
wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.8.0.87_cuda12-archive.tar.xz
tar -xvf cudnn-linux-x86_64-9.8.0.87_cuda12-archive.tar.xz

echo "Add this to your ~/.bashrc:"
export LD_LIBRARY_PATH="$HOME/rtx5000/cudnn-linux-x86_64-9.8.0.87_cuda12-archive/lib:$LD_LIBRARY_PATH"
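
A quick sanity check, once ~/.bashrc has been re-sourced, that the extracted cuDNN is actually on the loader path (the directory layout is assumed from the tarball name above):

echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep cudnn
ls "$HOME/rtx5000/cudnn-linux-x86_64-9.8.0.87_cuda12-archive/lib/" | grep -i libcudnn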

NOTE: If this doesn't work for you, let me know which error you got, and maybe I missed something in my environment. Since this was already my dev box, I'm not sure if this is a complete guide, but it's what I did to get it working.

maludwig avatar Mar 21 '25 01:03 maludwig

Hey @Venkat6871, just seeing the tags you added. To be clear, this is on tf_nightly, not tf 2.18, and I really have no idea what I'm doing, so I'm not going to PR my extremely busted, tests-failing branch, even though it does build. I put it here so that someone who knows what they're doing can fold in the new stuff more easily, or so that other normal humans like me can run tensorflow on an RTX 5000 instead of not being able to run it at all. An actual human who knows what they're doing should look this over and figure it out.

maludwig avatar Mar 22 '25 07:03 maludwig

cd ~/rtx5000
nvcc -o card_details_nvcc card_details.cu

-bash: cd: /home/nicolai/rtx5000: No such file or directory
cc1plus: fatal error: card_details.cu: No such file or directory
compilation terminated.

Nebolon avatar Mar 22 '25 08:03 Nebolon

I tried to run it on my WSL.

The build doesn't work; how can I use a prebuilt nightly build?

Configuration: 8850a00e136a9e8be32c557a177e77f38f3c27b70c44518acb5ba0af47f7836b

Execution platform: @@local_execution_config_platform//:platform

In file included from external/local_xla/xla/stream_executor/cuda/cuda_status.cc:16:
external/local_xla/xla/stream_executor/cuda/cuda_status.h:22:10: fatal error: 'third_party/gpus/cuda/include/cuda.h' file not found
   22 | #include "third_party/gpus/cuda/include/cuda.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Target //tensorflow/tools/pip_package:wheel failed to build
ERROR: /mnt/c/Projekte/tmp/tensorflow/tensorflow/tools/pip_package/BUILD:293:9 Action tensorflow/tools/pip_package/wheel_house/tensorflow-2.20.0.dev0+selfbuilt-cp312-cp312-linux_x86_64.whl failed: (Exit 1): clang-20 failed: error executing CppCompile command (from target @@local_xla//xla/stream_executor/cuda:cuda_status) (cd /root/.cache/bazel/_bazel_root/509ab554767d44265e0030c4731aba07/execroot/org_tensorflow &&
exec env -
CLANG_CUDA_COMPILER_PATH=/usr/local/bin/clang-20
PATH=/root/.cache/bazelisk/downloads/sha256/c97f02133adce63f0c28678ac1f21d65fa8255c80429b588aeeba8a1fac6202b/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
PWD=/proc/self/cwd
PYTHON_BIN_PATH=/mnt/c/Projekte/env/bin/python3
PYTHON_LIB_PATH=/mnt/c/Projekte/env/lib/python3.12/site-packages
TF2_BEHAVIOR=1
/usr/local/bin/clang-20 -MD -MF bazel-out/k8-opt/bin/external/local_xla/xla/stream_executor/cuda/_objs/cuda_status/cuda_status.pic.d '-frandom-seed=bazel-out/k8-opt/bin/external/local_xla/xla/stream_executor/cuda/_objs/cuda_status/cuda_status.pic.o' -iquote external/local_xla -iquote bazel-out/k8-opt/bin/external/local_xla -iquote external/com_google_absl -iquote bazel-out/k8-opt/bin/external/com_google_absl -iquote external/local_config_cuda -iquote bazel-out/k8-opt/bin/external/local_config_cuda -iquote external/cuda_cudart -iquote bazel-out/k8-opt/bin/external/cuda_cudart -iquote external/cuda_cublas -iquote bazel-out/k8-opt/bin/external/cuda_cublas -iquote external/cuda_cccl -iquote bazel-out/k8-opt/bin/external/cuda_cccl -iquote external/cuda_nvtx -iquote bazel-out/k8-opt/bin/external/cuda_nvtx -iquote external/cuda_nvcc -iquote bazel-out/k8-opt/bin/external/cuda_nvcc -iquote external/cuda_cusolver -iquote bazel-out/k8-opt/bin/external/cuda_cusolver -iquote external/cuda_cufft -iquote bazel-out/k8-opt/bin/external/cuda_cufft -iquote external/cuda_cusparse -iquote bazel-out/k8-opt/bin/external/cuda_cusparse -iquote external/cuda_curand -iquote bazel-out/k8-opt/bin/external/cuda_curand -iquote external/cuda_cupti -iquote bazel-out/k8-opt/bin/external/cuda_cupti -iquote external/cuda_nvml -iquote bazel-out/k8-opt/bin/external/cuda_nvml -iquote external/cuda_nvjitlink -iquote bazel-out/k8-opt/bin/external/cuda_nvjitlink -iquote external/local_tsl -iquote bazel-out/k8-opt/bin/external/local_tsl -Ibazel-out/k8-opt/bin/external/local_config_cuda/cuda/_virtual_includes/cuda_headers -Ibazel-out/k8-opt/bin/external/cuda_cudart/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_cublas/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_cccl/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_nvtx/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_nvcc/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_cusolver/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_cufft/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_cusparse/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_curand/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_cupti/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_nvml/_virtual_includes/headers -Iba…
root@DESKTOP-199P461:/mnt/c/Projekte/tmp/tensorflow#

Nebolon avatar Mar 22 '25 09:03 Nebolon

@maludwig and @Venkat6871 is there a build that I can use (like a nightly build)?

Nebolon avatar Mar 22 '25 09:03 Nebolon

Hey @Nebolon, scroll up until you see "Script for WSL Ubuntu 22.04" in the comments.

The issue I raised is that there is no build, nightly or otherwise, that supports the latest Blackwell GPUs. I arguably managed to build one myself, and you could too. But read through the script I put up above slowly; it looks like you missed some steps. HOPEFULLY the script I wrote will work for someone else, but since I got it working on an old dev box rather than a fresh blank docker container or something, it's likely that I missed a dependency or two.

maludwig avatar Mar 22 '25 10:03 maludwig

@maludwig it doesn't work for me. At the step echo "This should be Bazel v8.8.1" / bazel --version I get only 8.1.1, and I get some errors during the build.

Is there any chance that tensorflow will support the 5090 on its own, so that I can simply use the next version of tensorflow?

If so, please give me a date.

Nebolon avatar Mar 22 '25 11:03 Nebolon

Yes, I have this problem with an NVIDIA RTX 5090 32GB Blackwell too (not with the nightly version of PyTorch). TensorFlow cannot see the GPU. Can you take a look at https://gist.github.com/donhuvy/6cd637a09b034168d01181d5ce98a5fe ? I get Num GPUs Available: 0. My environment: Windows 11 Pro, latest JupyterLab, Python 3.11.x.

donhuvy avatar Mar 24 '25 02:03 donhuvy

Wait, so you got it to work 100%? It sucks that my system has two 5090s and I'm using the CPU for training.

Renardglenn avatar Mar 27 '25 02:03 Renardglenn

Whenever I couldn't get drivers to work, it usually resolved itself after installing, reinstalling, and changing versions of different packages. Given the shortage, I doubt there is overwhelming support for the 5090 yet; I remember every launch having crashes and minor bugs that disappear over a relatively short period of time.

I got stuck similarly a while ago on different cards; in general, it might be a tiny thing somewhere in your paths and env.

Try Linux and see if that works; I don't know why you are using Windows as a senior dev. On my 4090, render times for anything AI/ML-related halved, and the loading times of nearly everything Python vanished.

BrechtCorbeel avatar Mar 27 '25 07:03 BrechtCorbeel

@maludwig it doesn't work for me. At the step echo "This should be Bazel v8.8.1" / bazel --version I get only 8.1.1, and I get some errors during the build.

Is there any chance that tensorflow will support the 5090 on its own, so that I can simply use the next version of tensorflow?

If so, please give me a date.

Sorry @Nebolon, I'm not a tensorflow employee, I'm just some dude, so I can't guess when it will be fixed. I just got my build to work and my personal projects running fine. My tests are failing, and I assume that needs resolving.

If your Bazel version is wrong, try installing Bazelisk. See above for instructions.

maludwig avatar Mar 27 '25 19:03 maludwig
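
For the Bazel version mismatch mentioned above, a hedged sketch of pinning the version through Bazelisk (USE_BAZEL_VERSION and .bazelversion are standard Bazelisk mechanisms; use whatever version your tensorflow checkout asks for):

cd ~/rtx5000/tensorflow
cat .bazelversion                               # the version the repo expects
export USE_BAZEL_VERSION="$(cat .bazelversion)"
bazel --version                                 # bazelisk downloads and runs exactly that version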

Wait, so you got it to work 100%? It sucks that my system has two 5090s and I'm using the CPU for training.

Yep. For my workflow (training StableDiffusion LoRAs) it works fine. The tests are failing locally, but they must be testing tensorflow components that I am not using.

You could presumably try following in my footsteps and use your 5090s.

maludwig avatar Mar 27 '25 19:03 maludwig

Try Linux and see if that works; I don't know why you are using Windows as a senior dev. On my 4090, render times for anything AI/ML-related halved, and the loading times of nearly everything Python vanished.

I'm also a senior dev, and while I agree in general that Linux is better and faster, Windows is still a perfectly legit OS. In fact, Apple Silicon is quite nice for training too: there's no distinction between RAM and VRAM on arm64, so huge models run on consumer hardware. Not nearly as fast as on nvidia, but OSX is a legit OS too.

maludwig avatar Mar 27 '25 19:03 maludwig

@maludwig I tried following the WSL script and got this error in the final step: "external/local_tsl/tsl/profiler/lib/nvtx_utils.cc:32:10: fatal error: 'third_party/gpus/cuda/include/cuda.h' file not found". All previous steps were OK, such as the one building the .cu file using clang.

jianingchen avatar Mar 29 '25 03:03 jianingchen

@jianingchen

What's your HERMETIC_CUDA_VERSION? It should be 12.8.1

Apart from that, maybe try cleaning the Bazel cache?

# Double check CUDA
echo "HERMETIC_CUDA_VERSION: $HERMETIC_CUDA_VERSION"

# I have trust issues with every cache thing
bazel clean --expunge

maludwig avatar Mar 29 '25 03:03 maludwig

@maludwig It is set to 12.8.1 correctly. I also tried cleaning the Bazel cache; the error was the same: header files in 'third_party/gpus/cuda/include' cannot be found.

jianingchen avatar Mar 29 '25 13:03 jianingchen

@maludwig Some additional info: among the verbose error output it displayed some environment variables such as "LD_LIBRARY_PATH", "PATH", etc., but no "CPATH" can be seen. Could this be related to the issue?

jianingchen avatar Mar 29 '25 14:03 jianingchen
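
One hedged way to narrow this down is to check whether the CUDA headers that Bazel is supposed to provide actually exist in its external repositories (the cache location below assumes Bazel's default output root under ~/.cache/bazel):

# Look for any cuda.h fetched or staged by Bazel; no output suggests the CUDA repos were never populated.
find ~/.cache/bazel -path '*cuda*/include/cuda.h' 2>/dev/null | head -n 5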

@jianingchen find nvtx_utils.cc and try to change it to #include "cuda.h"

@maludwig I tried following the WSL script and got this error in the final step: "external/local_tsl/tsl/profiler/lib/nvtx_utils.cc:32:10: fatal error: 'third_party/gpus/cuda/include/cuda.h' file not found". All previous steps were OK, such as the one building the .cu file using clang.

isbogdanov avatar Mar 31 '25 17:03 isbogdanov
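
A sketch of that suggested workaround; the in-tree location of nvtx_utils.cc is not guaranteed (the error path external/local_tsl/... is the Bazel-staged copy), so locate the file in your checkout first and review the change before rebuilding:

cd ~/rtx5000/tensorflow
# Find the source file(s) that include the header the error complains about.
NVTX_SRC="$(grep -rl --include='nvtx_utils.cc' 'third_party/gpus/cuda/include/cuda.h' .)"
echo "Patching: $NVTX_SRC"    # make sure this is non-empty before running sed
sed -i 's|#include "third_party/gpus/cuda/include/cuda.h"|#include "cuda.h"|' $NVTX_SRC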