TensorFlow on RTX 5090
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
binary
TensorFlow version
2.20.0.dev20250314
Custom code
No
OS platform and distribution
Windows 11 - WSL2 - Ubuntu 22.04.5 LTS
Mobile device
No response
Python version
3.10.12
Bazel version
7.4.1
GCC/compiler version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
CUDA/cuDNN version
CUDA Version: 12.8
GPU model and memory
RTX 5090 32GB
Current behavior?
I had hoped that TensorFlow would work on the RTX 5090 at all; sadly, it does not. Building from source didn't work, and neither did running the environment script. Since bash is my primary programming language, I was at least able to tidy that script up here:
https://github.com/tensorflow/tensorflow/pull/89271
But I wasn't able to get TensorFlow itself running. I hit a similar issue with PyTorch, which needed CUDA 12.8.* to work on the Blackwell cards, but no dice with the TensorFlow nightly build. Below is my test script and its output, and under that is the tf_env.txt from my patched script.
It may be helpful to know that nvidia themselves seem to have it running here:
https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/rel-25-02.html
But I get the same errors this Reddit user does when I try it out:
https://www.reddit.com/r/tensorflow/comments/1iutjoj/tensorflow_2501_cuda_128_rtx_5090_on_wsl2_cuda/
This conversation was another one I found that may be helpful; according to the participants, you need CUDA 12.8.1 support to support Blackwell (a.k.a. the RTX 50## series cards):
https://discuss.ai.google.dev/t/building-tensorflow-from-source-for-rtx5000-gpu-series/65171/15
(tfnightie) mitch@win11ml:~/stable_diff
$ cat tfnightie/test_2.py
import tensorflow as tf
import time
# Check if TensorFlow sees the GPU
print("TensorFlow version:", tf.__version__)
print("Available GPUs:", tf.config.experimental.list_physical_devices('GPU'))
# Matrix multiplication test
shape = (5000, 5000)
a = tf.random.normal(shape)
b = tf.random.normal(shape)
# Time execution on GPU
with tf.device('/GPU:0'):
    print("Running on GPU...")
    start_time = time.time()
    c = tf.matmul(a, b)
    tf.print("Matrix multiplication (GPU) done.")
    print("Execution time (GPU):", time.time() - start_time, "seconds")
# Time execution on CPU for comparison
with tf.device('/CPU:0'):
    print("Running on CPU...")
    start_time = time.time()
    c = tf.matmul(a, b)
    tf.print("Matrix multiplication (CPU) done.")
    print("Execution time (CPU):", time.time() - start_time, "seconds")
(tfnightie) mitch@win11ml:~/stable_diff
$ python tfnightie/test_2.py
2025-03-14 21:35:33.400099: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
TensorFlow version: 2.20.0-dev20250314
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1742009735.413544 326199 gpu_device.cc:2429] TensorFlow was not built with CUDA kernel binaries compatible with compute capability 12.0. CUDA kernels will be jit-compiled from PTX, which could take 30 minutes or longer.
Available GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
W0000 00:00:1742009735.417720 326199 gpu_device.cc:2429] TensorFlow was not built with CUDA kernel binaries compatible with compute capability 12.0. CUDA kernels will be jit-compiled from PTX, which could take 30 minutes or longer.
I0000 00:00:1742009735.572153 326199 gpu_device.cc:2018] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29043 MB memory: -> device: 0, name: NVIDIA GeForce RTX 5090, pci bus id: 0000:09:00.0, compute capability: 12.0
2025-03-14 21:35:36.969440: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_INVALID_PTX'
2025-03-14 21:35:36.969480: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleGetFunction(&function, module, kernel_name)' failed with 'CUDA_ERROR_INVALID_HANDLE'
2025-03-14 21:35:36.969505: W tensorflow/core/framework/op_kernel.cc:1843] INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
2025-03-14 21:35:36.969533: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
Traceback (most recent call last):
  File "/home/mitch/stable_diff/tfnightie/test_2.py", line 10, in <module>
    a = tf.random.normal(shape)
  File "/home/mitch/.virtualenvs/tfnightie/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/mitch/.virtualenvs/tfnightie/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 6027, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InternalError: {{function_node __wrapped__Mul_device_/job:localhost/replica:0/task:0/device:GPU:0}} 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE' [Op:Mul] name:
Also, while NVIDIA's site says that the compute capability of the RTX 5090 is "10.0", the card itself reports "12.0". I'm not sure that info will be helpful, but it threw me for a loop:
$ cat <<EOF > card_details.cu
#include <cuda_runtime.h>
#include <iostream>
int main() {
    cudaDeviceProp prop;
    int device;
    cudaGetDevice(&device);                 // Get the current device ID
    cudaGetDeviceProperties(&prop, device); // Get device properties
    size_t free_mem, total_mem;
    cudaMemGetInfo(&free_mem, &total_mem);  // Get VRAM usage
    std::cout << "GPU Name: " << prop.name << std::endl;
    std::cout << "Compute Capability: " << prop.major << "." << prop.minor << std::endl;
    std::cout << "VRAM Usage: " << (total_mem - free_mem) / (1024 * 1024) << " MB / " << total_mem / (1024 * 1024) << " MB" << std::endl;
    return 0;
}
EOF
$ nvcc card_details.cu -o card_details && ./card_details
nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
GPU Name: NVIDIA GeForce RTX 5090
Compute Capability: 12.0
VRAM Usage: 1763 MB / 32606 MB
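As an aside, that (major, minor) pair is exactly what a build's CUDA architecture targets have to cover. Here's a minimal, illustrative sketch of the naming convention (cc_to_arch is my own helper, not TensorFlow's build logic):

```python
# Illustrative only: how a compute capability pair maps to the arch target
# strings CUDA toolchains use (e.g. the values seen in
# HERMETIC_CUDA_COMPUTE_CAPABILITIES).

def cc_to_arch(major: int, minor: int, kind: str = "sm") -> str:
    """Format a compute capability as an 'sm_XY' (cubin) or 'compute_XY' (PTX) target."""
    if kind not in ("sm", "compute"):
        raise ValueError("kind must be 'sm' or 'compute'")
    return f"{kind}_{major}{minor}"

# The RTX 5090 reports compute capability 12.0; a build that ships real kernel
# binaries for it needs sm_120, while compute_120 covers the PTX JIT path.
print(cc_to_arch(12, 0))             # sm_120
print(cc_to_arch(12, 0, "compute"))  # compute_120
```

That's why the prebuilt nightly falls back to JIT-compiling from PTX (and then dies with CUDA_ERROR_INVALID_PTX): its arch list stops well short of sm_120/compute_120.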
tf_env.txt
== check python ====================================================
python version: 3.10.12
python branch:
python build version: ('main', 'Feb 4 2025 14:57:36')
python compiler version: GCC 11.4.0
python implementation: CPython
== check os platform ===============================================
os: Linux
os kernel version: #1 SMP Tue Nov 5 00:21:55 UTC 2024
os release version: 5.15.167.4-microsoft-standard-WSL2
os platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
freedesktop os release: {'NAME': 'Ubuntu', 'ID': 'ubuntu', 'PRETTY_NAME': 'Ubuntu 22.04.5 LTS', 'VERSION_ID': '22.04', 'VERSION': '22.04.5 LTS (Jammy Jellyfish)', 'VERSION_CODENAME': 'jammy', 'ID_LIKE': 'debian', 'HOME_URL': 'https://www.ubuntu.com/', 'SUPPORT_URL': 'https://help.ubuntu.com/', 'BUG_REPORT_URL': 'https://bugs.launchpad.net/ubuntu/', 'PRIVACY_POLICY_URL': 'https://www.ubuntu.com/legal/terms-and-policies/privacy-policy', 'UBUNTU_CODENAME': 'jammy'}
mac version: ('', ('', '', ''), '')
uname: uname_result(system='Linux', node='win11ml', release='5.15.167.4-microsoft-standard-WSL2', version='#1 SMP Tue Nov 5 00:21:55 UTC 2024', machine='x86_64')
architecture: ('64bit', 'ELF')
machine: x86_64
== are we in docker ================================================
No
== c++ compiler ====================================================
/usr/bin/c++
c++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== check pips ======================================================
numpy 2.1.3
protobuf 5.29.3
tf_nightly 2.20.0.dev20250314
== check for virtualenv ============================================
Running inside a virtual environment.
== tensorflow import ===============================================
2025-03-14 21:02:48.002965: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1742007769.198398 317963 gpu_device.cc:2429] TensorFlow was not built with CUDA kernel binaries compatible with compute capability 12.0. CUDA kernels will be jit-compiled from PTX, which could take 30 minutes or longer.
W0000 00:00:1742007769.202246 317963 gpu_device.cc:2429] TensorFlow was not built with CUDA kernel binaries compatible with compute capability 12.0. CUDA kernels will be jit-compiled from PTX, which could take 30 minutes or longer.
I0000 00:00:1742007769.355021 317963 gpu_device.cc:2018] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29043 MB memory: -> device: 0, name: NVIDIA GeForce RTX 5090, pci bus id: 0000:09:00.0, compute capability: 12.0
tf.version.VERSION = 2.20.0-dev20250314
tf.version.GIT_VERSION = v1.12.1-123444-g07ff428d432
tf.version.COMPILER_VERSION = Ubuntu Clang 18.1.8 (++20240731024944+3b5b5c1ec4a3-1~exp1~20240731145000.144)
Sanity check: <tf.Tensor: shape=(1,), dtype=int32, numpy=array([1], dtype=int32)>
libcudnn not found
== env =============================================================
LD_LIBRARY_PATH /usr/local/cuda-12.8/lib64:
DYLD_LIBRARY_PATH is unset
== nvidia-smi ======================================================
Fri Mar 14 21:02:52 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 572.70 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 On | 00000000:09:00.0 Off | N/A |
| 0% 43C P1 78W / 600W | 2115MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 31 G /Xwayland N/A |
| 0 N/A N/A 35 G /Xwayland N/A |
+-----------------------------------------------------------------------------------------+
== cuda libs =======================================================
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudart.so.11.8.89
/usr/local/cuda-12.8/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-12.8/targets/x86_64-linux/lib/libcudart.so.12.8.90
== tensorflow installation =========================================
tensorflow not found
== tf_nightly installation =========================================
Name: tf_nightly
Version: 2.20.0.dev20250314
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author-email: [email protected]
License: Apache 2.0
Location: /home/mitch/.virtualenvs/tfnightie/lib/python3.10/site-packages
Required-by:
== python version ==================================================
(major, minor, micro, releaselevel, serial)
(3, 10, 12, 'final', 0)
== bazel version ===================================================
Bazelisk version: v1.25.0
Build label: 7.4.1
Build time: Mon Nov 11 21:24:53 2024 (1731360293)
Build timestamp: 1731360293
Build timestamp as int: 1731360293
Standalone code to reproduce the issue
Try running anything with an RTX 5090. My test script is above.
Relevant log output
Same as the log output shown above.
I should mention that I'm a Senior AI Developer by trade, and I'm more than happy to invest my personal time in helping to fix this; I'm just not sure where to start.
I should also mention that the latest clang release supports building for compute_100/sm_100+:
https://github.com/llvm/llvm-project/releases/tag/llvmorg-20.1.0
This isn't supported in LLVM 18, but clang 20 compiles the test program below for my GPU just fine (extra logs attached just in case they help someone else).
mitch@win11ml:~/stable_diff/build_tf/hello/hello_nvcc
$ clang++ --version
clang version 20.1.0 (https://github.com/llvm/llvm-project 24a30daaa559829ad079f2ff7f73eb4e18095f88)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/mitch/stable_diff/fix_tf/llvm/LLVM-20.1.0-Linux-X64/bin
mitch@win11ml:~/stable_diff/build_tf/hello/hello_nvcc
$ cat card_details.cu
#include <cuda_runtime.h>
#include <iostream>
int main() {
    cudaDeviceProp prop;
    int device;
    cudaGetDevice(&device);                 // Get the current device ID
    cudaGetDeviceProperties(&prop, device); // Get device properties
    size_t free_mem, total_mem;
    cudaMemGetInfo(&free_mem, &total_mem);  // Get VRAM usage
    std::cout << "GPU Name: " << prop.name << std::endl;
    std::cout << "Compute Capability: " << prop.major << "." << prop.minor << std::endl;
    std::cout << "VRAM Usage: " << (total_mem - free_mem) / (1024 * 1024) << " MB / " << total_mem / (1024 * 1024) << " MB" << std::endl;
    return 0;
}
mitch@win11ml:~/stable_diff/build_tf/hello/hello_nvcc
$ clang++ -std=c++17 --cuda-gpu-arch=sm_120 -x cuda --cuda-path="$CUDA_HOME" -I"$CUDA_HOME/include" -L"$CUDA_HOME/lib64" -lcudart card_details.cu -o card_details
clang++: warning: CUDA version 12.8 is only partially supported [-Wunknown-cuda-version]
mitch@win11ml:~/stable_diff/build_tf/hello/hello_nvcc
$ ./card_details
GPU Name: NVIDIA GeForce RTX 5090
Compute Capability: 12.0
VRAM Usage: 1763 MB / 32606 MB
mitch@win11ml:~/stable_diff/build_tf/hello/hello_nvcc
$ echo "$CUDA_HOME"
/usr/local/cuda-12.8
mitch@win11ml:~/stable_diff/build_tf/hello/hello_nvcc
$ ls "$CUDA_HOME"
DOCS EULA.txt README bin compute-sanitizer doc extras gds include lib64 libnvvp nsightee_plugins nvml nvvm share src targets tools version.json
mitch@win11ml:~/stable_diff/build_tf/hello/hello_nvcc
$ cat /usr/local/cuda-12.8/version.json | head -n5
{
"cuda" : {
"name" : "CUDA SDK",
"version" : "12.8.1"
},
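That version.json check can be scripted. Here's a minimal sketch, assuming the JSON layout shown above; the 12.8.1 floor for Blackwell comes from the discussion linked earlier, and the function names are my own:

```python
# Parse a CUDA toolkit version.json blob ({"cuda": {"version": "12.8.1", ...}})
# and check it against the 12.8.1 minimum suggested for Blackwell cards.
import json

def cuda_version(version_json_text: str) -> tuple[int, ...]:
    """Return the toolkit version as a comparable tuple, e.g. (12, 8, 1)."""
    data = json.loads(version_json_text)
    return tuple(int(part) for part in data["cuda"]["version"].split("."))

def supports_blackwell(version_json_text: str) -> bool:
    """True when the toolkit is at least 12.8.1."""
    return cuda_version(version_json_text) >= (12, 8, 1)

sample = '{"cuda": {"name": "CUDA SDK", "version": "12.8.1"}}'
print(cuda_version(sample))        # (12, 8, 1)
print(supports_blackwell(sample))  # True
```

In practice you'd feed it the contents of /usr/local/cuda-12.8/version.json; note that my box also has a stale cuda-11.8 toolkit lying around, which this kind of check would flag.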
I'm going to keep writing up my attempts to get things working here. I've cut a branch on my fork; still no luck so far, but here are some half-discoveries. More and more of the project builds as I continue, though I have zero idea how far I am from victory. Here's the branch I'm on, compared with the base:
https://github.com/maludwig/tensorflow/compare/ml/fixing_tf_env...maludwig:tensorflow:ml/attempting_build_rtx5090?expand=1
A few findings:
- CUDA 12.8.1 adds support for the RTX 5090 (and the other Blackwell cards), so we need it
- There's a bug in cutlass (which was forked for TensorFlow for a reason I don't know); the bug was fixed upstream here: https://github.com/NVIDIA/cutlass/pull/1784/files
- The old fork, done by @chsigg, was certainly done for a reason, and I have no idea what I'm breaking by going back to the NVIDIA main branch here. I'm not sure how to message people on GitHub, but maybe they'll get notified by this mention?
- I updated NCCL to the latest 2.26.2 wheel
- The build is still failing, but it takes WAY longer to fail now. This is possibly a good sign.
Yep, I'm stopping for the night. The build is currently stuck on what seem to be duplicate logging macros; it looks like two very, very similar logging libraries are somehow being included at the same time. But instead of taking 30 seconds to fail, the build now takes 17 minutes to fail, which I define as progress!
external/com_google_absl/absl/log/check.h:122:9: warning: 'CHECK_LT' macro redefined [-Wmacro-redefined]
122 | #define CHECK_LT(val1, val2) \
| ^
external/local_xla/xla/tsl/platform/default/logging.h:498:9: note: previous definition is here
498 | #define CHECK_LT(val1, val2) CHECK_OP(Check_LT, <, val1, val2)
| ^
In file included from tensorflow/core/kernels/fill_empty_rows_functor_gpu.cu.cc:21:
In file included from ./tensorflow/core/common_runtime/gpu/gpu_event_mgr.h:21:
In file included from ./tensorflow/core/common_runtime/device/device_event_mgr.h:30:
In file included from ./tensorflow/core/platform/stream_executor.h:21:
In file included from external/local_xla/xla/stream_executor/dnn.h:47:
In file included from external/local_xla/xla/stream_executor/scratch_allocator.h:26:
In file included from external/local_xla/xla/stream_executor/device_memory_allocator.h:22:
external/com_google_absl/absl/log/check.h:124:9: warning: 'CHECK_GE' macro redefined [-Wmacro-redefined]
124 | #define CHECK_GE(val1, val2) \
| ^
external/local_xla/xla/tsl/platform/default/logging.h:499:9: note: previous definition is here
499 | #define CHECK_GE(val1, val2) CHECK_OP(Check_GE, >=, val1, val2)
| ^
In file included from tensorflow/core/kernels/fill_empty_rows_functor_gpu.cu.cc:21:
In file included from ./tensorflow/core/common_runtime/gpu/gpu_event_mgr.h:21:
In file included from ./tensorflow/core/common_runtime/device/device_event_mgr.h:30:
In file included from ./tensorflow/core/platform/stream_executor.h:21:
In file included from external/local_xla/xla/stream_executor/dnn.h:47:
In file included from external/local_xla/xla/stream_executor/scratch_allocator.h:26:
In file included from external/local_xla/xla/stream_executor/device_memory_allocator.h:22:
VICTORY
Ok, I didn't stop for the night after all. Instead, I just ignored all manner of warnings that shouldn't be ignored:
bazel build //tensorflow/tools/pip_package:wheel --repo_env=WHEEL_NAME=tensorflow --config=cuda --config=cuda_wheel --copt=-Wno-gnu-offsetof-extensions --copt=-Wno-error --copt=-Wno-c23-extensions --verbose_failures --copt=-Wno-macro-redefined
And bam!
INFO: Found 1 target...
Target //tensorflow/tools/pip_package:wheel up-to-date:
bazel-bin/tensorflow/tools/pip_package/wheel_house/tensorflow-2.20.0.dev0+selfbuilt-cp310-cp310-linux_x86_64.whl
INFO: Elapsed time: 87.690s, Critical Path: 86.67s
INFO: 2 processes: 1 internal, 1 local.
INFO: Build completed successfully, 2 total actions
No idea if it'll work, but it did build! I've pushed the latest code changes to my branch.
https://github.com/maludwig/tensorflow/compare/ml/fixing_tf_env...maludwig:tensorflow:ml/attempting_build_rtx5090?expand=1
It passed one test!
(tfnightie) mitch@win11ml:~/stable_diff/fix_tf/tensorflow
$ bazel test --repo_env=WHEEL_NAME=tensorflow --config=cuda --config=cuda_wheel --copt=-Wno-gnu-offsetof-extensions --copt=-Wno-error --copt=-Wno-c23-extensions --verbose_failures --copt=-Wno-macro-redefined tensorflow/python/kernel_tests/nn_ops:softmax_op_test
WARNING: The following configs were expanded more than once: [cuda_clang, cuda]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
INFO: Reading 'startup' options from /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --windows_enable_symlinks
INFO: Options provided by the client:
Inherited 'common' options: --isatty=1 --terminal_columns=243
INFO: Reading rc options for 'test' from /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc:
Inherited 'common' options: --announce_rc --experimental_cc_shared_library --experimental_link_static_libraries_once=false --incompatible_enforce_config_setting_visibility --noenable_bzlmod --noincompatible_enable_cc_toolchain_resolution --noincompatible_enable_android_toolchain_resolution --experimental_repo_remote_exec --java_runtime_version=remotejdk_21
INFO: Reading rc options for 'test' from /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc:
Inherited 'build' options: --repo_env=ML_WHEEL_TYPE=snapshot --repo_env=ML_WHEEL_BUILD_DATE= --repo_env=ML_WHEEL_VERSION_SUFFIX= --define framework_shared_object=true --define tsl_protobuf_header_only=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --features=-force_no_whole_archive --host_features=-force_no_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2
INFO: Reading rc options for 'test' from /home/mitch/stable_diff/fix_tf/tensorflow/.tf_configure.bazelrc:
Inherited 'build' options: --action_env PYTHON_BIN_PATH=/home/mitch/.virtualenvs/tfnightie/bin/python --action_env PYTHON_LIB_PATH=/home/mitch/.virtualenvs/tfnightie/lib/python3.10/site-packages --python_path=/home/mitch/.virtualenvs/tfnightie/bin/python --action_env LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:/home/mitch/stable_diff/fix_tf/libs/cudnn-linux-x86_64-9.8.0.87_cuda12-archive/lib: --config=cuda_clang --action_env CLANG_CUDA_COMPILER_PATH=/home/mitch/stable_diff/fix_tf/llvm/LLVM-20.1.0-Linux-X64/bin/clang-20 --config=cuda_clang
INFO: Reading rc options for 'test' from /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc:
'test' options: --test_env=GTEST_INSTALL_FAILURE_SIGNAL_HANDLER=1
INFO: Reading rc options for 'test' from /home/mitch/stable_diff/fix_tf/tensorflow/.tf_configure.bazelrc:
'test' options: --test_size_filters=small,medium --test_env=LD_LIBRARY_PATH
INFO: Found applicable config definition build:short_logs in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition test:v2 in file /home/mitch/stable_diff/fix_tf/tensorflow/.tf_configure.bazelrc: --test_tag_filters=-benchmark-test,-no_oss,-oss_excluded,-no_gpu,-oss_serial,-v1only --build_tag_filters=-benchmark-test,-no_oss,-oss_excluded,-no_gpu,-v1only
INFO: Found applicable config definition build:cuda_clang in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --config=cuda --@local_config_cuda//:cuda_compiler=clang --copt=-Qunused-arguments --repo_env=HERMETIC_CUDA_COMPUTE_CAPABILITIES=sm_60,sm_70,sm_80,sm_89,compute_90 --copt=-Wno-unknown-cuda-version --host_linkopt=-fuse-ld=lld --host_linkopt=-lm --linkopt=-fuse-ld=lld --linkopt=-lm
INFO: Found applicable config definition build:cuda in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda --repo_env=HERMETIC_CUDA_VERSION=12.5.1 --repo_env=HERMETIC_CUDNN_VERSION=9.3.0 --@local_config_cuda//cuda:include_cuda_libs=true
INFO: Found applicable config definition build:cuda in file /home/mitch/stable_diff/fix_tf/tensorflow/.tf_configure.bazelrc: --repo_env HERMETIC_CUDA_VERSION=12.8.1 --repo_env HERMETIC_CUDNN_VERSION=9.8.0 --repo_env HERMETIC_CUDA_COMPUTE_CAPABILITIES=compute_120
INFO: Found applicable config definition build:cuda_clang in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --config=cuda --@local_config_cuda//:cuda_compiler=clang --copt=-Qunused-arguments --repo_env=HERMETIC_CUDA_COMPUTE_CAPABILITIES=sm_60,sm_70,sm_80,sm_89,compute_90 --copt=-Wno-unknown-cuda-version --host_linkopt=-fuse-ld=lld --host_linkopt=-lm --linkopt=-fuse-ld=lld --linkopt=-lm
INFO: Found applicable config definition build:cuda in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda --repo_env=HERMETIC_CUDA_VERSION=12.5.1 --repo_env=HERMETIC_CUDNN_VERSION=9.3.0 --@local_config_cuda//cuda:include_cuda_libs=true
INFO: Found applicable config definition build:cuda in file /home/mitch/stable_diff/fix_tf/tensorflow/.tf_configure.bazelrc: --repo_env HERMETIC_CUDA_VERSION=12.8.1 --repo_env HERMETIC_CUDNN_VERSION=9.8.0 --repo_env HERMETIC_CUDA_COMPUTE_CAPABILITIES=compute_120
INFO: Found applicable config definition build:cuda in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda --repo_env=HERMETIC_CUDA_VERSION=12.5.1 --repo_env=HERMETIC_CUDNN_VERSION=9.3.0 --@local_config_cuda//cuda:include_cuda_libs=true
INFO: Found applicable config definition build:cuda in file /home/mitch/stable_diff/fix_tf/tensorflow/.tf_configure.bazelrc: --repo_env HERMETIC_CUDA_VERSION=12.8.1 --repo_env HERMETIC_CUDNN_VERSION=9.8.0 --repo_env HERMETIC_CUDA_COMPUTE_CAPABILITIES=compute_120
INFO: Found applicable config definition build:cuda_wheel in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --@local_config_cuda//cuda:include_cuda_libs=false
INFO: Found applicable config definition build:linux in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --host_copt=-w --copt=-Wno-all --copt=-Wno-extra --copt=-Wno-deprecated --copt=-Wno-deprecated-declarations --copt=-Wno-ignored-attributes --copt=-Wno-array-bounds --copt=-Wunused-result --copt=-Werror=unused-result --copt=-Wswitch --copt=-Werror=switch --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /home/mitch/stable_diff/fix_tf/tensorflow/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
DEBUG: /home/mitch/.cache/bazel/_bazel_mitch/98f54844abcf3e1cdc99e9d96b271d9e/external/local_xla/third_party/py/python_repo.bzl:154:14:
HERMETIC_PYTHON_VERSION variable was not set correctly, using default version.
Python 3.10 will be used.
To select Python version, either set HERMETIC_PYTHON_VERSION env variable in
your shell:
export HERMETIC_PYTHON_VERSION=3.12
OR pass it as an argument to bazel command directly or inside your .bazelrc
file:
--repo_env=HERMETIC_PYTHON_VERSION=3.12
DEBUG: /home/mitch/.cache/bazel/_bazel_mitch/98f54844abcf3e1cdc99e9d96b271d9e/external/local_xla/third_party/py/python_repo.bzl:87:10:
=============================
Hermetic Python configuration:
Version: "3.10"
Kind: ""
Interpreter: "default" (provided by rules_python)
Requirements_lock label: "@python_version_repo//:requirements_lock_3_10.txt"
=====================================
WARNING: The following configs were expanded more than once: [cuda_clang, cuda]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
WARNING: Build options --@@local_config_cuda//cuda:include_cuda_libs, --copt, --cxxopt, and 2 more have changed, discarding analysis cache (this can be expensive, see https://bazel.build/advanced/performance/iteration-speed).
INFO: Analyzed 2 targets (749 packages loaded, 56015 targets configured).
INFO: Found 2 test targets...
INFO: Elapsed time: 270.116s, Critical Path: 245.71s
INFO: 2560 processes: 378 internal, 2182 local.
INFO: Build completed successfully, 2560 total actions
//tensorflow/python/kernel_tests/nn_ops:softmax_op_test_cpu PASSED in 217.4s
//tensorflow/python/kernel_tests/nn_ops:softmax_op_test_gpu PASSED in 218.4s
Executed 2 out of 2 tests: 2 tests pass.
I also installed the wheel generated in the last step into a new Python venv, and it worked!
(test5090build) mitch@win11ml:~/stable_diff/fix_tf/test5090build
$ python -c "import tensorflow as tf; print(tf.__version__)"
2025-03-17 04:37:51.455319: I external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1742207871.466384 646442 cuda_dnn.cc:8670] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
I0000 00:00:1742207871.469996 646442 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1742207871.479137 646442 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207871.479166 646442 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207871.479169 646442 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207871.479172 646442 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-03-17 04:37:51.481701: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2.20.0-dev0+selfbuilt
(test5090build) mitch@win11ml:~/stable_diff/fix_tf/test5090build
$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2025-03-17 04:38:02.348770: I external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1742207882.360431 646471 cuda_dnn.cc:8670] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
I0000 00:00:1742207882.364089 646471 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1742207882.373383 646471 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207882.373422 646471 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207882.373426 646471 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207882.373437 646471 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-03-17 04:38:02.376028: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
(test5090build) mitch@win11ml:~/stable_diff/fix_tf/test5090build
$ cat test_gpu.py
import tensorflow as tf

# Check if GPU is available
gpus = tf.config.list_physical_devices('GPU')
if not gpus:
    print("🚫 No GPU found!")
else:
    print(f"✅ Found GPU(s): {[gpu.name for gpu in gpus]}")

# Place operations on GPU
with tf.device('/GPU:0'):
    # Create two tensors
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

    # Add tensors
    add_result = tf.add(a, b)
    print("\nAddition result:")
    print(add_result)

    # Matrix multiplication
    matmul_result = tf.matmul(a, b)
    print("\nMatrix multiplication result:")
    print(matmul_result)

# Print device placement info (optional, debug; only affects ops created after this call)
print("\nDevice placement log:")
tf.debugging.set_log_device_placement(True)
(test5090build) mitch@win11ml:~/stable_diff/fix_tf/test5090build
$ python test_gpu.py
2025-03-17 04:38:25.409242: I external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1742207905.420314 646517 cuda_dnn.cc:8670] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
I0000 00:00:1742207905.423851 646517 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1742207905.432651 646517 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207905.432680 646517 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207905.432684 646517 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742207905.432686 646517 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-03-17 04:38:25.435305: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
✅ Found GPU(s): ['/physical_device:GPU:0']
I0000 00:00:1742207906.790435 646517 gpu_device.cc:2018] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29043 MB memory: -> device: 0, name: NVIDIA GeForce RTX 5090, pci bus id: 0000:09:00.0, compute capability: 12.0
Addition result:
tf.Tensor(
[[ 6. 8.]
[10. 12.]], shape=(2, 2), dtype=float32)
Matrix multiplication result:
tf.Tensor(
[[19. 22.]
[43. 50.]], shape=(2, 2), dtype=float32)
Device placement log:
I...am...going...to...run all the tests overnight? My build process is complete trash and I have no idea what I'm doing. I COULD also PR this code, but, like, that's slightly terrifying. I've been ignoring probably thousands of warnings that a competent C++ developer could probably actually solve, rather than just ignore...
Tests didn't pass, but it did build! And it could do basic matrix addition and multiplication in Python! NOW I'm definitely going to bed though.
It's also able to do the classic "hello world" ML task of learning digits on MNIST, but the warnings are PLENTIFUL and cryptic. I don't know what they mean, but the final model happens to work great!
(test5090build) mitch@win11ml:~/stable_diff/fix_tf/test5090build
$ cat mnist_test.py
#!/usr/bin/env python
import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np

# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Normalize pixel values to [0,1]
x_train = x_train / 255.0
x_test = x_test / 255.0

# Build the model
model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),   # Flatten 28x28 to 784
    layers.Dense(128, activation='relu'),   # Hidden layer
    layers.Dense(10, activation='softmax')  # Output layer
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=5, validation_split=0.1)

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_acc:.4f}")

# Make predictions
predictions = model.predict(x_test)

# Example: Print prediction for the first image
print(f"First test sample - Predicted: {np.argmax(predictions[0])}, Actual: {y_test[0]}")
(test5090build) mitch@win11ml:~/stable_diff/fix_tf/test5090build
$ ./mnist_test.py
2025-03-17 11:23:11.786039: I external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1742232191.796888 662647 cuda_dnn.cc:8670] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
I0000 00:00:1742232191.800405 662647 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1742232191.809207 662647 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742232191.809234 662647 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742232191.809238 662647 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1742232191.809259 662647 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-03-17 11:23:11.811904: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/home/mitch/.virtualenvs/test5090build/lib/python3.10/site-packages/keras/src/layers/reshaping/flatten.py:37: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
super().__init__(**kwargs)
I0000 00:00:1742232193.910442 662647 gpu_device.cc:2018] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29043 MB memory: -> device: 0, name: NVIDIA GeForce RTX 5090, pci bus id: 0000:09:00.0, compute capability: 12.0
Epoch 1/5
2025-03-17 11:23:15.366974: I external/local_xla/xla/service/service.cc:152] XLA service 0x7f5928008d30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2025-03-17 11:23:15.367006: I external/local_xla/xla/service/service.cc:160] StreamExecutor device (0): NVIDIA GeForce RTX 5090, Compute Capability 12.0
2025-03-17 11:23:15.376904: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1742232195.434380 662725 cuda_dnn.cc:529] Loaded cuDNN version 90800
2025-03-17 11:23:16.211437: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 80 bytes spill stores, 80 bytes spill loads
2025-03-17 11:23:16.220559: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95_0', 164 bytes spill stores, 164 bytes spill loads
2025-03-17 11:23:16.340804: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 392 bytes spill stores, 392 bytes spill loads
2025-03-17 11:23:16.364181: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 76 bytes spill stores, 76 bytes spill loads
2025-03-17 11:23:16.374280: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_232', 176 bytes spill stores, 176 bytes spill loads
2025-03-17 11:23:16.385374: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 76 bytes spill stores, 76 bytes spill loads
2025-03-17 11:23:16.393417: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 292 bytes spill stores, 292 bytes spill loads
2025-03-17 11:23:16.451825: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 532 bytes spill stores, 532 bytes spill loads
2025-03-17 11:23:16.522600: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_232', 168 bytes spill stores, 168 bytes spill loads
2025-03-17 11:23:16.556430: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 1040 bytes spill stores, 1040 bytes spill loads
2025-03-17 11:23:16.607519: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_232', 112 bytes spill stores, 112 bytes spill loads
2025-03-17 11:23:16.806055: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 4920 bytes spill stores, 4992 bytes spill loads
2025-03-17 11:23:16.867917: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 5084 bytes spill stores, 5028 bytes spill loads
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1742232197.511568 662725 device_compiler.h:196] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
1684/1688 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.8767 - loss: 0.44642025-03-17 11:23:20.774706: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_232', 32 bytes spill stores, 32 bytes spill loads
2025-03-17 11:23:20.804309: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 288 bytes spill stores, 288 bytes spill loads
2025-03-17 11:23:20.807183: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 76 bytes spill stores, 76 bytes spill loads
2025-03-17 11:23:20.828074: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 76 bytes spill stores, 76 bytes spill loads
2025-03-17 11:23:20.895561: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 752 bytes spill stores, 752 bytes spill loads
2025-03-17 11:23:21.076717: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_232', 80 bytes spill stores, 80 bytes spill loads
2025-03-17 11:23:21.096547: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_232', 72 bytes spill stores, 72 bytes spill loads
2025-03-17 11:23:21.177785: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 4920 bytes spill stores, 4992 bytes spill loads
2025-03-17 11:23:21.227351: I external/local_xla/xla/stream_executor/cuda/subprocess_compilation.cc:346] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_95', 5084 bytes spill stores, 5028 bytes spill loads
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 8s 3ms/step - accuracy: 0.8768 - loss: 0.4459 - val_accuracy: 0.9668 - val_loss: 0.1275
Epoch 2/5
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9598 - loss: 0.1359 - val_accuracy: 0.9710 - val_loss: 0.0985
Epoch 3/5
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9748 - loss: 0.0853 - val_accuracy: 0.9728 - val_loss: 0.0920
Epoch 4/5
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9810 - loss: 0.0640 - val_accuracy: 0.9775 - val_loss: 0.0809
Epoch 5/5
1688/1688 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9855 - loss: 0.0473 - val_accuracy: 0.9782 - val_loss: 0.0797
313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9720 - loss: 0.0911
Test accuracy: 0.9753
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step
First test sample - Predicted: 7, Actual: 7
@maludwig you should upgrade commit hash256 XLA on bazel file and it should work
Sorry that's a bit cryptic for me. I'm normally a Python dev, apologies. Did you mean in my commits on my branch above?
https://github.com/maludwig/tensorflow/compare/ml/fixing_tf_env...maludwig:tensorflow:ml/attempting_build_rtx5090?expand=1
+1
Steps to get it running on your RTX 5000 series card
Guide for all platforms
Install llvm 20.1.0
LLVM 20.1.0 is required to compile code for compute capability 10.0 and 12.0 (RTX 5000 series).
All platforms here:
https://github.com/llvm/llvm-project/releases/tag/llvmorg-20.1.0
Install CUDA 12.8.1
CUDA 12.8.1 is required to compile code for compute capability 10.0 and 12.0 (RTX 5000 series).
Also install cuDNN 9.8.0 and NCCL 2, for CUDA 12.
Install Python 3.10.12
This just happens to be the version I'm using and may be completely unnecessary. I personally love pyenv because it installs it to your local user, so you don't need to fret about admin/root permissions.
Make a Python venv for tensorflow
This will prevent your system from being polluted by tensorflow dependencies, and will make it much much much easier to clean up if you want to start over.
Install Bazelisk
Bazelisk is a wrapper for Bazel that downloads the correct version of Bazel for the project.
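The venv step above can be sketched like this (a minimal sketch; `~/tf-venv` is just an example directory name):

```shell
# Create and activate an isolated environment for the TensorFlow build;
# ~/tf-venv is an arbitrary example path, use whatever you like.
python3 -m venv "$HOME/tf-venv"
source "$HOME/tf-venv/bin/activate"
python -c "import sys; print(sys.prefix)"   # prints the venv path while active
```

For Bazelisk, download the release binary for your platform, mark it executable, and put it on your PATH under the name `bazel` (the WSL script below does exactly that).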
Clone tensorflow
echo "Clone tensorflow"
git clone [email protected]:tensorflow/tensorflow.git
cd tensorflow
echo "Add my remote to the repo"
git remote add maludwig '[email protected]:maludwig/tensorflow.git'
echo "Fetch my remote"
git fetch --all
echo "Checkout my branch"
git checkout ml/attempting_build_rtx5090
echo "Pull my branch"
git pull maludwig ml/attempting_build_rtx5090
Configure bazel
echo "Configure bazel, these are the settings I used, but I'm not sure if they're correct, or if they just happened to work for me."
export HERMETIC_CUDA_VERSION=12.8.1
export HERMETIC_CUDNN_VERSION=9.8.0
export HERMETIC_CUDA_COMPUTE_CAPABILITIES=compute_120
export LOCAL_CUDA_PATH=/usr/local/cuda-12.8
export LOCAL_NCCL_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2.26.2
export TF_NEED_CUDA=1
export CLANG_CUDA_COMPILER_PATH="$(which clang)"
python configure.py
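Since a missing export only surfaces much later as a confusing build error, it may be worth sanity-checking the variables before running configure.py; a minimal sketch (the variable list just mirrors the exports above):

```shell
# Print each build-critical variable, flagging any that are unset or empty.
for var in HERMETIC_CUDA_VERSION HERMETIC_CUDNN_VERSION \
           HERMETIC_CUDA_COMPUTE_CAPABILITIES CLANG_CUDA_COMPILER_PATH; do
  if [[ -z "${!var:-}" ]]; then
    echo "MISSING: $var"
  else
    echo "OK: $var=${!var}"
  fi
done
```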
Build tensorflow
echo "Good luck building!"
echo "Note, I have trust issues with bazel now, so I always run 'bazel clean --expunge' before building. This may be a personal psychological issue rather than a requirement."
bazel build //tensorflow/tools/pip_package:wheel --repo_env=WHEEL_NAME=tensorflow --config=cuda --config=cuda_wheel --copt=-Wno-gnu-offsetof-extensions --copt=-Wno-error --copt=-Wno-c23-extensions --verbose_failures --copt=-Wno-macro-redefined
Script for WSL Ubuntu 22.04
This script should let you compile for RTX 5000 series on WSL Ubuntu 22.04.
Before running this script, be sure to install the latest drivers for your RTX 5000 series card on the Windows side, install WSL2, and use Ubuntu 22.04. Then reboot your PC; that way, WSL2 will be able to see your GPU.
It probably also works on non-WSL Ubuntu 22.04.
It might maybe work on other Ubuntu versions.
It's not going to work for Windows except in WSL.
It may not work at all. Consider copying it line by line and handling errors manually.
mkdir -p "$HOME/rtx5000"
cd "$HOME/rtx5000"
echo "Installing essential dev tools"
sudo apt-get update
sudo apt-get install -y build-essential wget patchelf
echo "Installing Python 3.10"
sudo apt install -y make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev
curl https://pyenv.run | bash
echo "Restart your shell (or add the pyenv init lines it prints to your ~/.bashrc) so that pyenv is on your PATH"
pyenv install 3.10.12
pyenv global 3.10.12
echo "Confirm this says Python 3.10.12"
python --version
echo "Make a virtualenv for tensorflow"
python3.10 -m venv ~/rtx5000/venv
echo "Activate the python virtualenv"
source ~/rtx5000/venv/bin/activate
echo "Installing LLVM 20.1.0"
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-20.1.0/LLVM-20.1.0-Linux-X64.tar.xz
tar -xvf LLVM-20.1.0-Linux-X64.tar.xz
echo "Installing NVIDIA packages"
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
echo "Installing NVIDIA CUDA 12.8"
sudo apt-get -y install cuda-toolkit-12-8
echo "Installing NVIDIA cuDNN 9, for CUDA 12"
sudo apt-get -y install cudnn9-cuda-12
echo "Installing NVIDIA NCCL 2"
sudo apt install -y libnccl2=2.26.2-1+cuda12.8 libnccl-dev=2.26.2-1+cuda12.8
echo "Installing Bazelisk for Bazel"
mkdir -p ~/rtx5000/bin
cd ~/rtx5000/bin
wget 'https://github.com/bazelbuild/bazelisk/releases/download/v1.25.0/bazelisk-linux-amd64'
chmod +x bazelisk-linux-amd64
mv bazelisk-linux-amd64 bazel
Add these lines to your ~/.bashrc or ~/.zshrc file:
export LLVM_HOME="$HOME/rtx5000/LLVM-20.1.0-Linux-X64"
export CUDA_HOME="/usr/local/cuda-12.8"
export PATH="${LLVM_HOME}/bin:${CUDA_HOME}/bin:${HOME}/rtx5000/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
export CPATH="$CUDA_HOME/include:$CPATH"
Restart your terminal.
Test that the LLVM installation worked:
Make this file in ~/rtx5000/card_details.cu:
#include <cuda_runtime.h>
#include <cudnn.h>  // Add cuDNN header
#include <iostream>

int main() {
    cudaDeviceProp prop;
    int device;
    cudaGetDevice(&device);                  // Get the current device ID
    cudaGetDeviceProperties(&prop, device);  // Get device properties

    size_t free_mem, total_mem;
    cudaMemGetInfo(&free_mem, &total_mem);   // Get VRAM usage

    std::cout << "> GPU Name: " << prop.name << std::endl;
    std::cout << "> Compute Capability: " << prop.major << "." << prop.minor << std::endl;
    std::cout << "> VRAM Usage: " << (total_mem - free_mem) / (1024 * 1024) << " MB / "
              << total_mem / (1024 * 1024) << " MB" << std::endl;

    // Print cuDNN version
    std::cout << "> cuDNN Version: "
              << CUDNN_MAJOR << "."
              << CUDNN_MINOR << "."
              << CUDNN_PATCHLEVEL
              << std::endl;
    return 0;
}
Check compilers
echo "This should be LLVM 20.1.0"
which clang
clang --version
echo "This should be CUDA 12.8"
which nvcc
nvcc --version
echo "This might be a recursive symlink, in which case, it should be fixed"
if [[ -L /usr/local/cuda-12.8/lib/lib64 ]]; then
echo 'RECURSIVE SYMLINK FOUND, REINSTALL CUDA 12.8.1
You could try:
sudo rm -r /usr/local/cuda-12.8/lib
sudo ln -s /usr/local/cuda-12.8/lib64 /usr/local/cuda-12.8/lib
'
fi
if [[ -f /usr/local/cuda-12.8/lib64/libcudart_static.a ]]; then
echo Found cudart libs
else
echo Installing CUDA libs
sudo apt-get install --reinstall cuda-cudart-dev-12-8
fi
APT_PACKAGES="$(apt list --installed 2>/dev/null)"
CUDA_PACKAGE_LIST=(
cuda-cccl-12-8
cuda-command-line-tools-12-8
cuda-compiler-12-8
cuda-crt-12-8
cuda-cudart-12-8
cuda-cudart-dev-12-8
cuda-cuobjdump-12-8
cuda-cupti-12-8
cuda-cupti-dev-12-8
cuda-cuxxfilt-12-8
cuda-documentation-12-8
cuda-driver-dev-12-8
cuda-gdb-12-8
cuda-libraries-12-8
cuda-libraries-dev-12-8
cuda-nsight-12-8
cuda-nsight-compute-12-8
cuda-nsight-systems-12-8
cuda-nvcc-12-8
cuda-nvdisasm-12-8
cuda-nvml-dev-12-8
cuda-nvprof-12-8
cuda-nvprune-12-8
cuda-nvrtc-12-8
cuda-nvrtc-dev-12-8
cuda-nvtx-12-8
cuda-nvvm-12-8
cuda-nvvp-12-8
cuda-opencl-12-8
cuda-opencl-dev-12-8
cuda-profiler-api-12-8
cuda-sanitizer-12-8
cuda-toolkit-12-8
cuda-tools-12-8
cuda-visual-tools-12-8
cudnn9-cuda-12-8
)
echo "Make sure you have all the CUDA packages for CUDA 12.8"
for CUDA_PACKAGE in "${CUDA_PACKAGE_LIST[@]}"; do
    if echo "$APT_PACKAGES" | grep -q "^${CUDA_PACKAGE}/"; then
        echo "Found: $CUDA_PACKAGE"
    else
        echo "MISSING CUDA PACKAGE: ${CUDA_PACKAGE}"
    fi
done
echo "This should compile the code with nvcc"
cd ~/rtx5000
nvcc -o card_details_nvcc card_details.cu
echo "This should print your card details"
./card_details_nvcc
> GPU Name: NVIDIA GeForce RTX 5090
> Compute Capability: 12.0
> VRAM Usage: 1763 MB / 32606 MB
> cuDNN Version: 9.8.0
echo "This should compile the code with clang++"
clang++ -std=c++17 --cuda-gpu-arch=sm_120 -x cuda --cuda-path="$CUDA_HOME" -I"$CUDA_HOME/include" -L"$CUDA_HOME/lib64" -lcudart card_details.cu -o card_details_clang
echo "This should print your card details again, just the same as before"
./card_details_clang
> GPU Name: NVIDIA GeForce RTX 5090
> Compute Capability: 12.0
> VRAM Usage: 1763 MB / 32606 MB
> cuDNN Version: 9.8.0
echo "This should be Bazel v8.8.1"
bazel --version
echo "Activate the python virtualenv"
source ~/rtx5000/venv/bin/activate
echo "This should be Python 3.10.12"
python --version
echo "Clone tensorflow"
cd ~/rtx5000
git clone [email protected]:tensorflow/tensorflow.git
cd tensorflow
echo "Add my remote to the repo"
git remote add maludwig '[email protected]:maludwig/tensorflow.git'
echo "Fetch my remote"
git fetch --all
echo "Checkout my branch"
git checkout ml/attempting_build_rtx5090
echo "Pull my branch"
git pull maludwig ml/attempting_build_rtx5090
echo "Configure bazel, these are the settings I used, but I'm not sure if they're correct, or if they just happened to work for me."
export HERMETIC_CUDA_VERSION=12.8.1
export HERMETIC_CUDNN_VERSION=9.8.0
export HERMETIC_CUDA_COMPUTE_CAPABILITIES=compute_120
export LOCAL_CUDA_PATH=/usr/local/cuda-12.8
export LOCAL_NCCL_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2.26.2
export TF_NEED_CUDA=1
export CLANG_CUDA_COMPILER_PATH="$(which clang)"
python configure.py
echo "Good luck building!"
echo "Note, I have trust issues with bazel now, so I always run 'bazel clean --expunge' before building. This may be a personal psychological issue rather than a requirement."
bazel build //tensorflow/tools/pip_package:wheel --repo_env=WHEEL_NAME=tensorflow --config=cuda --config=cuda_wheel --copt=-Wno-gnu-offsetof-extensions --copt=-Wno-error --copt=-Wno-c23-extensions --verbose_failures --copt=-Wno-macro-redefined
NOTE
You mayyyybe need to get the very latest cuDNN with this, but I don't think so.
cd ~/rtx5000
wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.8.0.87_cuda12-archive.tar.xz
tar -xvf cudnn-linux-x86_64-9.8.0.87_cuda12-archive.tar.xz
echo add this to your ~/.bashrc
export LD_LIBRARY_PATH="$HOME/rtx5000/cudnn-linux-x86_64-9.8.0.87_cuda12-archive/lib:$LD_LIBRARY_PATH"
NOTE: If this doesn't work for you, let me know which error you got, and maybe I missed something in my environment. Since this was already my dev box, I'm not sure if this is a complete guide, but it's what I did to get it working.
Hey @Venkat6871, just seeing the tags you added. To be clear, this is on tf_nightly, not tf 2.18, and I have no idea really what I'm doing, so I'm not gonna PR my extremely busted, tests-failing branch, even though it does build. I put it here so that someone who knows what they're doing could fold in the new stuff more easily, or so that other normal humans like me could run tensorflow on an RTX 5000 instead of not being able to run it at all. An actual human who knows what they're doing should look this over and figure it out.
cd ~/rtx5000
nvcc -o card_details_nvcc card_details.cu
-bash: cd: /home/nicolai/rtx5000: No such file or directory
cc1plus: fatal error: card_details.cu: No such file or directory
compilation terminated.
I tried to run it on my WSL.
The build doesn't work; how can I use a prebuilt nightly build?
Configuration: 8850a00e136a9e8be32c557a177e77f38f3c27b70c44518acb5ba0af47f7836b
Execution platform: @@local_execution_config_platform//:platform
In file included from external/local_xla/xla/stream_executor/cuda/cuda_status.cc:16:
external/local_xla/xla/stream_executor/cuda/cuda_status.h:22:10: fatal error: 'third_party/gpus/cuda/include/cuda.h' file not found
22 | #include "third_party/gpus/cuda/include/cuda.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Target //tensorflow/tools/pip_package:wheel failed to build
ERROR: /mnt/c/Projekte/tmp/tensorflow/tensorflow/tools/pip_package/BUILD:293:9 Action tensorflow/tools/pip_package/wheel_house/tensorflow-2.20.0.dev0+selfbuilt-cp312-cp312-linux_x86_64.whl failed: (Exit 1): clang-20 failed: error executing CppCompile command (from target @@local_xla//xla/stream_executor/cuda:cuda_status)
(cd /root/.cache/bazel/_bazel_root/509ab554767d44265e0030c4731aba07/execroot/org_tensorflow &&
exec env -
CLANG_CUDA_COMPILER_PATH=/usr/local/bin/clang-20
PATH=/root/.cache/bazelisk/downloads/sha256/c97f02133adce63f0c28678ac1f21d65fa8255c80429b588aeeba8a1fac6202b/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
PWD=/proc/self/cwd
PYTHON_BIN_PATH=/mnt/c/Projekte/env/bin/python3
PYTHON_LIB_PATH=/mnt/c/Projekte/env/lib/python3.12/site-packages
TF2_BEHAVIOR=1
/usr/local/bin/clang-20 -MD -MF bazel-out/k8-opt/bin/external/local_xla/xla/stream_executor/cuda/_objs/cuda_status/cuda_status.pic.d '-frandom-seed=bazel-out/k8-opt/bin/external/local_xla/xla/stream_executor/cuda/_objs/cuda_status/cuda_status.pic.o' -iquote external/local_xla -iquote bazel-out/k8-opt/bin/external/local_xla -iquote external/com_google_absl -iquote bazel-out/k8-opt/bin/external/com_google_absl -iquote external/local_config_cuda -iquote bazel-out/k8-opt/bin/external/local_config_cuda -iquote external/cuda_cudart -iquote bazel-out/k8-opt/bin/external/cuda_cudart -iquote external/cuda_cublas -iquote bazel-out/k8-opt/bin/external/cuda_cublas -iquote external/cuda_cccl -iquote bazel-out/k8-opt/bin/external/cuda_cccl -iquote external/cuda_nvtx -iquote bazel-out/k8-opt/bin/external/cuda_nvtx -iquote external/cuda_nvcc -iquote bazel-out/k8-opt/bin/external/cuda_nvcc -iquote external/cuda_cusolver -iquote bazel-out/k8-opt/bin/external/cuda_cusolver -iquote external/cuda_cufft -iquote bazel-out/k8-opt/bin/external/cuda_cufft -iquote external/cuda_cusparse -iquote bazel-out/k8-opt/bin/external/cuda_cusparse -iquote external/cuda_curand -iquote bazel-out/k8-opt/bin/external/cuda_curand -iquote external/cuda_cupti -iquote bazel-out/k8-opt/bin/external/cuda_cupti -iquote external/cuda_nvml -iquote bazel-out/k8-opt/bin/external/cuda_nvml -iquote external/cuda_nvjitlink -iquote bazel-out/k8-opt/bin/external/cuda_nvjitlink -iquote external/local_tsl -iquote bazel-out/k8-opt/bin/external/local_tsl -Ibazel-out/k8-opt/bin/external/local_config_cuda/cuda/_virtual_includes/cuda_headers -Ibazel-out/k8-opt/bin/external/cuda_cudart/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_cublas/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_cccl/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_nvtx/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_nvcc/_virtual_includes/headers 
-Ibazel-out/k8-opt/bin/external/cuda_cusolver/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_cufft/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_cusparse/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_curand/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_cupti/_virtual_includes/headers -Ibazel-out/k8-opt/bin/external/cuda_nvml/_virtual_includes/headers ...
@maludwig and @Venkat6871 is there a build that I can use? (like nightly build)
Hey @Nebolon, scroll up until you see "Script for WSL Ubuntu 22.04" in the comments.
The issue I raised is that there is no build, nightly or otherwise, that supports the latest Blackwell GPUs. I arguably managed to build one myself. You could too. But read through the script I put up above slowly; it looks like you missed some steps. HOPEFULLY the script I wrote will work for someone else, but since I got it working on an old dev box rather than a brand fresh new blank Docker container or something, it's likely that I missed a dependency or two.
@maludwig it doesn't work for me. At the step 'echo "This should be Bazel v8.8.1" / bazel --version' I only get 8.1.1, and I get some errors from the build.
Is there any chance TensorFlow will eventually support the 5090 on its own, so that I can simply use the next version of TensorFlow?
If so, please give me a date.
Yes, I have the same problem with an NVIDIA RTX 5090 32GB Blackwell (unlike the nightly version of PyTorch, which works). TensorFlow cannot see the GPU. Can you take a look at https://gist.github.com/donhuvy/6cd637a09b034168d01181d5ce98a5fe ? I get Num GPUs Available: 0 . My environment: Windows 11 Pro, latest JupyterLab, Python 3.11.x.
Wait, so you got it to work 100%? It sucks that my system has two 5090s and I'm using a CPU for training.
Whenever I couldn't get drivers to work, it usually resolved itself after installing, reinstalling, and changing versions of different packages. Given the shortage, I doubt there is overwhelming support for the 5090 yet; I remember every launch having crashes and minimally documented bugs that disappear over a relatively short period of time.
I got stuck similarly a while ago on different cards; in general, it might be a tiny thing somewhere with your paths and env.
Try Linux and see if that works. I don't know why you are using Windows as a senior dev. My speeds on the 4090 doubled for render times on anything AI/ML related, and the loading times of nearly everything Python vanished.
@maludwig it doesn't work for me. At the step 'echo "This should be Bazel v8.8.1" / bazel --version' I only get 8.1.1, and I get some errors from the build.
Is there any chance TensorFlow will eventually support the 5090 on its own, so that I can simply use the next version of TensorFlow?
If so, please give me a date.
Sorry @Nebolon, I'm not a TensorFlow employee; I'm just some dude, so I can't guess when it will be fixed. I just got my build to work and my personal projects running fine. My tests are failing and I assume that needs resolving.
If your Bazel version is wrong, try installing Bazelisk. See above for instructions.
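For what it's worth, Bazelisk resolves the Bazel version from the `USE_BAZEL_VERSION` environment variable first, then from the `.bazelversion` file in the repo root, so you can pin the version explicitly; a sketch (the version string here is only an example):

```shell
# Pin the Bazel version that Bazelisk should download for this checkout.
# 8.1.1 is only an example; use whatever version the build expects.
echo "8.1.1" > .bazelversion
cat .bazelversion
# Or override just one invocation:
# USE_BAZEL_VERSION=8.1.1 bazel --version
```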
Wait, so you got it to work 100%? It sucks that my system has two 5090s and I'm using a CPU for training.
Yep. For my workflow (training StableDiffusion LoRAs) it works fine. The tests are failing locally, but they must be testing tensorflow components that I am not using.
You could presumably try following in my footsteps and use your 5090s.
Try Linux and see if that works. I don't know why you are using Windows as a senior dev. My speeds on the 4090 doubled for render times on anything AI/ML related, and the loading times of nearly everything Python vanished.
I'm also a senior dev, and while I agree in general that Linux is better and faster, Windows is still a perfectly legit OS. In fact, Apple Silicon is quite nice for training too: there's no distinction between RAM and VRAM on arm64 Macs, so huge models run on consumer hardware. Not nearly as fast as on NVIDIA, but macOS is a legit OS too.
@maludwig I tried following the WSL script and got this error in the final step: "external/local_tsl/tsl/profiler/lib/nvtx_utils.cc:32:10: fatal error: 'third_party/gpus/cuda/include/cuda.h' file not found". All previous steps were OK, such as the one building the .cu file using clang.
@jianingchen
What's your HERMETIC_CUDA_VERSION? It should be 12.8.1
Apart from that, maybe try cleaning the Bazel cache?
# Double check CUDA
echo "HERMETIC_CUDA_VERSION: $HERMETIC_CUDA_VERSION"
# I have trust issues with every cache thing
bazel clean --expunge
@maludwig It is set to 12.8.1 correctly. I also tried cleaning the Bazel cache; the error was the same: header files in 'third_party/gpus/cuda/include' cannot be found.
@maludwig Some additional info: among the verbose error text, it displayed some environment variables such as "LD_LIBRARY_PATH", "PATH", etc., but no "CPATH" can be seen. Could this be related to the issue?
@jianingchen find nvtx_utils.cc and try changing that include to #include "cuda.h"
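A sketch of that edit with GNU sed; the real file lives under Bazel's external/ tree at the path shown in the error message (so a `bazel clean --expunge` will undo it). A stand-in file is used here just to show the substitution:

```shell
# Create a stand-in file with the offending include line...
mkdir -p demo_tsl
printf '#include "third_party/gpus/cuda/include/cuda.h"\n' > demo_tsl/nvtx_utils.cc
# ...then rewrite it to the bare header name, as suggested above.
sed -i 's|third_party/gpus/cuda/include/cuda.h|cuda.h|' demo_tsl/nvtx_utils.cc
cat demo_tsl/nvtx_utils.cc   # now reads: #include "cuda.h"
```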
@maludwig I tried following the WSL script and got this error in the final step: "external/local_tsl/tsl/profiler/lib/nvtx_utils.cc:32:10: fatal error: 'third_party/gpus/cuda/include/cuda.h' file not found". All previous steps were OK, such as the one building the .cu file using clang.