
Cannot disable XLA and/or JIT

Open wudisheng opened this issue 1 year ago • 3 comments

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

2.15.0

Custom code

Yes

OS platform and distribution

Linux Ubuntu 22.04.3

Mobile device

No response

Python version

3.10.12 (Unrelated)

Bazel version

6.3.2

GCC/compiler version

Clang 18

CUDA/cuDNN version

12.2

GPU model and memory

A100 and A10

Current behavior?

After upgrading from 2.12.1 to 2.15.0, we observed many new logs when starting a C++ service that uses TensorFlow: it invokes ptxas to compile some generated PTX. A snippet is attached below.

We tried a number of options in TF_XLA_FLAGS, such as --tf_xla_auto_jit=-1, --tf_mlir_enable_mlir_bridge=0, --tf_xla_cpu_global_jit=0, and --tf_xla_clustering_fuel=0, but it still compiles those ops in the CreateGpuKernelToBlobPass pass anyway.

Did anything related to XLA/JIT change between 2.12.1 and 2.15.0? And is there a way to simply disable all XLA and JIT?

BTW: Both TensorFlow versions were built with XLA and CUDA support, with TF_CUDA_COMPUTE_CAPABILITIES="7.0,7.5,8.0,8.6,9.0".
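For reference, the flag combinations above were set via the environment, roughly like this (flag names are taken verbatim from this report; they must be exported before the process that links TensorFlow starts):

```shell
# Flags we tried, individually and in combination; none of them
# suppressed the CreateGpuKernelToBlobPass compilations.
export TF_XLA_FLAGS="--tf_xla_auto_jit=-1 --tf_xla_cpu_global_jit=0 --tf_xla_clustering_fuel=0 --tf_mlir_enable_mlir_bridge=0"
echo "$TF_XLA_FLAGS"
```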

Standalone code to reproduce the issue

It is a C++ service that loads a model and performs inference; we do not have a minimal example at this time. However, we can provide as much detailed context as possible if we can narrow the scenario down to a reasonably small portion of the service.

Relevant log output

2024-01-29 15:14:43.203921: I external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:565] Compile module main_kernel_0 time: 7.33 ms (cumulative: 82.4 ms, max: 9.23 ms, #called: 15)
2024-01-29 15:14:43.204021: I external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:565] Compile module main_kernel time: 7.43 ms (cumulative: 89.9 ms, max: 9.23 ms, #called: 16)
2024-01-29 15:14:43.204084: I external/local_xla/xla/stream_executor/gpu/asm_compiler.cc:263] ptx written to: /tmp/tempfile-jscs02-ai-deep-dev-4a10-01-d460ae06-1673-61010665dbba9
2024-01-29 15:14:43.204112: I external/local_xla/xla/stream_executor/gpu/asm_compiler.cc:295] /usr/local/cuda-12.2/bin/ptxas /tmp/tempfile-jscs02-ai-deep-dev-4a10-01-d460ae06-1673-61010665dbba9 -o /tmp/tempfile-jscs02-ai-deep-dev-4a10-01-d460ae06-1673-61010665dbc05 -arch=sm_86 --warn-on-spills -v
2024-01-29 15:14:43.204182: I external/local_xla/xla/stream_executor/gpu/asm_compiler.cc:263] ptx written to: /tmp/tempfile-jscs02-ai-deep-dev-4a10-01-b79066f7-1673-61010665dbc0d
2024-01-29 15:14:43.204209: I external/local_xla/xla/stream_executor/gpu/asm_compiler.cc:295] /usr/local/cuda-12.2/bin/ptxas /tmp/tempfile-jscs02-ai-deep-dev-4a10-01-b79066f7-1673-61010665dbc0d -o /tmp/tempfile-jscs02-ai-deep-dev-4a10-01-b79066f7-1673-61010665dbc66 -arch=sm_86 --warn-on-spills -v
2024-01-29 15:14:43.204334: I external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:565] Compile module main_kernel_1 time: 7.71 ms (cumulative: 97.6 ms, max: 9.23 ms, #called: 17)
2024-01-29 15:14:43.204499: I external/local_xla/xla/stream_executor/gpu/asm_compiler.cc:263] ptx written to: /tmp/tempfile-jscs02-ai-deep-dev-4a10-01-7941e65c-1673-61010665dbd48
2024-01-29 15:14:43.204528: I external/local_xla/xla/stream_executor/gpu/asm_compiler.cc:295] /usr/local/cuda-12.2/bin/ptxas /tmp/tempfile-jscs02-ai-deep-dev-4a10-01-7941e65c-1673-61010665dbd48 -o /tmp/tempfile-jscs02-ai-deep-dev-4a10-01-7941e65c-1673-61010665dbda4 -arch=sm_86 --warn-on-spills -v
2024-01-29 15:14:43.236003: I external/local_xla/xla/stream_executor/gpu/asm_compiler.cc:333] ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'main_kernel' for 'sm_86'
ptxas info    : Function properties for main_kernel
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 452 bytes cmem[0]
2024-01-29 15:14:43.236589: I external/local_xla/xla/stream_executor/gpu/asm_compiler.cc:333] ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'main_kernel' for 'sm_86'
ptxas info    : Function properties for main_kernel
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 476 bytes cmem[0]

2024-01-29 15:14:43.237263: I external/local_xla/xla/stream_executor/gpu/asm_compiler.cc:333] ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'main_kernel' for 'sm_86'
ptxas info    : Function properties for main_kernel
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 508 bytes cmem[0]

2024-01-29 15:14:43.238685: I external/local_xla/xla/stream_executor/gpu/asm_compiler.cc:333] ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'main_kernel' for 'sm_86'
ptxas info    : Function properties for main_kernel
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 532 bytes cmem[0]

2024-01-29 15:14:43.240855: I external/local_xla/xla/stream_executor/gpu/asm_compiler.cc:333] ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'main_kernel' for 'sm_86'
ptxas info    : Function properties for main_kernel
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 15 registers, 564 bytes cmem[0]

2024-01-29 15:14:43.241720: I external/local_xla/xla/stream_executor/gpu/asm_compiler.cc:333] ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'main_kernel' for 'sm_86'
ptxas info    : Function properties for main_kernel
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 18 registers, 444 bytes cmem[0]

2024-01-29 15:14:43.241731: I external/local_xla/xla/stream_executor/gpu/asm_compiler.cc:333] ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'main_kernel' for 'sm_86'
ptxas info    : Function properties for main_kernel
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 16 registers, 444 bytes cmem[0]

2024-01-29 15:14:43.242423: I external/local_xla/xla/stream_executor/gpu/asm_compiler.cc:333] ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'main_kernel' for 'sm_86'
ptxas info    : Function properties for main_kernel
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 18 registers, 452 bytes cmem[0]

libunwind: __unw_add_dynamic_fde: bad fde: FDE is really a CIE
2024-01-29 15:14:43.416923: I tensorflow/core/common_runtime/gpu/gpu_device.cc:812] GpuDevice::ComputeHelper scheduled dense/clip_by_value_65 op Maximum on GPU 0 stream[0]

wudisheng avatar Jan 29 '24 16:01 wudisheng

Hi @wudisheng ,

Could you try this once:

Set XLA_FLAGS=--xla_disabled_backends=cpu,gpu to disable XLA for both CPU and GPU.

Note: Disabling XLA and JIT might impact performance, so consider testing and benchmarking before deploying.

If the problem is not resolved, please share a simple standalone example that reproduces the issue.

Thank you!

Venkat6871 avatar Jan 30 '24 09:01 Venkat6871

--xla_disabled_backends=cpu,gpu

That is not a valid XLA_FLAGS entry in 2.15.0; I got:

2024-01-30 18:49:17.343195: F external/local_xla/xla/parse_flags_from_env.cc:210] Unknown flags in XLA_FLAGS: --xla_disabled_backends=cpu,gpu 

BTW: I also tried --xla_gpu_disable_gpuasm_optimizations=true, --xla_backend_optimization_level=0, and --xla_gpu_autotune_level=0, but it still looks for ptxas and compiles a bunch of "entry function 'main_kernel' ..." kernels.

wudisheng avatar Jan 30 '24 10:01 wudisheng

OOC, why do you want to disable XLA? This is not generally supported; a lot of features and some layers are XLA-only these days.

cheshire avatar Feb 02 '24 18:02 cheshire

Generally we don't trust JIT --- we don't want our graph to be optimized or fused in any way beyond our SWEs' knowledge. Performance is the last of our concerns.

Actually, we have been seeing a number of additional errors after trying 2.15.0, including CUDA_ERROR_ILLEGAL_ADDRESS, etc. (when doing exactly the same thing as with 2.12.1), so we plan to stick with 2.12.1 for some time, provided we can resolve some other issues with using 2.12.1 under CUDA 12.1/12.2.

wudisheng avatar Feb 04 '24 03:02 wudisheng

Overall this is not supported: if a TF function is annotated with @tf.function(jit_compile=True), there is no way to avoid JITing.

You can use the "fully eager mode" fallback, which disables tf.function itself; that should be sufficient.

The statement

we don't want our graph to be optimized or fused in any way beyond our SWE's knowledge

already does not hold without XLA: the Grappler pass runs a number of fusion and other optimizations on the TF graph even without XLA.
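For what it's worth, both mechanisms (XLA auto-clustering and the Grappler meta-optimizer) can nominally be switched off per session via the ConfigProto. A sketch in protobuf text format follows; the field names are from tensorflow/core/protobuf/config.proto and rewriter_config.proto, but whether this suppresses the kernel-generator compilations reported here is untested:

```proto
# ConfigProto fragment (protobuf text format). In C++ this is set on
# tensorflow::SessionOptions::config before creating the session.
graph_options {
  optimizer_options {
    global_jit_level: OFF         # disable XLA auto-clustering for this session
  }
  rewrite_options {
    disable_meta_optimizer: true  # disable the Grappler meta-optimizer
  }
}
```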

cheshire avatar Feb 04 '24 09:02 cheshire

We don't use Python. What's the equivalent way of doing your suggestion in C++?

wudisheng avatar Feb 04 '24 10:02 wudisheng