NCCL + XLA fails for multi-GPU training.
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
binary
TensorFlow version
tf 2.15
Custom code
Yes
OS platform and distribution
Linux Ubuntu 22.04
Mobile device
No response
Python version
3.11
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
12.2/8.9.4
GPU model and memory
A100 40GB (20GB MiG)
Current behavior?
I am trying to run multi-GPU training with an XLA-compiled model (a simple CNN with a classification head). Without XLA, everything runs fine. With XLA enabled, I get one of two errors in the log, depending on whether I am using 4 GPUs or 2 GPUs. Each physical GPU is split into two MiG instances.
I also tried previous TF/CUDA versions and got the same result.
Standalone code to reproduce the issue
import argparse
import json
import os
from typing import Any

import numpy as np
import tensorflow as tf


def get_replica_hostnames():
    ...


def get_replica_id():
    ...


def set_multiworker_env_config():
    hostnames = get_replica_hostnames()
    replica_index = get_replica_id()
    os.environ["TF_CONFIG"] = json.dumps(
        {
            "cluster": {
                "worker": hostnames,
            },
            "task": {"type": "worker", "index": replica_index},
        }
    )


class Model(tf.keras.models.Model):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        self._embedder = tf.keras.Sequential(
            [
                tf.keras.layers.Conv2D(
                    filters=8,
                    kernel_size=3,
                    padding="same",
                    activation=tf.keras.activations.relu,
                    use_bias=False,
                ),
                tf.keras.layers.BatchNormalization(),
                tf.keras.layers.Conv2D(
                    filters=8,
                    kernel_size=3,
                    padding="same",
                    activation=tf.keras.activations.relu,
                    use_bias=False,
                ),
                tf.keras.layers.BatchNormalization(),
                tf.keras.layers.MaxPool2D(),
                tf.keras.layers.GlobalAveragePooling2D(),
            ]
        )
        self._classifier = tf.keras.layers.Dense(550)

    def call(self, x: tf.Tensor) -> tf.Tensor:
        x = self._embedder(x)
        x = self._classifier(x)
        x = tf.keras.layers.Activation("linear", dtype="float32")(x)
        return x


def create_dummy_dataset(batch_size: int) -> tf.data.Dataset:
    X = np.random.rand(batch_size, 384, 640, 1)
    y = np.random.randint(550, size=batch_size)
    return tf.data.Dataset.from_tensor_slices((X, y)).batch(batch_size, True).repeat()


def train():
    set_multiworker_env_config()
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    num_replicas = strategy.num_replicas_in_sync
    batch_size = 16 * num_replicas
    dataset = create_dummy_dataset(batch_size)
    dataset = strategy.experimental_distribute_dataset(dataset)
    with strategy.scope():
        model = Model()
        model.compile(
            loss=tf.keras.losses.SparseCategoricalCrossentropy(
                from_logits=True,
            ),
            optimizer=tf.keras.optimizers.Adam(
                learning_rate=1e-3,
                weight_decay=1e-5,
            ),
            metrics=[
                "accuracy",
            ],
            jit_compile=True,
        )
    model.fit(
        dataset,
        epochs=10,
        steps_per_epoch=100,
    )


if __name__ == "__main__":
    train()
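For context, the elided helpers only discover the worker hostnames and this replica's index; the TF_CONFIG they end up producing looks like the following sketch (the hostnames and index here are hypothetical placeholders, not the real cluster values):

```python
import json
import os

# Hypothetical two-worker cluster; real values come from
# get_replica_hostnames() and get_replica_id().
hostnames = ["worker-0:2222", "worker-1:2222"]
replica_index = 0  # this process plays the role of worker 0

os.environ["TF_CONFIG"] = json.dumps(
    {
        "cluster": {"worker": hostnames},
        "task": {"type": "worker", "index": replica_index},
    }
)

print(os.environ["TF_CONFIG"])
```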
Relevant log output
Log output for 4 GPUs:
2024-01-08 12:48:24.103746: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-08 12:48:24.103807: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-08 12:48:24.104896: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-08 12:48:24.111875: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-08 12:48:24.978525: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-01-08 12:48:27.492019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 18370 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB MIG 3g.20gb, pci bus id: 0000:01:00.0, compute capability: 8.0
2024-01-08 12:48:27.503592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:worker/replica:0/task:0/device:GPU:0 with 18370 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB MIG 3g.20gb, pci bus id: 0000:01:00.0, compute capability: 8.0
2024-01-08 12:48:27.528772: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:457] Started server with target: grpc://gen-svc-4e03d749-0565-47d7-b7ff-63437f0ab5b3:80
2024-01-08 12:48:27.535501: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service.cc:553] /job:worker/replica:0/task:0 has connected to coordination service. Incarnation: 9493928235207696637
2024-01-08 12:48:27.535846: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service_agent.cc:304] Coordination agent has successfully connected.
2024-01-08 12:48:28.425710: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service.cc:553] /job:worker/replica:0/task:1 has connected to coordination service. Incarnation: 17537869834892823189
2024-01-08 12:48:29.509519: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service.cc:553] /job:worker/replica:0/task:2 has connected to coordination service. Incarnation: 11924625857490357420
2024-01-08 12:48:29.766944: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service.cc:553] /job:worker/replica:0/task:3 has connected to coordination service. Incarnation: 8010175117178506894
WARNING:absl:You use TensorFlow DType <dtype: 'string'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to object.
WARNING:absl:You use TensorFlow DType <dtype: 'int64'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to int64.
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:563 [0] NCCL INFO Bootstrap : Using eth0:10.233.118.112<0>
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:563 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
2024-01-08 12:48:32.756935: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2024-01-08 12:48:32.756993: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2024-01-08 12:48:32.757173: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:1883] Profiler found 1 GPUs
2024-01-08 12:48:32.790801: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2024-01-08 12:48:32.790934: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:2017] CUPTI activity buffer flushed
Epoch 1/10
2024-01-08 12:48:37.520453: I external/local_xla/xla/service/service.cc:168] XLA service 0x7fe284006a40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-01-08 12:48:37.520583: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA A100-SXM4-40GB MIG 3g.20gb, Compute Capability 8.0
2024-01-08 12:48:37.680983: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-01-08 12:48:38.693930: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8904
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1704718137.764316 430 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.16.5+cudaCUDA_MAJOR.CUDA_MINOR
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] NCCL INFO Failed to open libibverbs.so[.1]
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] NCCL INFO NET/Socket : Using [0]eth0:10.233.118.112<0>
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] NCCL INFO Using network Socket
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] external/nccl_archive/src/init.cc:642 NCCL WARN Duplicate GPU detected : rank 0 and rank 2 both on CUDA device 1000
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] NCCL INFO external/nccl_archive/src/init.cc:1100 -> 5
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] NCCL INFO external/nccl_archive/src/init.cc:1173 -> 5
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] NCCL INFO external/nccl_archive/src/init.cc:1209 -> 5
2024-01-08 12:48:58.577448: W external/local_xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: external/local_xla/xla/service/gpu/nccl_utils.cc:297: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: invalid usage
2024-01-08 12:48:58.577775: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 16898057275935290807
2024-01-08 12:48:58.577798: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 7386102362502530449
2024-01-08 12:48:58.577844: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 11220456033729140565
Traceback (most recent call last):
File "/kirax_source/train.py", line 346, in <module>
train_tf(args=args, jit_compile=XLA)
File "/kirax_source/train.py", line 246, in train_tf
model.fit(
File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node StatefulPartitionedCall defined at (most recent call last):
File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap
File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
Detected at node StatefulPartitionedCall defined at (most recent call last):
File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap
File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
2 root error(s) found.
(0) INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/local_xla/xla/service/gpu/nccl_utils.cc:297: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: invalid usage; current tracing scope: all-reduce-start.4; current profiling annotation: XlaModule:#hlo_module=a_inference_run_step_7562__.4006,program_id=447#.
[[{{node StatefulPartitionedCall}}]]
[[Reshape_3/_22]]
(1) INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/local_xla/xla/service/gpu/nccl_utils.cc:297: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: invalid usage; current tracing scope: all-reduce-start.4; current profiling annotation: XlaModule:#hlo_module=a_inference_run_step_7562__.4006,program_id=447#.
[[{{node StatefulPartitionedCall}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_7900]
Log output for 2 GPUs:
2024-01-08 13:32:56.555585: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-08 13:32:56.555666: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-08 13:32:56.557055: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-08 13:32:56.563733: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-08 13:32:57.475551: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-01-08 13:32:59.868467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 18370 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB MIG 3g.20gb, pci bus id: 0000:41:00.0, compute capability: 8.0
2024-01-08 13:32:59.880625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:worker/replica:0/task:0/device:GPU:0 with 18370 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB MIG 3g.20gb, pci bus id: 0000:41:00.0, compute capability: 8.0
2024-01-08 13:32:59.903354: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:457] Started server with target: grpc://gen-svc-d871b43e-6f9e-4d8f-9faf-d98a734319f3:80
2024-01-08 13:32:59.912608: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service.cc:553] /job:worker/replica:0/task:0 has connected to coordination service. Incarnation: 12924650074147221766
2024-01-08 13:32:59.913045: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service_agent.cc:304] Coordination agent has successfully connected.
2024-01-08 13:33:00.820050: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service.cc:553] /job:worker/replica:0/task:1 has connected to coordination service. Incarnation: 4023732460757352804
WARNING:absl:You use TensorFlow DType <dtype: 'string'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to object.
WARNING:absl:You use TensorFlow DType <dtype: 'int64'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to int64.
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:784 [0] NCCL INFO Bootstrap : Using eth0:10.233.118.63<0>
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:784 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
2024-01-08 13:33:03.552619: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2024-01-08 13:33:03.552667: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2024-01-08 13:33:03.552814: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:1883] Profiler found 1 GPUs
2024-01-08 13:33:03.587433: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2024-01-08 13:33:03.587611: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:2017] CUPTI activity buffer flushed
Epoch 1/10
2024-01-08 13:33:08.149597: I external/local_xla/xla/service/service.cc:168] XLA service 0x7ff338008540 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-01-08 13:33:08.149730: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA A100-SXM4-40GB MIG 3g.20gb, Compute Capability 8.0
2024-01-08 13:33:08.732338: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-01-08 13:33:09.871101: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8904
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1704720809.106117 425 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.16.5+cudaCUDA_MAJOR.CUDA_MINOR
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Failed to open libibverbs.so[.1]
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO NET/Socket : Using [0]eth0:10.233.118.63<0>
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Using network Socket
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] external/nccl_archive/src/misc/nvmlwrap.cc:183 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Channel 00/02 : 0 1
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Channel 01/02 : 0 1
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO P2P Chunksize set to 131072
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1284 [0] NCCL INFO NCCL_SOCKET_NTHREADS set by environment to 8.
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1284 [0] NCCL INFO NET/Socket: Using 8 threads and 1 sockets per thread
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Channel 00/0 : 1[1000] -> 0[41000] [receive] via NET/Socket/0
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1284 [0] NCCL INFO NET/Socket: Using 8 threads and 1 sockets per thread
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Channel 01/0 : 1[1000] -> 0[41000] [receive] via NET/Socket/0
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Channel 00/0 : 0[41000] -> 1[1000] [send] via NET/Socket/0
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Channel 01/0 : 0[41000] -> 1[1000] [send] via NET/Socket/0
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Connected all rings
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Connected all trees
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO comm 0x7fe61cffcb20 rank 0 nranks 2 cudaDev 0 busId 41000 commId 0xe105507e5746b5a2 - Init COMPLETE
2024-01-08 13:33:29.780117: W external/local_xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: There was an error before calling cuModuleGetFunction (101): cudaErrorInvalidDevice : invalid device ordinal
2024-01-08 13:33:29.780331: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 14229809023376429067
2024-01-08 13:33:29.780349: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10883816187180422187
2024-01-08 13:33:29.780389: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 5243788738829094043
Traceback (most recent call last):
File "/kirax_source/train.py", line 346, in <module>
train_tf(args=args, jit_compile=XLA)
File "/kirax_source/train.py", line 246, in train_tf
model.fit(
File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node StatefulPartitionedCall defined at (most recent call last):
File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap
File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
Detected at node StatefulPartitionedCall defined at (most recent call last):
File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap
File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
2 root error(s) found.
(0) INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.func.launch' failed: There was an error before calling cuModuleGetFunction (101): cudaErrorInvalidDevice : invalid device ordinal; current tracing scope: fusion.274; current profiling annotation: XlaModule:#hlo_module=a_inference_run_step_7562__.4006,program_id=447#.
[[{{node StatefulPartitionedCall}}]]
[[CollectiveReduceV2_1/_17]]
(1) INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.func.launch' failed: There was an error before calling cuModuleGetFunction (101): cudaErrorInvalidDevice : invalid device ordinal; current tracing scope: fusion.274; current profiling annotation: XlaModule:#hlo_module=a_inference_run_step_7562__.4006,program_id=447#.
[[{{node StatefulPartitionedCall}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_7900]
Hi @cyanic-selkie ,
Could you please provide the complete code snippet required to reproduce this problem? Thanks!
Sorry for the delay, @SuryanarayanaY, I updated my issue with the full code snippet.
Hi @cyanic-selkie ,
Since you are using TF 2.15 with tf.keras, you first need to install the tf-keras package with pip install tf-keras,
and then set the environment variable os.environ["TF_USE_LEGACY_KERAS"]="1".
After that you can import keras from tensorflow, or simply use tf.keras.
Can you try this and come back with the outcome? Thanks!
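As a sketch, the suggested setup looks like this; note the flag must be set before TensorFlow is imported (the import itself is left commented out here, since the point is only the ordering):

```python
# First, install the legacy Keras package:
#   pip install tf-keras

import os

# Must be set before `import tensorflow`, otherwise tf.keras has
# already been bound to the default (Keras 3) implementation.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

# import tensorflow as tf  # tf.keras now resolves to the legacy tf-keras package
```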
@SuryanarayanaY I did as you said, there is no difference.
Hi @cyanic-selkie ,
Could you please provide a minimal code snippet for testing? Thanks!
@SuryanarayanaY I already provided a minimal reproducible example in the initial post. Is there something else you need?
Same bug here: 'xla.gpu.all_reduce' failed. I found that if tensorflow.keras.applications.inception_resnet_v2.InceptionResNetV2 is called before the all_reduce, the problem occurs, but if I use a Dense layer as a replacement, it works fine. Can somebody help, please?
My guess is that a preceding BatchNormalization layer, as in the code above, causes the later all_reduce to crash.