NCCL + XLA fails for multi-GPU training.
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
binary
TensorFlow version
tf 2.15
Custom code
Yes
OS platform and distribution
Linux Ubuntu 22.04
Mobile device
No response
Python version
3.11
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
12.2/8.9.4
GPU model and memory
A100 40GB (20GB MiG)
Current behavior?
I am trying to run multi-GPU training with an XLA-compiled model (a simple CNN with a classification head). Without XLA, everything runs fine. With XLA enabled, I get one of two errors in the log, depending on whether I am using 4 GPUs or 2 GPUs. Each physical GPU is split into two MiG instances.
I also tried previous TF/CUDA versions and got the same result.
Standalone code to reproduce the issue
import argparse
import json
import os
from typing import Any

import numpy as np
import tensorflow as tf


def get_replica_hostnames():
    ...


def get_replica_id():
    ...


def set_multiworker_env_config():
    hostnames = get_replica_hostnames()
    replica_index = get_replica_id()
    os.environ["TF_CONFIG"] = json.dumps(
        {
            "cluster": {
                "worker": hostnames,
            },
            "task": {"type": "worker", "index": replica_index},
        }
    )


class Model(tf.keras.models.Model):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        self._embedder = tf.keras.Sequential(
            [
                tf.keras.layers.Conv2D(
                    filters=8,
                    kernel_size=3,
                    padding="same",
                    activation=tf.keras.activations.relu,
                    use_bias=False,
                ),
                tf.keras.layers.BatchNormalization(),
                tf.keras.layers.Conv2D(
                    filters=8,
                    kernel_size=3,
                    padding="same",
                    activation=tf.keras.activations.relu,
                    use_bias=False,
                ),
                tf.keras.layers.BatchNormalization(),
                tf.keras.layers.MaxPool2D(),
                tf.keras.layers.GlobalAveragePooling2D(),
            ]
        )
        self._classifier = tf.keras.layers.Dense(550)

    def call(self, x: tf.Tensor) -> tf.Tensor:
        x = self._embedder(x)
        x = self._classifier(x)
        x = tf.keras.layers.Activation("linear", dtype="float32")(x)
        return x


def create_dummy_dataset(batch_size: int) -> tf.data.Dataset:
    X = np.random.rand(batch_size, 384, 640, 1)
    y = np.random.randint(550, size=batch_size)
    return tf.data.Dataset.from_tensor_slices((X, y)).batch(batch_size, True).repeat()


def train():
    set_multiworker_env_config()
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    num_replicas = strategy.num_replicas_in_sync
    batch_size = 16 * num_replicas
    dataset = create_dummy_dataset(batch_size)
    dataset = strategy.experimental_distribute_dataset(dataset)
    with strategy.scope():
        model = Model()
        model.compile(
            loss=tf.keras.losses.SparseCategoricalCrossentropy(
                from_logits=True,
            ),
            optimizer=tf.keras.optimizers.Adam(
                learning_rate=1e-3,
                weight_decay=1e-5,
            ),
            metrics=[
                "accuracy",
            ],
            jit_compile=True,
        )
    model.fit(
        dataset,
        epochs=10,
        steps_per_epoch=100,
    )


if __name__ == "__main__":
    train()
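For context, the elided helpers only discover the worker hostnames and this replica's index; the TF_CONFIG they end up producing looks like the following sketch (the hostnames and index here are hypothetical placeholders, not the real cluster values):

```python
import json
import os

# Hypothetical two-worker cluster; real values come from
# get_replica_hostnames() and get_replica_id().
hostnames = ["worker-0:2222", "worker-1:2222"]
replica_index = 0  # this process plays the role of worker 0

os.environ["TF_CONFIG"] = json.dumps(
    {
        "cluster": {"worker": hostnames},
        "task": {"type": "worker", "index": replica_index},
    }
)

print(os.environ["TF_CONFIG"])
```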
Relevant log output
Log output for 4 GPUs:
2024-01-08 12:48:24.103746: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-08 12:48:24.103807: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-08 12:48:24.104896: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-08 12:48:24.111875: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-08 12:48:24.978525: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-01-08 12:48:27.492019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 18370 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB MIG 3g.20gb, pci bus id: 0000:01:00.0, compute capability: 8.0
2024-01-08 12:48:27.503592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:worker/replica:0/task:0/device:GPU:0 with 18370 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB MIG 3g.20gb, pci bus id: 0000:01:00.0, compute capability: 8.0
2024-01-08 12:48:27.528772: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:457] Started server with target: grpc://gen-svc-4e03d749-0565-47d7-b7ff-63437f0ab5b3:80
2024-01-08 12:48:27.535501: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service.cc:553] /job:worker/replica:0/task:0 has connected to coordination service. Incarnation: 9493928235207696637
2024-01-08 12:48:27.535846: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service_agent.cc:304] Coordination agent has successfully connected.
2024-01-08 12:48:28.425710: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service.cc:553] /job:worker/replica:0/task:1 has connected to coordination service. Incarnation: 17537869834892823189
2024-01-08 12:48:29.509519: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service.cc:553] /job:worker/replica:0/task:2 has connected to coordination service. Incarnation: 11924625857490357420
2024-01-08 12:48:29.766944: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service.cc:553] /job:worker/replica:0/task:3 has connected to coordination service. Incarnation: 8010175117178506894
WARNING:absl:You use TensorFlow DType <dtype: 'string'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to object.
WARNING:absl:You use TensorFlow DType <dtype: 'int64'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to int64.
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:563 [0] NCCL INFO Bootstrap : Using eth0:10.233.118.112<0>
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:563 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
2024-01-08 12:48:32.756935: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2024-01-08 12:48:32.756993: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2024-01-08 12:48:32.757173: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:1883] Profiler found 1 GPUs
2024-01-08 12:48:32.790801: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2024-01-08 12:48:32.790934: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:2017] CUPTI activity buffer flushed
Epoch 1/10
2024-01-08 12:48:37.520453: I external/local_xla/xla/service/service.cc:168] XLA service 0x7fe284006a40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-01-08 12:48:37.520583: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA A100-SXM4-40GB MIG 3g.20gb, Compute Capability 8.0
2024-01-08 12:48:37.680983: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-01-08 12:48:38.693930: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8904
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1704718137.764316 430 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.16.5+cudaCUDA_MAJOR.CUDA_MINOR
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] NCCL INFO Failed to open libibverbs.so[.1]
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] NCCL INFO NET/Socket : Using [0]eth0:10.233.118.112<0>
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] NCCL INFO Using network Socket
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] external/nccl_archive/src/init.cc:642 NCCL WARN Duplicate GPU detected : rank 0 and rank 2 both on CUDA device 1000
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] NCCL INFO external/nccl_archive/src/init.cc:1100 -> 5
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] NCCL INFO external/nccl_archive/src/init.cc:1173 -> 5
ml-wf-receipt-ext-logo-classifier-pipelinenxxxs-train-template:37:1288 [0] NCCL INFO external/nccl_archive/src/init.cc:1209 -> 5
2024-01-08 12:48:58.577448: W external/local_xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: external/local_xla/xla/service/gpu/nccl_utils.cc:297: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: invalid usage
2024-01-08 12:48:58.577775: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 16898057275935290807
2024-01-08 12:48:58.577798: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 7386102362502530449
2024-01-08 12:48:58.577844: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 11220456033729140565
Traceback (most recent call last):
File "/kirax_source/train.py", line 346, in <module>
train_tf(args=args, jit_compile=XLA)
File "/kirax_source/train.py", line 246, in train_tf
model.fit(
File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node StatefulPartitionedCall defined at (most recent call last):
File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap
File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
Detected at node StatefulPartitionedCall defined at (most recent call last):
File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap
File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
2 root error(s) found.
(0) INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/local_xla/xla/service/gpu/nccl_utils.cc:297: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: invalid usage; current tracing scope: all-reduce-start.4; current profiling annotation: XlaModule:#hlo_module=a_inference_run_step_7562__.4006,program_id=447#.
[[{{node StatefulPartitionedCall}}]]
[[Reshape_3/_22]]
(1) INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/local_xla/xla/service/gpu/nccl_utils.cc:297: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: invalid usage; current tracing scope: all-reduce-start.4; current profiling annotation: XlaModule:#hlo_module=a_inference_run_step_7562__.4006,program_id=447#.
[[{{node StatefulPartitionedCall}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_7900]
Log output for 2 GPUs:
2024-01-08 13:32:56.555585: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-08 13:32:56.555666: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-08 13:32:56.557055: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-08 13:32:56.563733: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-08 13:32:57.475551: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-01-08 13:32:59.868467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 18370 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB MIG 3g.20gb, pci bus id: 0000:41:00.0, compute capability: 8.0
2024-01-08 13:32:59.880625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:worker/replica:0/task:0/device:GPU:0 with 18370 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB MIG 3g.20gb, pci bus id: 0000:41:00.0, compute capability: 8.0
2024-01-08 13:32:59.903354: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:457] Started server with target: grpc://gen-svc-d871b43e-6f9e-4d8f-9faf-d98a734319f3:80
2024-01-08 13:32:59.912608: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service.cc:553] /job:worker/replica:0/task:0 has connected to coordination service. Incarnation: 12924650074147221766
2024-01-08 13:32:59.913045: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service_agent.cc:304] Coordination agent has successfully connected.
2024-01-08 13:33:00.820050: I external/local_tsl/tsl/distributed_runtime/coordination/coordination_service.cc:553] /job:worker/replica:0/task:1 has connected to coordination service. Incarnation: 4023732460757352804
WARNING:absl:You use TensorFlow DType <dtype: 'string'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to object.
WARNING:absl:You use TensorFlow DType <dtype: 'int64'> in tfds.features This will soon be deprecated in favor of NumPy DTypes. In the meantime it was converted to int64.
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:784 [0] NCCL INFO Bootstrap : Using eth0:10.233.118.63<0>
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:784 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
2024-01-08 13:33:03.552619: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2024-01-08 13:33:03.552667: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2024-01-08 13:33:03.552814: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:1883] Profiler found 1 GPUs
2024-01-08 13:33:03.587433: I external/local_tsl/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2024-01-08 13:33:03.587611: I external/local_xla/xla/backends/profiler/gpu/cupti_tracer.cc:2017] CUPTI activity buffer flushed
Epoch 1/10
2024-01-08 13:33:08.149597: I external/local_xla/xla/service/service.cc:168] XLA service 0x7ff338008540 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-01-08 13:33:08.149730: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA A100-SXM4-40GB MIG 3g.20gb, Compute Capability 8.0
2024-01-08 13:33:08.732338: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-01-08 13:33:09.871101: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8904
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1704720809.106117 425 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.16.5+cudaCUDA_MAJOR.CUDA_MINOR
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Failed to open libibverbs.so[.1]
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO NET/Socket : Using [0]eth0:10.233.118.63<0>
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Using network Socket
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] external/nccl_archive/src/misc/nvmlwrap.cc:183 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Channel 00/02 : 0 1
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Channel 01/02 : 0 1
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO P2P Chunksize set to 131072
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1284 [0] NCCL INFO NCCL_SOCKET_NTHREADS set by environment to 8.
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1284 [0] NCCL INFO NET/Socket: Using 8 threads and 1 sockets per thread
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Channel 00/0 : 1[1000] -> 0[41000] [receive] via NET/Socket/0
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1284 [0] NCCL INFO NET/Socket: Using 8 threads and 1 sockets per thread
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Channel 01/0 : 1[1000] -> 0[41000] [receive] via NET/Socket/0
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Channel 00/0 : 0[41000] -> 1[1000] [send] via NET/Socket/0
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Channel 01/0 : 0[41000] -> 1[1000] [send] via NET/Socket/0
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Connected all rings
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO Connected all trees
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
ml-wf-receipt-ext-logo-classifier-pipelinegl897-train-template:31:1280 [0] NCCL INFO comm 0x7fe61cffcb20 rank 0 nranks 2 cudaDev 0 busId 41000 commId 0xe105507e5746b5a2 - Init COMPLETE
2024-01-08 13:33:29.780117: W external/local_xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: There was an error before calling cuModuleGetFunction (101): cudaErrorInvalidDevice : invalid device ordinal
2024-01-08 13:33:29.780331: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 14229809023376429067
2024-01-08 13:33:29.780349: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10883816187180422187
2024-01-08 13:33:29.780389: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 5243788738829094043
Traceback (most recent call last):
File "/kirax_source/train.py", line 346, in <module>
train_tf(args=args, jit_compile=XLA)
File "/kirax_source/train.py", line 246, in train_tf
model.fit(
File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node StatefulPartitionedCall defined at (most recent call last):
File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap
File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
Detected at node StatefulPartitionedCall defined at (most recent call last):
File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap
File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
2 root error(s) found.
(0) INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.func.launch' failed: There was an error before calling cuModuleGetFunction (101): cudaErrorInvalidDevice : invalid device ordinal; current tracing scope: fusion.274; current profiling annotation: XlaModule:#hlo_module=a_inference_run_step_7562__.4006,program_id=447#.
[[{{node StatefulPartitionedCall}}]]
[[CollectiveReduceV2_1/_17]]
(1) INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.func.launch' failed: There was an error before calling cuModuleGetFunction (101): cudaErrorInvalidDevice : invalid device ordinal; current tracing scope: fusion.274; current profiling annotation: XlaModule:#hlo_module=a_inference_run_step_7562__.4006,program_id=447#.
[[{{node StatefulPartitionedCall}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_7900]
Hi @cyanic-selkie ,
Could you please provide the complete code snippet required to reproduce this problem? Thanks!
Sorry for the delay, @SuryanarayanaY, I updated my issue with the full code snippet.
Hi @cyanic-selkie ,
Since you are using TF 2.15 with tf.keras, you first need to install the tf-keras package with pip install tf-keras,
and then set the environment variable os.environ["TF_USE_LEGACY_KERAS"]="1".
After that you can import keras from tensorflow, or simply use tf.keras.
Can you try this and come back with the outcome? Thanks!
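As a sketch, the suggested setup looks like this; note the flag must be set before TensorFlow is imported (the import itself is left commented out here, since the point is only the ordering):

```python
# First, install the legacy Keras package:
#   pip install tf-keras

import os

# Must be set before `import tensorflow`, otherwise tf.keras has
# already been bound to the default (Keras 3) implementation.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

# import tensorflow as tf  # tf.keras now resolves to the legacy tf-keras package
```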
@SuryanarayanaY I did as you said, there is no difference.
Hi @cyanic-selkie ,
Could you please provide a minimal code snippet for testing? Thanks!
@SuryanarayanaY I already provided a minimal reproducible example in the initial post. Is there something else you need?
Same bug here: 'xla.gpu.all_reduce' failed. I found that if tensorflow.keras.applications.inception_resnet_v2.InceptionResNetV2 is called before the all_reduce, the problem occurs, but if I use a Dense layer as a replacement, it works fine. Can somebody help, please?
My guess is that a preceding BatchNormalization layer, as in the code above, causes the later all_reduce to crash.