XLA bug
i got these errros while running with config :
python /data/npl/Speech2Text/TensorFlowASR-main/examples/train.py --mxp=auto --jit-compile --config-path=/data/npl/Speech2Text/TensorFlowASR-main/examples/models/transducer/rnnt/small.yml.j2 --dataset-type=tfrecord --modeldir=/data/npl/Speech2Text/TensorFlowASR-main/tensorflow_asr/checkpoint --datadir=/data/npl/Speech2Text/TensorFlowASR-main/scripts/data
Epoch 1/300
INFO:tensorflow:Collective all_reduce tensors: 39 all_reduces, num_devices = 8, group_size = 8, implementation = CommunicationImplementation.AUTO, num_packs = 1
INFO:tensorflow:Collective all_reduce tensors: 1 all_reduces, num_devices = 8, group_size = 8, implementation = CommunicationImplementation.AUTO, num_packs = 1
INFO:tensorflow:Collective all_reduce tensors: 1 all_reduces, num_devices = 8, group_size = 8, implementation = CommunicationImplementation.AUTO, num_packs = 1
INFO:tensorflow:Error reported to Coordinator: We failed to lift variable creations out of this tf.function, so this tf.function cannot be run on XLA. A possible workaround is to move variable creation outside of the XLA compiled function.
Traceback (most recent call last):
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/training/coordinator.py", line 293, in stop_on_exception
yield
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/distribute/mirrored_run.py", line 387, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 946, in _call
raise errors.UnimplementedError(
tensorflow.python.framework.errors_impl.UnimplementedError: We failed to lift variable creations out of this tf.function, so this tf.function cannot be run on XLA. A possible workaround is to move variable creation outside of the XLA compiled function.
Traceback (most recent call last):
File "/data/npl/Speech2Text/TensorFlowASR-main/examples/train.py", line 110, in
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow_asr/models/base_model.py", line 317, in train_function *
return step_function(self, iterator, caching)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow_asr/models/base_model.py", line 304, in step_function *
outputs, caching = model.distribute_strategy.run(run_step, args=(data, caching))
UnimplementedError: We failed to lift variable creations out of this tf.function, so this tf.function cannot be run on XLA. A possible workaround is to move variable creation outside of the XLA compiled function.
i tried to use one gpu to train (A100) but its extremely slow. Can you please help .
您好,您的邮件我已收到。我会尽快给您回复。祝好!
my cuda and tensorflow version :
(/data/npl/Speech2Text/TensorFlowASR-main/venv) npl@uit-dgx01:/data/npl$ nvcc --version nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2023 NVIDIA Corporation Built on Tue_Feb__7_19:32:13_PST_2023 Cuda compilation tools, release 12.1, V12.1.66 Build cuda_12.1.r12.1/compiler.32415258_0 (/data/npl/Speech2Text/TensorFlowASR-main/venv) npl@uit-dgx01:/data/npl$ pip show tensorflow Name: tensorflow Version: 2.15.0.post1 Summary: TensorFlow is an open source machine learning framework for everyone. Home-page: https://www.tensorflow.org/ Author: Google Inc. Author-email: [email protected] License: Apache 2.0 Location: /data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages Requires: absl-py, astunparse, flatbuffers, gast, google-pasta, grpcio, h5py, keras, libclang, ml-dtypes, numpy, opt-einsum, packaging, protobuf, setuptools, six, tensorboard, tensorflow-estimator, tensorflow-io-gcs-filesystem, termcolor, typing-extensions, wrapt Required-by: tensorflow-text, tf_kera
@itsmekhoathekid there's a newer version with tf v2.18 and keras v3 on branch feat-streaming.
Can you consider testing on that?