TensorFlowASR

XLA bug

Open itsmekhoathekid opened this issue 11 months ago • 3 comments

I got these errors while running with this config:

python /data/npl/Speech2Text/TensorFlowASR-main/examples/train.py --mxp=auto --jit-compile --config-path=/data/npl/Speech2Text/TensorFlowASR-main/examples/models/transducer/rnnt/small.yml.j2 --dataset-type=tfrecord --modeldir=/data/npl/Speech2Text/TensorFlowASR-main/tensorflow_asr/checkpoint --datadir=/data/npl/Speech2Text/TensorFlowASR-main/scripts/data


Epoch 1/300
INFO:tensorflow:Collective all_reduce tensors: 39 all_reduces, num_devices = 8, group_size = 8, implementation = CommunicationImplementation.AUTO, num_packs = 1
INFO:tensorflow:Collective all_reduce tensors: 1 all_reduces, num_devices = 8, group_size = 8, implementation = CommunicationImplementation.AUTO, num_packs = 1
INFO:tensorflow:Collective all_reduce tensors: 1 all_reduces, num_devices = 8, group_size = 8, implementation = CommunicationImplementation.AUTO, num_packs = 1
INFO:tensorflow:Error reported to Coordinator: We failed to lift variable creations out of this tf.function, so this tf.function cannot be run on XLA. A possible workaround is to move variable creation outside of the XLA compiled function.
Traceback (most recent call last):
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/training/coordinator.py", line 293, in stop_on_exception
    yield
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/distribute/mirrored_run.py", line 387, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 946, in _call
    raise errors.UnimplementedError(
tensorflow.python.framework.errors_impl.UnimplementedError: We failed to lift variable creations out of this tf.function, so this tf.function cannot be run on XLA. A possible workaround is to move variable creation outside of the XLA compiled function.
Traceback (most recent call last):
File "/data/npl/Speech2Text/TensorFlowASR-main/examples/train.py", line 110, in <module>
    cli_util.run(main)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow_asr/utils/cli_util.py", line 19, in run
    fire.Fire(component, command=command, name=name)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
File "/data/npl/Speech2Text/TensorFlowASR-main/examples/train.py", line 98, in main
    model.fit(
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow_asr/models/base_model.py", line 544, in fit
    tmp_logs, caching = self.train_function(iterator, caching=caching)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 52, in autograph_handler
    raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.UnimplementedError: in user code:

File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow_asr/models/base_model.py", line 317, in train_function  *
    return step_function(self, iterator, caching)
File "/data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages/tensorflow_asr/models/base_model.py", line 304, in step_function  *
    outputs, caching = model.distribute_strategy.run(run_step, args=(data, caching))

UnimplementedError: We failed to lift variable creations out of this tf.function, so this tf.function cannot be run on XLA. A possible workaround is to move variable creation outside of the XLA compiled function.
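The workaround named in the error message (creating variables outside the XLA-compiled function) can be sketched as below. This is a minimal illustration, not TensorFlowASR code: the `Scale` module and its `size` argument are hypothetical, and only `tf.Module`, `tf.Variable`, and `tf.function(jit_compile=True)` are real TensorFlow APIs. Whether applying the same pattern resolves the error in this report is not confirmed anywhere in this thread.

```python
import tensorflow as tf


class Scale(tf.Module):
    """Hypothetical toy module used only to illustrate the pattern."""

    def __init__(self, size):
        super().__init__()
        # The variable is created eagerly here, before any XLA compilation.
        self.w = tf.Variable(tf.ones([size]), name="w")

    @tf.function(jit_compile=True)
    def __call__(self, x):
        # The compiled function only reads an existing variable;
        # it never creates one, so there is nothing for XLA to lift out.
        return x * self.w


layer = Scale(4)
y = layer(tf.constant([1.0, 2.0, 3.0, 4.0]))
```

In a Keras training loop, the analogous step would presumably be to build the model (and any lazily created optimizer state) before the first jit-compiled training step runs.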

I tried training on a single GPU (A100), but it is extremely slow. Can you please help?

itsmekhoathekid, Mar 01 '25 09:03

Hello, I have received your email. I will reply to you as soon as possible. Best wishes!

Aegon007, Mar 01 '25 09:03

My CUDA and TensorFlow versions:

(/data/npl/Speech2Text/TensorFlowASR-main/venv) npl@uit-dgx01:/data/npl$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
(/data/npl/Speech2Text/TensorFlowASR-main/venv) npl@uit-dgx01:/data/npl$ pip show tensorflow
Name: tensorflow
Version: 2.15.0.post1
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: [email protected]
License: Apache 2.0
Location: /data/npl/Speech2Text/TensorFlowASR-main/venv/lib/python3.9/site-packages
Requires: absl-py, astunparse, flatbuffers, gast, google-pasta, grpcio, h5py, keras, libclang, ml-dtypes, numpy, opt-einsum, packaging, protobuf, setuptools, six, tensorboard, tensorflow-estimator, tensorflow-io-gcs-filesystem, termcolor, typing-extensions, wrapt
Required-by: tensorflow-text, tf_kera

itsmekhoathekid, Mar 01 '25 09:03

@itsmekhoathekid There's a newer version with TensorFlow v2.18 and Keras v3 on the feat-streaming branch. Could you try testing on that?

nglehuy, Mar 13 '25 17:03