Issues in XLA compilation while running on TPU
Hi, I am using your code to train a Conformer model on a Cloud TPU VM on GCP. Have you managed to run the Conformer code on a TPU? If so, could you share some details of your setup? Or, if you have come across similar issues, do you have any pointers on how to debug this?
I've encountered this issue:
2021-07-27 14:46:04.911632: I tensorflow/core/tpu/graph_rewrite/encapsulate_tpu_computations_pass.cc:263] Subgraph fingerprint:1868146904613528940
2021-07-27 14:46:12.541836: I tensorflow/core/tpu/kernels/tpu_compilation_cache_interface.cc:435] TPU host compilation cache miss: cache_key(7086181526061168380), session_name()
2021-07-27 14:46:14.141350: I tensorflow/core/tpu/kernels/tpu_compile_op_common.cc:66] TpuCompileOp was cancelled. Sleeping for 300 seconds to give time for TPUCompileOp to finished.
2021-07-27 14:46:17.250612: I tensorflow/core/tpu/kernels/tpu_compile_op_common.cc:175] Compilation of 7086181526061168380 with session name took 4.708620071s and failed
2021-07-27 14:46:17.250711: F tensorflow/core/tpu/kernels/tpu_program_group.cc:86] Check failed: xla_tpu_programs.size() > 0 (0 vs. 0)
https://symbolize.stripped_domain/r/?trace=7f59d9a1d18b,7f59d9d843bf,7f57a93bbcaf,7f57a9330c2c,7f57a93cb438,7f57a93cbf75,7f57a93c2967,7f57a93c466a,7f579e7466b3,7f579e73a22d,7f57a931b5d0,7f57a9319562,7f579ebe2b16,7f59d9d78608&map=11215d705c5c1891344a2fbb04a963de:7f579f89b000-7f57bb1e2f06,ae24a0835085e125198a198c7eab68d6:7f579dd7e000-7f579f6008de
*** SIGABRT received by PID 67413 (TID 68201) on cpu 89 from PID 67413; stack trace: ***
PC: @ 0x7f59d9a1d18b (unknown) raise
@ 0x7f579d26c1e0 976 (unknown)
@ 0x7f59d9d843c0 3920 (unknown)
@ 0x7f57a93bbcb0 944 tensorflow::tpu::TpuProgramGroup::Initialize()
@ 0x7f57a9330c2d 1696 tensorflow::tpu::TpuCompilationCacheExternal::InitializeEntry()
@ 0x7f57a93cb439 1168 tensorflow::tpu::TpuCompilationCacheInterface::CompileIfKeyAbsentHelper()
@ 0x7f57a93cbf76 128 tensorflow::tpu::TpuCompilationCacheInterface::CompileIfKeyAbsent()
@ 0x7f57a93c2968 1280 tensorflow::tpu::TpuCompileOpKernelCommon::ComputeInternal()
@ 0x7f57a93c466b 608 tensorflow::tpu::TpuCompileOpKernelCommon::Compute()
@ 0x7f579e7466b4 2448 tensorflow::(anonymous namespace)::ExecutorState<>::Process()
@ 0x7f579e73a22e 48 std::_Function_handler<>::_M_invoke()
@ 0x7f57a931b5d1 144 Eigen::ThreadPoolTempl<>::WorkerLoop()
@ 0x7f57a9319563 64 std::_Function_handler<>::_M_invoke()
@ 0x7f579ebe2b17 96 tensorflow::(anonymous namespace)::PThread::ThreadFn()
@ 0x7f59d9d78609 (unknown) start_thread
https://symbolize.stripped_domain/r/?trace=7f59d9a1d18b,7f579d26c1df,7f59d9d843bf,7f57a93bbcaf,7f57a9330c2c,7f57a93cb438,7f57a93cbf75,7f57a93c2967,7f57a93c466a,7f579e7466b3,7f579e73a22d,7f57a931b5d0,7f57a9319562,7f579ebe2b16,7f59d9d78608&map=11215d705c5c1891344a2fbb04a963de:7f579f89b000-7f57bb1e2f06,ae24a0835085e125198a198c7eab68d6:7f579dd7e000-7f579f6008de,ca1b7ab241ee28147b3d590cadb5dc1b:7f579056d000-7f579d59fb20
E0727 14:46:17.881327 68201 coredump_hook.cc:292] RAW: Remote crash data gathering hook invoked.
E0727 14:46:17.881380 68201 coredump_hook.cc:384] RAW: Skipping coredump since rlimit was 0 at process start.
E0727 14:46:17.881402 68201 client.cc:222] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0727 14:46:17.881411 68201 coredump_hook.cc:447] RAW: Sending fingerprint to remote end.
E0727 14:46:17.881421 68201 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0727 14:46:17.881459 68201 coredump_hook.cc:451] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0727 14:46:17.881473 68201 coredump_hook.cc:525] RAW: Discarding core.
E0727 14:46:18.753614 68201 process_state.cc:771] RAW: Raising signal 6 with default behavior
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
I'm using a slightly edited training script from your examples:
import argparse
import math
import os

from experiments.ml.specaugment.conformer.TensorFlowASR.tensorflow_asr.utils import env_util
from experiments.ml.specaugment.conformer.dataset import SnapshotDataset

logger = env_util.setup_environment()

import tensorflow as tf

DEFAULT_YAML = os.path.join(os.path.abspath(os.path.dirname(__file__)), "config.yml")

tf.keras.backend.clear_session()

parser = argparse.ArgumentParser(prog="Conformer Training")
parser.add_argument("--config", type=str, default=DEFAULT_YAML, help="The file path of model configuration file")
parser.add_argument("--tfrecords", default=False, action="store_true", help="Whether to use tfrecords")
parser.add_argument("--sentence_piece", default=False, action="store_true", help="Whether to use `SentencePiece` model")
parser.add_argument("--subwords", default=False, action="store_true", help="Use subwords")
parser.add_argument("--bs", type=int, default=None, help="Batch size per replica")
parser.add_argument("--spx", type=int, default=1, help="Steps per execution for maximizing performance")
parser.add_argument("--metadata", type=str, default=None, help="Path to file containing metadata")
parser.add_argument("--static_length", default=False, action="store_true", help="Use static lengths")
parser.add_argument("--devices", type=int, nargs="*", default=[0], help="Devices' ids to apply distributed training")
parser.add_argument("--mxp", default=False, action="store_true", help="Enable mixed precision")
parser.add_argument("--pretrained", type=str, default=None, help="Path to pretrained model")
args = parser.parse_args()

tf.config.optimizer.set_experimental_options({"auto_mixed_precision": args.mxp})


class LocalTPUClusterResolver(tf.distribute.cluster_resolver.TPUClusterResolver):
    """LocalTPUClusterResolver."""

    def __init__(self):
        self._tpu = ''
        self.task_type = 'worker'
        self.task_id = 0

    def master(self, task_type=None, task_id=None, rpc_layer=None):
        return None

    def cluster_spec(self):
        return tf.train.ClusterSpec({})

    def get_tpu_system_metadata(self):
        return tf.tpu.experimental.TPUSystemMetadata(
            num_cores=8,
            num_hosts=1,
            num_of_cores_per_host=8,
            topology=None,
            devices=tf.config.list_logical_devices())

    def num_accelerators(self, task_type=None, task_id=None, config_proto=None):
        return {'TPU': 8}


def setup_tpu():
    resolver = LocalTPUClusterResolver()
    # resolver = tf.distribute.cluster_resolver.TPUClusterResolver('local')
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)
    logger.info('Using TPU Strategy.')
    return strategy


strategy = setup_tpu()

from experiments.ml.specaugment.conformer.TensorFlowASR.tensorflow_asr.configs.config import Config
from experiments.ml.specaugment.conformer.TensorFlowASR.tensorflow_asr.featurizers import speech_featurizers, text_featurizers
from experiments.ml.specaugment.conformer.TensorFlowASR.tensorflow_asr.models.transducer.conformer import Conformer
from experiments.ml.specaugment.conformer.TensorFlowASR.tensorflow_asr.optimizers.schedules import TransformerSchedule

config = Config(args.config)
speech_featurizer = speech_featurizers.TFSpeechFeaturizer(config.speech_config)

if args.sentence_piece:
    logger.info("Loading SentencePiece model ...")
    text_featurizer = text_featurizers.SentencePieceFeaturizer(config.decoder_config)
elif args.subwords:
    logger.info("Loading subwords ...")
    text_featurizer = text_featurizers.SubwordFeaturizer(config.decoder_config)
else:
    logger.info("Use characters ...")
    text_featurizer = text_featurizers.CharFeaturizer(config.decoder_config)

train_dataset = SnapshotDataset(
    speech_featurizer=speech_featurizer,
    text_featurizer=text_featurizer,
    **vars(config.learning_config.train_dataset_config),
    indefinite=False,
    num_elems_to_load=5000,
    pipeline=[],
    caching_period=0,
    snapshot_path=None,
    service_ip=None,
    wav=True,
    repeat_single_batch=True,
)
eval_dataset = SnapshotDataset(
    speech_featurizer=speech_featurizer,
    text_featurizer=text_featurizer,
    **vars(config.learning_config.eval_dataset_config),
    indefinite=False,
    num_elems_to_load=5000,
    pipeline=[],
    caching_period=0,
    snapshot_path=None,
    service_ip=None,
    wav=True,
    repeat_single_batch=True,
)

metadata = {
    "max_input_length": 2453,
    "max_label_length": 398,
    "num_entries": 28539
}
train_dataset.load_metadata(metadata)
eval_dataset.load_metadata(metadata)

# if not args.static_length:
#     speech_featurizer.reset_length()
#     text_featurizer.reset_length()

global_batch_size = args.bs or config.learning_config.running_config.batch_size
global_batch_size *= 8  # strategy.num_replicas_in_sync

train_data_loader = train_dataset.create(global_batch_size)
eval_data_loader = eval_dataset.create(global_batch_size)

with strategy.scope():
    # build model
    conformer = Conformer(**config.model_config, vocabulary_size=text_featurizer.num_classes)
    conformer.make(
        speech_featurizer.shape,
        prediction_shape=text_featurizer.prepand_shape,
        batch_size=global_batch_size
    )
    if args.pretrained:
        conformer.load_weights(args.pretrained, by_name=True, skip_mismatch=True)
    conformer.summary(line_length=100)
    optimizer = tf.keras.optimizers.Adam(
        TransformerSchedule(
            d_model=conformer.dmodel,
            warmup_steps=config.learning_config.optimizer_config.pop("warmup_steps", 10000),
            max_lr=(0.05 / math.sqrt(conformer.dmodel))
        ),
        **config.learning_config.optimizer_config
    )
    conformer.compile(
        optimizer=optimizer,
        experimental_steps_per_execution=args.spx,
        global_batch_size=global_batch_size,
        blank=text_featurizer.blank
    )

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(**config.learning_config.running_config.checkpoint),
    tf.keras.callbacks.experimental.BackupAndRestore(config.learning_config.running_config.states_dir),
    tf.keras.callbacks.TensorBoard(**config.learning_config.running_config.tensorboard)
]

conformer.fit(
    train_data_loader,
    epochs=config.learning_config.running_config.num_epochs,
    validation_data=eval_data_loader,
    callbacks=callbacks,
    steps_per_epoch=train_dataset.total_steps,
    validation_steps=eval_dataset.total_steps if eval_data_loader else None
)
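An example invocation of the script would look like the following (the script name and flag values are placeholders, not necessarily the exact ones used; the flags are the ones defined by the argparse block above):

python train_conformer_tpu.py \
    --config config.yml \
    --bs 4 \
    --spx 1 \
    --mxp \
    --subwords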
SnapshotDataset is a class very similar to ASRDataset, with slight changes; see the sketch below. I'm training on the LibriSpeech dataset.
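A minimal sketch of its constructor, inferred purely from the call sites in the script above (this is not the actual implementation, and the ASRDataset import path assumes the vendored TensorFlowASR checkout):

# Hypothetical sketch of SnapshotDataset; the real class lives in
# experiments.ml.specaugment.conformer.dataset and differs in its internals.
from experiments.ml.specaugment.conformer.TensorFlowASR.tensorflow_asr.datasets.asr_dataset import ASRDataset


class SnapshotDataset(ASRDataset):
    """Sketch only: mirrors the constructor arguments used in the script above."""

    def __init__(self, speech_featurizer, text_featurizer, indefinite=False,
                 num_elems_to_load=None, pipeline=None, caching_period=0,
                 snapshot_path=None, service_ip=None, wav=True,
                 repeat_single_batch=False, **kwargs):
        super().__init__(speech_featurizer=speech_featurizer,
                         text_featurizer=text_featurizer,
                         indefinite=indefinite, **kwargs)
        # Extra knobs, presumably controlling snapshotting / data loading.
        self.num_elems_to_load = num_elems_to_load
        self.pipeline = pipeline or []
        self.caching_period = caching_period
        self.snapshot_path = snapshot_path
        self.service_ip = service_ip
        self.wav = wav
        self.repeat_single_batch = repeat_single_batch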
I've managed to run models on a Colab TPU with GCP. You can try running the script on a Colab TPU to see if it works. Make sure you use supported dtypes (the TPU does not support tf.string, so every data point in the tf.data dataset must be numeric) and fixed shapes (with padding for variable sizes); a sketch follows.
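As a minimal illustration of those two constraints (the dataset, feature dimension, and batch size here are hypothetical, not from this repo): tokenize everything to numbers on the host, then pad each batch to a static shape.

import tensorflow as tf

# Sketch of a TPU-friendly input pipeline: every element is numeric
# (no tf.string ever reaches the TPU) and every batch has a static shape.
MAX_INPUT_LEN = 2453  # e.g. max_input_length from the metadata above
MAX_LABEL_LEN = 398   # e.g. max_label_length from the metadata above
NUM_MEL_BINS = 80     # assumed feature dimension, for illustration only

# Dummy stand-in for a dataset whose elements are already numeric:
# (features [time, mel_bins] float32, label_ids [label_len] int32).
raw_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([4, 100, NUM_MEL_BINS]),
     tf.ones([4, 20], tf.int32)))

def cast_numeric(features, label_ids):
    # Keep dtypes explicit; strings must have been tokenized before this point.
    return tf.cast(features, tf.float32), tf.cast(label_ids, tf.int32)

batched = (
    raw_ds
    .map(cast_numeric, num_parallel_calls=tf.data.AUTOTUNE)
    .padded_batch(
        batch_size=2,
        padded_shapes=([MAX_INPUT_LEN, NUM_MEL_BINS], [MAX_LABEL_LEN]),  # static
        padding_values=(0.0, 0),
        drop_remainder=True,  # keep the batch dimension static as well
    )
    .prefetch(tf.data.AUTOTUNE)
)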
I’ll close the issue here due to inactivity. Feel free to reopen if you have further questions.