TensorFlowASR

Issues in XLA compilation while running on TPU

Open · lamyiowce opened this issue on Jul 27, 2021

Hi, I am using your code to train Conformer on a Cloud TPU VM on GCP. Have you managed to run the Conformer code on a TPU? If so, could you share some details of your setup? Or, if you have come across similar issues, do you have any pointers on how to debug this?

I've encountered this issue:

2021-07-27 14:46:04.911632: I tensorflow/core/tpu/graph_rewrite/encapsulate_tpu_computations_pass.cc:263] Subgraph fingerprint:1868146904613528940
2021-07-27 14:46:12.541836: I tensorflow/core/tpu/kernels/tpu_compilation_cache_interface.cc:435] TPU host compilation cache miss: cache_key(7086181526061168380), session_name()
2021-07-27 14:46:14.141350: I tensorflow/core/tpu/kernels/tpu_compile_op_common.cc:66] TpuCompileOp was cancelled. Sleeping for 300 seconds to give time for TPUCompileOp to finished.
2021-07-27 14:46:17.250612: I tensorflow/core/tpu/kernels/tpu_compile_op_common.cc:175] Compilation of 7086181526061168380 with session name  took 4.708620071s and failed
2021-07-27 14:46:17.250711: F tensorflow/core/tpu/kernels/tpu_program_group.cc:86] Check failed: xla_tpu_programs.size() > 0 (0 vs. 0)
https://symbolize.stripped_domain/r/?trace=7f59d9a1d18b,7f59d9d843bf,7f57a93bbcaf,7f57a9330c2c,7f57a93cb438,7f57a93cbf75,7f57a93c2967,7f57a93c466a,7f579e7466b3,7f579e73a22d,7f57a931b5d0,7f57a9319562,7f579ebe2b16,7f59d9d78608&map=11215d705c5c1891344a2fbb04a963de:7f579f89b000-7f57bb1e2f06,ae24a0835085e125198a198c7eab68d6:7f579dd7e000-7f579f6008de 
*** SIGABRT received by PID 67413 (TID 68201) on cpu 89 from PID 67413; stack trace: ***
PC: @     0x7f59d9a1d18b  (unknown)  raise
    @     0x7f579d26c1e0        976  (unknown)
    @     0x7f59d9d843c0       3920  (unknown)
    @     0x7f57a93bbcb0        944  tensorflow::tpu::TpuProgramGroup::Initialize()
    @     0x7f57a9330c2d       1696  tensorflow::tpu::TpuCompilationCacheExternal::InitializeEntry()
    @     0x7f57a93cb439       1168  tensorflow::tpu::TpuCompilationCacheInterface::CompileIfKeyAbsentHelper()
    @     0x7f57a93cbf76        128  tensorflow::tpu::TpuCompilationCacheInterface::CompileIfKeyAbsent()
    @     0x7f57a93c2968       1280  tensorflow::tpu::TpuCompileOpKernelCommon::ComputeInternal()
    @     0x7f57a93c466b        608  tensorflow::tpu::TpuCompileOpKernelCommon::Compute()
    @     0x7f579e7466b4       2448  tensorflow::(anonymous namespace)::ExecutorState<>::Process()
    @     0x7f579e73a22e         48  std::_Function_handler<>::_M_invoke()
    @     0x7f57a931b5d1        144  Eigen::ThreadPoolTempl<>::WorkerLoop()
    @     0x7f57a9319563         64  std::_Function_handler<>::_M_invoke()
    @     0x7f579ebe2b17         96  tensorflow::(anonymous namespace)::PThread::ThreadFn()
    @     0x7f59d9d78609  (unknown)  start_thread
https://symbolize.stripped_domain/r/?trace=7f59d9a1d18b,7f579d26c1df,7f59d9d843bf,7f57a93bbcaf,7f57a9330c2c,7f57a93cb438,7f57a93cbf75,7f57a93c2967,7f57a93c466a,7f579e7466b3,7f579e73a22d,7f57a931b5d0,7f57a9319562,7f579ebe2b16,7f59d9d78608&map=11215d705c5c1891344a2fbb04a963de:7f579f89b000-7f57bb1e2f06,ae24a0835085e125198a198c7eab68d6:7f579dd7e000-7f579f6008de,ca1b7ab241ee28147b3d590cadb5dc1b:7f579056d000-7f579d59fb20 
E0727 14:46:17.881327   68201 coredump_hook.cc:292] RAW: Remote crash data gathering hook invoked.
E0727 14:46:17.881380   68201 coredump_hook.cc:384] RAW: Skipping coredump since rlimit was 0 at process start.
E0727 14:46:17.881402   68201 client.cc:222] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0727 14:46:17.881411   68201 coredump_hook.cc:447] RAW: Sending fingerprint to remote end.
E0727 14:46:17.881421   68201 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0727 14:46:17.881459   68201 coredump_hook.cc:451] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0727 14:46:17.881473   68201 coredump_hook.cc:525] RAW: Discarding core.
E0727 14:46:18.753614   68201 process_state.cc:771] RAW: Raising signal 6 with default behavior

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

I'm using a slightly edited training script from your examples:


import argparse
import math
import os

from experiments.ml.specaugment.conformer.TensorFlowASR.tensorflow_asr.utils import env_util
from experiments.ml.specaugment.conformer.dataset import SnapshotDataset

logger = env_util.setup_environment()
import tensorflow as tf

DEFAULT_YAML = os.path.join(os.path.abspath(os.path.dirname(__file__)), "config.yml")

tf.keras.backend.clear_session()

parser = argparse.ArgumentParser(prog="Conformer Training")

parser.add_argument("--config", type=str, default=DEFAULT_YAML, help="The file path of model configuration file")
parser.add_argument("--tfrecords", default=False, action="store_true", help="Whether to use tfrecords")
parser.add_argument("--sentence_piece", default=False, action="store_true", help="Whether to use `SentencePiece` model")
parser.add_argument("--subwords", default=False, action="store_true", help="Use subwords")
parser.add_argument("--bs", type=int, default=None, help="Batch size per replica")
parser.add_argument("--spx", type=int, default=1, help="Steps per execution for maximizing performance")
parser.add_argument("--metadata", type=str, default=None, help="Path to file containing metadata")
parser.add_argument("--static_length", default=False, action="store_true", help="Use static lengths")
parser.add_argument("--devices", type=int, nargs="*", default=[0], help="Devices' ids to apply distributed training")
parser.add_argument("--mxp", default=False, action="store_true", help="Enable mixed precision")
parser.add_argument("--pretrained", type=str, default=None, help="Path to pretrained model")

args = parser.parse_args()

tf.config.optimizer.set_experimental_options({"auto_mixed_precision": args.mxp})
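
# Workaround resolver for a Cloud TPU VM, where the TPU is attached to the
# local host instead of being reachable at a remote worker address.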

class LocalTPUClusterResolver(
    tf.distribute.cluster_resolver.TPUClusterResolver):
    """LocalTPUClusterResolver."""

    def __init__(self):
        self._tpu = ''
        self.task_type = 'worker'
        self.task_id = 0

    def master(self, task_type=None, task_id=None, rpc_layer=None):
        return None

    def cluster_spec(self):
        return tf.train.ClusterSpec({})

    def get_tpu_system_metadata(self):
        return tf.tpu.experimental.TPUSystemMetadata(
            num_cores=8,
            num_hosts=1,
            num_of_cores_per_host=8,
            topology=None,
            devices=tf.config.list_logical_devices())

    def num_accelerators(self,
                         task_type=None,
                         task_id=None,
                         config_proto=None):
        return {'TPU': 8}


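# Connect to the local TPU, initialize the TPU system, and build a
# TPUStrategy so that model building and training below run on all 8 cores.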
def setup_tpu():
    resolver = LocalTPUClusterResolver()
    # resolver = tf.distribute.cluster_resolver.TPUClusterResolver('local')
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)
    logger.info('Using TPU Strategy.')
    return strategy

strategy = setup_tpu()

from experiments.ml.specaugment.conformer.TensorFlowASR.tensorflow_asr.configs.config import Config
from experiments.ml.specaugment.conformer.TensorFlowASR.tensorflow_asr.featurizers import speech_featurizers, text_featurizers
from experiments.ml.specaugment.conformer.TensorFlowASR.tensorflow_asr.models.transducer.conformer import Conformer
from experiments.ml.specaugment.conformer.TensorFlowASR.tensorflow_asr.optimizers.schedules import TransformerSchedule

config = Config(args.config)
speech_featurizer = speech_featurizers.TFSpeechFeaturizer(config.speech_config)

if args.sentence_piece:
    logger.info("Loading SentencePiece model ...")
    text_featurizer = text_featurizers.SentencePieceFeaturizer(config.decoder_config)
elif args.subwords:
    logger.info("Loading subwords ...")
    text_featurizer = text_featurizers.SubwordFeaturizer(config.decoder_config)
else:
    logger.info("Use characters ...")
    text_featurizer = text_featurizers.CharFeaturizer(config.decoder_config)

train_dataset = SnapshotDataset(
    speech_featurizer=speech_featurizer,
    text_featurizer=text_featurizer,
    **vars(config.learning_config.train_dataset_config),
    indefinite=False,
    num_elems_to_load=5000,
    pipeline=[],
    caching_period=0,
    snapshot_path=None,
    service_ip=None,
    wav=True,
    repeat_single_batch=True,
)
eval_dataset = SnapshotDataset(
    speech_featurizer=speech_featurizer,
    text_featurizer=text_featurizer,
    **vars(config.learning_config.eval_dataset_config),
    indefinite=False,
    num_elems_to_load=5000,
    pipeline=[],
    caching_period=0,
    snapshot_path=None,
    service_ip=None,
    wav=True,
    repeat_single_batch=True,
)
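# Static maximum lengths so every example can be padded to a fixed shape;
# XLA/TPU compilation requires static shapes.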
metadata = {
    "max_input_length": 2453,
    "max_label_length": 398,
    "num_entries": 28539
}
train_dataset.load_metadata(metadata)
eval_dataset.load_metadata(metadata)

# if not args.static_length:
#     speech_featurizer.reset_length()
#     text_featurizer.reset_length()

global_batch_size = args.bs or config.learning_config.running_config.batch_size
global_batch_size *= 8  # strategy.num_replicas_in_sync

train_data_loader = train_dataset.create(global_batch_size)
eval_data_loader = eval_dataset.create(global_batch_size)

with strategy.scope():
    # build model
    conformer = Conformer(**config.model_config, vocabulary_size=text_featurizer.num_classes)
    conformer.make(
        speech_featurizer.shape,
        prediction_shape=text_featurizer.prepand_shape,
        batch_size=global_batch_size
    )
    if args.pretrained:
        conformer.load_weights(args.pretrained, by_name=True, skip_mismatch=True)
    conformer.summary(line_length=100)
    optimizer = tf.keras.optimizers.Adam(
        TransformerSchedule(
            d_model=conformer.dmodel,
            warmup_steps=config.learning_config.optimizer_config.pop("warmup_steps", 10000),
            max_lr=(0.05 / math.sqrt(conformer.dmodel))
        ),
        **config.learning_config.optimizer_config
    )
    conformer.compile(
        optimizer=optimizer,
        experimental_steps_per_execution=args.spx,
        global_batch_size=global_batch_size,
        blank=text_featurizer.blank
    )

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(**config.learning_config.running_config.checkpoint),
    tf.keras.callbacks.experimental.BackupAndRestore(config.learning_config.running_config.states_dir),
    tf.keras.callbacks.TensorBoard(**config.learning_config.running_config.tensorboard)
]

conformer.fit(
    train_data_loader,
    epochs=config.learning_config.running_config.num_epochs,
    validation_data=eval_data_loader,
    callbacks=callbacks,
    steps_per_epoch=train_dataset.total_steps,
    validation_steps=eval_dataset.total_steps if eval_data_loader else None
)

SnapshotDataset is a class very similar to ASRDataset, with slight changes. I'm using LibriSpeech as the dataset.

lamyiowce · Jul 27 '21, 15:07

I've managed to run models on TPU in Colab with GCP. You can try running the script on a Colab TPU to see if it works. Make sure you use supported dtypes (TPU does not support tf.string, so every data point in the tf.data dataset must be numeric) and fixed shapes (with padding for variable sizes).
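
For illustration, here is a minimal sketch of a TPU-friendly input pipeline along those lines (not from the thread): raw_dataset, global_batch_size, and NUM_MEL_BINS are hypothetical names, and the maximum lengths are taken from the metadata in the script above.

import tensorflow as tf

MAX_INPUT_LENGTH = 2453  # from the metadata above
MAX_LABEL_LENGTH = 398
NUM_MEL_BINS = 80  # assumption: depends on speech_config

def to_numeric(features, label_ids):
    # TPU cannot consume tf.string tensors, so transcripts must already be
    # encoded as integer IDs (e.g. by the text featurizer) before this point.
    return tf.cast(features, tf.float32), tf.cast(label_ids, tf.int32)

dataset = (
    raw_dataset  # assumed: elements are (features, label_ids) numeric tensors
    .map(to_numeric, num_parallel_calls=tf.data.AUTOTUNE)
    # Pad every example to the same static shape so XLA compiles one program
    # instead of failing on varying sequence lengths.
    .padded_batch(
        batch_size=global_batch_size,
        padded_shapes=([MAX_INPUT_LENGTH, NUM_MEL_BINS], [MAX_LABEL_LENGTH]),
        drop_remainder=True,  # TPU also needs a fixed batch dimension
    )
    .prefetch(tf.data.AUTOTUNE)
)

With drop_remainder=True and fully padded shapes, every batch is identical in shape, which is what the TPU compiler expects.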

nglehuy · Aug 01 '21, 17:08

I’ll close the issue here due to inactivity. Feel free to reopen if you have further questions.

nglehuy · Sep 02 '22, 05:09