Waymo car StarNet task training on TPU: error "(No registered 'GenericInput' OpKernel for XLA_TPU_JIT devices compatible with node {{node GenericInput}}){{node GenericInput}}"
I am running into an error when trying to train the Waymo car StarNet task on a TPU. It looks like a custom op may not have been compiled for the TPU.
Any suggestions on how to resolve it?
Setup
- On GCP I have a VM running Ubuntu 18.04 with a TPU v3-8
- Docker image with base image tensorflow/tensorflow:2.1.0
- Once the Docker image is built, I verify that the TPU is reachable with tf.distribute.cluster_resolver.TPUClusterResolver('grpc://XXXXXXXX:8470') (a minimal sketch of this check follows the commands below)
- bazel build -c opt //lingvo:trainer
- bazel-bin/lingvo/trainer --logtostderr --model=car.waymo.StarNetVehicle --mode=async --logdir=gs://{BUCKET}/logs/starnet.waymo.3d_object.v1 --tpu=grpc://XXXXXXXX:8470 --run_locally=tpu --xla_device=tpu --tpu_compatible=true --worker_tpus=1
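For reference, this is roughly the reachability check I run, a minimal sketch using the TF 2.1 API; the grpc address is the same placeholder as above, not a real endpoint:

```python
# Minimal sketch (TF 2.1): confirm the TPU worker is reachable before training.
# 'grpc://XXXXXXXX:8470' is a placeholder for the actual TPU endpoint.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver('grpc://XXXXXXXX:8470')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print(tf.config.experimental.list_logical_devices('TPU'))  # expect 8 cores on a v3-8
```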
Results in the following error:
```
I0511 12:35:08.059724 140301230589760 base_model.py:528] BaseTask.AdjustGradients
I0511 12:35:10.348839 140301230589760 trainer.py:672] Trainer number of enqueue ops: 1
I0511 12:35:15.939270 140301230589760 trainer.py:1620] Starting runners
I0511 12:35:16.471422 140288135726848 base_runner.py:192] trainer started.
I0511 12:35:16.471785 140301230589760 trainer.py:1629] Total num runner.enqueue_ops: 1
I0511 12:35:16.472355 140301230589760 trainer.py:1633] Starting enqueue op group_deps
I0511 12:35:16.473102 140301230589760 trainer.py:1641] Waiting for runners to finish...
I0511 12:35:16.473241 140301230589760 trainer.py:1643] Waiting for thread to finish: <main.TrainerTpu object at 0x7f99ac0e6b00>
I0511 12:35:16.473450 140288127334144 base_runner.py:192] trainer/enqueue_op/group_deps started.
I0511 12:35:26.430277 140288135726848 trainer.py:832] TrainerTpu: Force restore or initialize.
I0511 12:35:26.850680 140288135726848 checkpointer.py:146] Uninitialized var list: []
I0511 12:35:28.736465 140288135726848 checkpointer.py:163] Initialized all vars.
I0511 12:35:31.464954 140288127334144 checkpointer.py:212] Initializing global step
I0511 12:35:32.240896 140288127334144 base_runner.py:291] params.train.max_steps: 742317, enqueue_max_steps: -1
I0511 12:35:32.638819 140288127334144 base_runner.py:305] Current global_enqueue_steps: 0, local_enqueue_steps: 0, global_step: 0
E0511 12:35:32.689308 140288127334144 base_runner.py:244] trainer/enqueue_op/group_deps done (fatal error): <class 'ValueError'>
I0511 12:35:32.689540 140288127334144 base_runner.py:111] trainer/enqueue_op/group_deps exception: Fetch argument <tf.Operation 'group_deps' type=NoOp> cannot be interpreted as a Tensor. (Operation name: "group_deps" op: "NoOp" input: "^InfeedQueue/enqueue/0" input: "^InfeedQueue/enqueue/1" input: "^InfeedQueue/enqueue/2" input: "^InfeedQueue/enqueue/3" input: "^InfeedQueue/enqueue/4" input: "^InfeedQueue/enqueue/5" input: "^InfeedQueue/enqueue/6" input: "^InfeedQueue/enqueue/7" device: "/task:0/device:CPU:0" is not an element of this graph.)
E0511 12:35:32.691849 140288127334144 base_runner.py:251] Traceback (most recent call last):
E0511 12:35:32.691971 140288127334144 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 305, in init
E0511 12:35:32.692037 140288127334144 base_runner.py:251] fetch, allow_tensor=True, allow_operation=True))
E0511 12:35:32.692105 140288127334144 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3505, in as_graph_element
E0511 12:35:32.692166 140288127334144 base_runner.py:251] return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
E0511 12:35:32.692225 140288127334144 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3589, in _as_graph_element_locked
E0511 12:35:32.692306 140288127334144 base_runner.py:251] raise ValueError("Operation %s is not an element of this graph." % obj)
E0511 12:35:32.692363 140288127334144 base_runner.py:251] ValueError: Operation name: "group_deps"
E0511 12:35:32.692418 140288127334144 base_runner.py:251] op: "NoOp"
E0511 12:35:32.692476 140288127334144 base_runner.py:251] input: "^InfeedQueue/enqueue/0"
E0511 12:35:32.692533 140288127334144 base_runner.py:251] input: "^InfeedQueue/enqueue/1"
E0511 12:35:32.692584 140288127334144 base_runner.py:251] input: "^InfeedQueue/enqueue/2"
E0511 12:35:32.692636 140288127334144 base_runner.py:251] input: "^InfeedQueue/enqueue/3"
E0511 12:35:32.692690 140288127334144 base_runner.py:251] input: "^InfeedQueue/enqueue/4"
E0511 12:35:32.692745 140288127334144 base_runner.py:251] input: "^InfeedQueue/enqueue/5"
E0511 12:35:32.692799 140288127334144 base_runner.py:251] input: "^InfeedQueue/enqueue/6"
E0511 12:35:32.692854 140288127334144 base_runner.py:251] input: "^InfeedQueue/enqueue/7"
E0511 12:35:32.692908 140288127334144 base_runner.py:251] device: "/task:0/device:CPU:0"
E0511 12:35:32.692963 140288127334144 base_runner.py:251] is not an element of this graph.
E0511 12:35:32.693018 140288127334144 base_runner.py:251]
E0511 12:35:32.693072 140288127334144 base_runner.py:251] During handling of the above exception, another exception occurred:
E0511 12:35:32.693134 140288127334144 base_runner.py:251]
E0511 12:35:32.693192 140288127334144 base_runner.py:251] Traceback (most recent call last):
E0511 12:35:32.693247 140288127334144 base_runner.py:251] File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/base_runner.py", line 193, in _RunLoop
E0511 12:35:32.693302 140288127334144 base_runner.py:251] loop_func(*loop_args)
E0511 12:35:32.693353 140288127334144 base_runner.py:251] File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 807, in _LoopEnqueue
E0511 12:35:32.693425 140288127334144 base_runner.py:251] return super(TrainerTpu, self)._LoopEnqueue(op, sess)
E0511 12:35:32.693485 140288127334144 base_runner.py:251] File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/base_runner.py", line 332, in _LoopEnqueue
E0511 12:35:32.693537 140288127334144 base_runner.py:251] sess.run([op])
E0511 12:35:32.693594 140288127334144 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 960, in run
E0511 12:35:32.693650 140288127334144 base_runner.py:251] run_metadata_ptr)
E0511 12:35:32.693708 140288127334144 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1168, in _run
E0511 12:35:32.693763 140288127334144 base_runner.py:251] self._graph, fetches, feed_dict_tensor, feed_handles=feed_handles)
E0511 12:35:32.693818 140288127334144 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 477, in init
E0511 12:35:32.693869 140288127334144 base_runner.py:251] self._fetch_mapper = _FetchMapper.for_fetch(fetches)
E0511 12:35:32.693921 140288127334144 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 266, in for_fetch
E0511 12:35:32.693982 140288127334144 base_runner.py:251] return _ListFetchMapper(fetch)
E0511 12:35:32.694037 140288127334144 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 378, in init
E0511 12:35:32.694100 140288127334144 base_runner.py:251] self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
E0511 12:35:32.694156 140288127334144 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 378, in
E0511 12:35:36.084203 140288135726848 base_runner.py:251] Traceback (most recent call last):
E0511 12:35:36.084291 140288135726848 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
E0511 12:35:36.084357 140288135726848 base_runner.py:251] return fn(*args)
E0511 12:35:36.084417 140288135726848 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
E0511 12:35:36.084482 140288135726848 base_runner.py:251] target_list, run_metadata)
E0511 12:35:36.084538 140288135726848 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
E0511 12:35:36.084595 140288135726848 base_runner.py:251] run_metadata)
E0511 12:35:36.084659 140288135726848 base_runner.py:251] tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:trainer_client/replica:0/task:0:
E0511 12:35:36.084712 140288135726848 base_runner.py:251] Compilation failure: Detected unsupported operations when trying to compile graph while_body_8_const_0[] on XLA_TPU_JIT: GenericInput (No registered 'GenericInput' OpKernel for XLA_TPU_JIT devices compatible with node {{node GenericInput}}){{node GenericInput}}
E0511 12:35:36.084770 140288135726848 base_runner.py:251] [[while]]
E0511 12:35:36.084825 140288135726848 base_runner.py:251] TPU compilation failed
E0511 12:35:36.084882 140288135726848 base_runner.py:251] [[tpu_compile_succeeded_assert/_12927813258871961422/_180]]
E0511 12:35:36.084939 140288135726848 base_runner.py:251]
E0511 12:35:36.084993 140288135726848 base_runner.py:251] During handling of the above exception, another exception occurred:
E0511 12:35:36.085047 140288135726848 base_runner.py:251]
E0511 12:35:36.085102 140288135726848 base_runner.py:251] Traceback (most recent call last):
E0511 12:35:36.085156 140288135726848 base_runner.py:251] File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/base_runner.py", line 193, in _RunLoop
E0511 12:35:36.085211 140288135726848 base_runner.py:251] loop_func(*loop_args)
E0511 12:35:36.085262 140288135726848 base_runner.py:251] File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 865, in _Loop
E0511 12:35:36.085312 140288135726848 base_runner.py:251] values, outfeeds = sess.run(self._tpu_train_ops)
E0511 12:35:36.085363 140288135726848 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 960, in run
E0511 12:35:36.085444 140288135726848 base_runner.py:251] run_metadata_ptr)
E0511 12:35:36.085499 140288135726848 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1183, in _run
E0511 12:35:36.085554 140288135726848 base_runner.py:251] feed_dict_tensor, options, run_metadata)
E0511 12:35:36.085609 140288135726848 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
E0511 12:35:36.085664 140288135726848 base_runner.py:251] run_metadata)
E0511 12:35:36.085719 140288135726848 base_runner.py:251] File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
E0511 12:35:36.085774 140288135726848 base_runner.py:251] raise type(e)(node_def, op, message)
E0511 12:35:36.085831 140288135726848 base_runner.py:251] tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:trainer_client/replica:0/task:0:
E0511 12:35:36.085887 140288135726848 base_runner.py:251] Compilation failure: Detected unsupported operations when trying to compile graph while_body_8_const_0[] on XLA_TPU_JIT: GenericInput (No registered 'GenericInput' OpKernel for XLA_TPU_JIT devices compatible with node {{node GenericInput}}){{node GenericInput}}
E0511 12:35:36.085943 140288135726848 base_runner.py:251] [[while]]
E0511 12:35:36.085999 140288135726848 base_runner.py:251] TPU compilation failed
E0511 12:35:36.086055 140288135726848 base_runner.py:251] [[tpu_compile_succeeded_assert/_12927813258871961422/_180]]
E0511 12:35:36.086110 140288135726848 base_runner.py:251]
```
My colleagues tell me that adding tf.disable_v2_behavior() in lingvo/trainer.py, as the first thing in main(), addresses this problem. They'll work to get this upstreamed, but you might have to hack that in for now.
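In case it helps, here is a minimal sketch of that workaround, assuming a TF1-style main() in lingvo/trainer.py (flag handling and the rest of the trainer body are omitted, and the exact placement may differ in your checkout):

```python
# Sketch of the suggested workaround: disable TF2 behavior before any graph is built.
import tensorflow.compat.v1 as tf

def main(argv):
  tf.disable_v2_behavior()  # must run before building graphs or sessions
  # ... the existing trainer main() body goes here ...
```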
Thanks for the reply @vrv.
I added tf.compat.v1.disable_v2_behavior(), which has stopped the error above.
I also had to change
SUPPORTED_SPLIT_SIZE = {
    1: [1, 1, 1, 1],
    ....
to
SUPPORTED_SPLIT_SIZE = {
    1: [1, 1, 1],
    ....
to prevent another error about an incorrect array shape.
However, it now seems to print the output below and then hang.
If I use capture_tpu_profile, I just get 'TPU type: TPU v3' and 'Utilization of TPU Matrix Units (higher is better): 0.000%'.
```
......
fc0/fc0/key_seq/bias/b/var/Adam_1', b'starnet/feat/validated_validated_fc0/validated_fc0/fc0/key_seq/bias/b/var/ExponentialMovingAverage', b'starnet/feat/validated_validated_fc0/validated_fc0/fc0/key_seq/linear/w/var', b'starnet/feat/validated_validated_fc0/validated_fc0/fc0/key_seq/linear/w/var/Adam', b'starnet/feat/validated_validated_fc0/validated_fc0/fc0/key_seq/linear/w/var/Adam_1', b'starnet/feat/validated_validated_fc0/validated_fc0/fc0/key_seq/linear/w/var/ExponentialMovingAverage', b'starnet/localization_regressor/w/var', b'starnet/localization_regressor/w/var/Adam', b'starnet/localization_regressor/w/var/Adam_1', b'starnet/localization_regressor/w/var/ExponentialMovingAverage']
I0512 20:07:16.845308 140021715203840 checkpointer.py:163] Initialized all vars.
I0512 20:07:20.999117 140019060754176 checkpointer.py:212] Initializing global step
I0512 20:07:21.992286 140019060754176 base_runner.py:291] params.train.max_steps: 742317, enqueue_max_steps: -1
I0512 20:07:22.472938 140019060754176 base_runner.py:305] Current global_enqueue_steps: 0, local_enqueue_steps: 0, global_step: 0
```
The parameters I am using are the following:
bazel-bin/lingvo/trainer --logtostderr \
--model=car.waymo.StarNetVehicle --mode=sync --logdir=gs://waymo-challange/logs/starnet.waymo.3d_object.v4 \
--tpu=grpc://10.240.1.2:8470 --tpu_compatible=True --worker_tpus=1 --checkpoint_in_trainer_tpu=True --saver_keep_checkpoint_every_n_hours=0.1
Any other pointers would be much appreciated. In the meantime, I will look into the solution you proposed on the other issue I raised about using the GPU. Thanks
Hm, that's weird, and sorry about the change to the split sizes. I think a bug was introduced there recently that my colleagues will look into.
As for the current problem, it's pretty difficult to debug without any helpful logs, so for now we might have to make a best guess. The first obvious thing to check is: https://github.com/tensorflow/lingvo/blob/cfafec0d723f0ce7e6ce8d0c1e1f4589cb5a077f/lingvo/tasks/car/params/waymo.py#L39
Have you made sure to set that to the path of the Waymo training data you are using?
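For illustration only (the actual constant lives at the linked line in lingvo/tasks/car/params/waymo.py and may have a different name there), it should point at your preprocessed Waymo tfrecords, something like:

```python
# Hypothetical name and path; edit the real constant at the linked line in
# lingvo/tasks/car/params/waymo.py to point at your preprocessed Waymo data.
_WAYMO_DATA_BASE = 'gs://{BUCKET}/waymo/tfrecords'
```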
Also, apparently there are some helpful logs on the "Stackdriver Logs" page, reachable from the Compute > TPUs > details page under "Logs"; that might provide some more information about what's going on. Let us know if that helps!