
Waymo car starnet task training on TPU error No registered 'GenericInput' OpKernel for XLA_TPU_JIT devices compatible with node {{node GenericInput}}){{node GenericInput}}

Status: Open · JWHennessey opened this issue on May 11, 2020 · 4 comments

I am running into an error when trying to train the Waymo car StarNet task on a TPU. It seems that a custom op (`GenericInput`) has perhaps not been compiled for the TPU.

Any suggestions on how to resolve it?

Setup

  • On GCP I have a VM running Ubuntu 18.04 with a TPU v3-8
  • Docker image with base image tensorflow/tensorflow:2.1.0

Once the Docker image is built, I verify the TPU is available with `tf.distribute.cluster_resolver.TPUClusterResolver('grpc://XXXXXXXX:8470')`, then run:

```
bazel build -c opt //lingvo:trainer
bazel-bin/lingvo/trainer --logtostderr --model=car.waymo.StarNetVehicle --mode=async \
    --logdir=gs://{BUCKET}/logs/starnet.waymo.3d_object.v1 \
    --tpu=grpc://XXXXXXXX:8470 --run_locally=tpu --xla_device=tpu --tpu_compatible=true --worker_tpus=1
```

This results in the following error:

```
I0511 12:35:08.059724 140301230589760 base_model.py:528] BaseTask.AdjustGradients
I0511 12:35:10.348839 140301230589760 trainer.py:672] Trainer number of enqueue ops: 1
I0511 12:35:15.939270 140301230589760 trainer.py:1620] Starting runners
I0511 12:35:16.471422 140288135726848 base_runner.py:192] trainer started.
I0511 12:35:16.471785 140301230589760 trainer.py:1629] Total num runner.enqueue_ops: 1
I0511 12:35:16.472355 140301230589760 trainer.py:1633] Starting enqueue op group_deps
I0511 12:35:16.473102 140301230589760 trainer.py:1641] Waiting for runners to finish...
I0511 12:35:16.473241 140301230589760 trainer.py:1643] Waiting for thread to finish: <__main__.TrainerTpu object at 0x7f99ac0e6b00>
I0511 12:35:16.473450 140288127334144 base_runner.py:192] trainer/enqueue_op/group_deps started.
I0511 12:35:26.430277 140288135726848 trainer.py:832] TrainerTpu: Force restore or initialize.
I0511 12:35:26.850680 140288135726848 checkpointer.py:146] Uninitialized var list: []
I0511 12:35:28.736465 140288135726848 checkpointer.py:163] Initialized all vars.
I0511 12:35:31.464954 140288127334144 checkpointer.py:212] Initializing global step
I0511 12:35:32.240896 140288127334144 base_runner.py:291] params.train.max_steps: 742317, enqueue_max_steps: -1
I0511 12:35:32.638819 140288127334144 base_runner.py:305] Current global_enqueue_steps: 0, local_enqueue_steps: 0, global_step: 0
E0511 12:35:32.689308 140288127334144 base_runner.py:244] trainer/enqueue_op/group_deps done (fatal error): <class 'ValueError'>
I0511 12:35:32.689540 140288127334144 base_runner.py:111] trainer/enqueue_op/group_deps exception: Fetch argument <tf.Operation 'group_deps' type=NoOp> cannot be interpreted as a Tensor.
(Operation name: "group_deps" op: "NoOp" input: "^InfeedQueue/enqueue/0" input: "^InfeedQueue/enqueue/1" input: "^InfeedQueue/enqueue/2" input: "^InfeedQueue/enqueue/3" input: "^InfeedQueue/enqueue/4" input: "^InfeedQueue/enqueue/5" input: "^InfeedQueue/enqueue/6" input: "^InfeedQueue/enqueue/7" device: "/task:0/device:CPU:0" is not an element of this graph.)

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 305, in __init__
    fetch, allow_tensor=True, allow_operation=True))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3505, in as_graph_element
    return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3589, in _as_graph_element_locked
    raise ValueError("Operation %s is not an element of this graph." % obj)
ValueError: Operation name: "group_deps"
op: "NoOp"
input: "^InfeedQueue/enqueue/0"
input: "^InfeedQueue/enqueue/1"
input: "^InfeedQueue/enqueue/2"
input: "^InfeedQueue/enqueue/3"
input: "^InfeedQueue/enqueue/4"
input: "^InfeedQueue/enqueue/5"
input: "^InfeedQueue/enqueue/6"
input: "^InfeedQueue/enqueue/7"
device: "/task:0/device:CPU:0"
 is not an element of this graph.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/base_runner.py", line 193, in _RunLoop
    loop_func(*loop_args)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 807, in _LoopEnqueue
    return super(TrainerTpu, self)._LoopEnqueue(op, sess)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/base_runner.py", line 332, in _LoopEnqueue
    sess.run([op])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 960, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1168, in _run
    self._graph, fetches, feed_dict_tensor, feed_handles=feed_handles)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 477, in __init__
    self._fetch_mapper = _FetchMapper.for_fetch(fetches)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 266, in for_fetch
    return _ListFetchMapper(fetch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 378, in __init__
    self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 378, in <listcomp>
    self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 276, in for_fetch
    return _ElementFetchMapper(fetches, contraction_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 312, in __init__
    'Tensor. (%s)' % (fetch, str(e)))
ValueError: Fetch argument <tf.Operation 'group_deps' type=NoOp> cannot be interpreted as a Tensor.
(Operation name: "group_deps"
op: "NoOp"
input: "^InfeedQueue/enqueue/0"
input: "^InfeedQueue/enqueue/1"
input: "^InfeedQueue/enqueue/2"
input: "^InfeedQueue/enqueue/3"
input: "^InfeedQueue/enqueue/4"
input: "^InfeedQueue/enqueue/5"
input: "^InfeedQueue/enqueue/6"
input: "^InfeedQueue/enqueue/7"
device: "/task:0/device:CPU:0"
 is not an element of this graph.)

E0511 12:35:36.083574 140288135726848 base_runner.py:244] trainer done (fatal error): <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>
I0511 12:35:36.083787 140288135726848 base_runner.py:111] trainer exception: From /job:trainer_client/replica:0/task:0:
Compilation failure: Detected unsupported operations when trying to compile graph while_body_8_const_0[] on XLA_TPU_JIT: GenericInput (No registered 'GenericInput' OpKernel for XLA_TPU_JIT devices compatible with node {{node GenericInput}}){{node GenericInput}}
	 [[while]]
TPU compilation failed
	 [[tpu_compile_succeeded_assert/_12927813258871961422/_180]]

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:trainer_client/replica:0/task:0:
Compilation failure: Detected unsupported operations when trying to compile graph while_body_8_const_0[] on XLA_TPU_JIT: GenericInput (No registered 'GenericInput' OpKernel for XLA_TPU_JIT devices compatible with node {{node GenericInput}}){{node GenericInput}}
	 [[while]]
TPU compilation failed
	 [[tpu_compile_succeeded_assert/_12927813258871961422/_180]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/base_runner.py", line 193, in _RunLoop
    loop_func(*loop_args)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 865, in _Loop
    values, outfeeds = sess.run(self._tpu_train_ops)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 960, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1183, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:trainer_client/replica:0/task:0:
Compilation failure: Detected unsupported operations when trying to compile graph while_body_8_const_0[] on XLA_TPU_JIT: GenericInput (No registered 'GenericInput' OpKernel for XLA_TPU_JIT devices compatible with node {{node GenericInput}}){{node GenericInput}}
	 [[while]]
TPU compilation failed
	 [[tpu_compile_succeeded_assert/_12927813258871961422/_180]]
```

— JWHennessey, May 11, 2020

My colleagues tell me that adding `tf.disable_v2_behavior()` in `lingvo/trainer.py`, as the first thing in `main()`, addresses this problem -- they'll work on getting this upstreamed, but you might have to hack it in for now.
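For reference, a minimal sketch of that workaround (assuming the `compat.v1` shim that ships with TF 2.x, such as the `tensorflow/tensorflow:2.1.0` base image; the import is guarded so the snippet also runs where TF is absent):

```python
# Sketch: force the TF1 graph-mode runtime before Lingvo builds its graph.
# The graph-mode trainer and custom ops such as GenericInput expect v1
# semantics, so this must run before any graph or session construction.
try:
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()  # call this first, before building the model
    eager = tf.executing_eagerly()  # False once v2 behavior is disabled
except ImportError:
    # TensorFlow not installed in this environment.
    eager = False

print(eager)
```

In `lingvo/trainer.py` the call would go at the very top of `main()`, ahead of any model, graph, or session setup.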

— vrv, May 12, 2020

Thanks for the reply @vrv.

I added `tf.compat.v1.disable_v2_behavior()`, which has stopped the error above.

I also had to change

```
SUPPORTED_SPLIT_SIZE = {
    1: [1, 1, 1, 1],
....
```

to

```
SUPPORTED_SPLIT_SIZE = {
    1: [1, 1, 1],
....
```

to prevent another error about an incorrect array shape.
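To illustrate the kind of mismatch involved (the names and semantics below are my assumptions for illustration, not Lingvo's actual logic): a table like `SUPPORTED_SPLIT_SIZE` maps a split count to a per-dimension assignment list, and if the consuming code expects a 3-element list, a 4-element entry trips a shape check:

```python
# Hypothetical illustration: the table entry must have exactly as many
# elements as the consuming code expects, or a shape error results.
SUPPORTED_SPLIT_SIZE = {
    1: [1, 1, 1],  # after the edit above: 3 entries instead of 4
}

def lookup_split(num_splits, expected_len):
    """Fetch the split assignment and validate its length."""
    sizes = SUPPORTED_SPLIT_SIZE[num_splits]
    if len(sizes) != expected_len:
        raise ValueError(
            f'entry for {num_splits} has {len(sizes)} elements, '
            f'expected {expected_len}')
    return sizes

print(lookup_split(1, 3))
```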

However, it now prints the output below and then hangs.

If I use `capture_tpu_profile`, I just get: `TPU type: TPU v3  Utilization of TPU Matrix Units (higher is better): 0.000%`

```
......
fc0/fc0/key_seq/bias/b/var/Adam_1', b'starnet/feat/validated_validated_fc0/validated_fc0/fc0/key_seq/bias/b/var/ExponentialMovingAverage', b'starnet/feat/validated_validated_fc0/validated_fc0/fc0/key_seq/linear/w/var', b'starnet/feat/validated_validated_fc0/validated_fc0/fc0/key_seq/linear/w/var/Adam', b'starnet/feat/validated_validated_fc0/validated_fc0/fc0/key_seq/linear/w/var/Adam_1', b'starnet/feat/validated_validated_fc0/validated_fc0/fc0/key_seq/linear/w/var/ExponentialMovingAverage', b'starnet/localization_regressor/w/var', b'starnet/localization_regressor/w/var/Adam', b'starnet/localization_regressor/w/var/Adam_1', b'starnet/localization_regressor/w/var/ExponentialMovingAverage']
I0512 20:07:16.845308 140021715203840 checkpointer.py:163] Initialized all vars.
I0512 20:07:20.999117 140019060754176 checkpointer.py:212] Initializing global step
I0512 20:07:21.992286 140019060754176 base_runner.py:291] params.train.max_steps: 742317, enqueue_max_steps: -1
I0512 20:07:22.472938 140019060754176 base_runner.py:305] Current global_enqueue_steps: 0, local_enqueue_steps: 0, global_step: 0
```

The parameters I am using are the following:

```
bazel-bin/lingvo/trainer --logtostderr \
    --model=car.waymo.StarNetVehicle --mode=sync --logdir=gs://waymo-challange/logs/starnet.waymo.3d_object.v4 \
    --tpu=grpc://10.240.1.2:8470 --tpu_compatible=True --worker_tpus=1 --checkpoint_in_trainer_tpu=True --saver_keep_checkpoint_every_n_hours=0.1
```

Any other pointers would be much appreciated. In the meantime, I will look into the solution you proposed on the other issue I raised about using the GPU. Thanks!

— JWHennessey, May 12, 2020

Hm, that's weird -- and sorry about the change to the split sizes; I think a bug was introduced there recently that my colleagues will look into.

As for the current problem, it's pretty difficult to debug without any helpful logs, so for now we'll have to make a best guess. The first obvious thing to check is: https://github.com/tensorflow/lingvo/blob/cfafec0d723f0ce7e6ce8d0c1e1f4589cb5a077f/lingvo/tasks/car/params/waymo.py#L39

Have you made sure to set that to the path of the Waymo training data you are using?
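As a quick sanity check before launching the trainer, one can verify the configured base path resolves to the expected split locations. This is a hypothetical helper -- `WAYMO_DIR`, the env-var fallback, and the file layout are assumptions, not the actual constant in `lingvo/tasks/car/params/waymo.py`:

```python
import os

# Hypothetical stand-in for the base-path constant that the Waymo input
# params join each dataset split onto.
WAYMO_DIR = os.environ.get('WAYMO_DIR', 'gs://my-bucket/waymo/tfrecords')

def split_pattern(split):
    """Build the file pattern for a named dataset split."""
    return os.path.join(WAYMO_DIR, split, '*.tfrecord')

print(split_pattern('train'))
```

If the printed pattern does not match where your preprocessed Waymo tfrecords actually live, the input pipeline will sit idle waiting for data, which would look exactly like a hang.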

— vrv, May 12, 2020

Also, apparently there are some helpful logs on the Stackdriver Logs page, reachable from the Compute > TPUs details page under "Logs" -- that might provide more information about what's going on. Let us know if that helps!

— vrv, May 12, 2020