HybridBackend

Op type not registered 'HbGetNcclId' in binary

Open ZhuYuJin opened this issue 1 year ago • 0 comments

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
    self._extend_graph()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Op type not registered 'HbGetNcclId' in binary running on deeprec_with_io-hb. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed. while building NodeDef 'collective_id_rpc_broadcast/replicas/0/collective_id_rpc_broadcast/replicas/0/GetCollectiveId'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "clk_model/clk_model_v4_gpu_debug.py", line 586, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "clk_model/clk_model_v4_gpu_debug.py", line 572, in main
    model.run()
  File "clk_model/clk_model_v4_gpu_debug.py", line 498, in run
    classifier.train_and_evaluate(train_spec, eval_spec)
  File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/estimator/estimator.py", line 334, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 646, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 652, in run_chief
    return self._start_distributed_training()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 829, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/estimator/estimator.py", line 211, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 373, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1229, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1554, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/training/session.py", line 129, in HybridBackendMonitoredTrainingSession
    sess = fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 660, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/hybridbackend/tensorflow/training/session.py", line 64, in __init__
    session_creator, hooks, should_recover=True, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 802, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1290, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1295, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 955, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 724, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 306, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Op type not registered 'HbGetNcclId' in binary running on deeprec_with_io-hb. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed. while building NodeDef 'collective_id_rpc_broadcast/replicas/0/collective_id_rpc_broadcast/replicas/0/GetCollectiveId'
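
The error says the process that builds the graph has no registration for 'HbGetNcclId'. A quick way to narrow this down might be to pull the op name out of the message and check the op registry of the process in question. The sketch below is only a diagnostic aid, not a fix. The helper names `missing_op_from_error` and `is_op_registered` are hypothetical, and the registry lookup assumes the TF 1.x internal module `op_def_registry` with `get_registered_ops()`, plus the assumption that importing `hybridbackend.tensorflow` loads the library that registers the Hb* ops.

```python
import re


def missing_op_from_error(message):
    """Return the op type named in an 'Op type not registered' error, or None."""
    match = re.search(r"Op type not registered '([^']+)'", message)
    return match.group(1) if match else None


def is_op_registered(op_name):
    """Check this process's op registry (assumes TF 1.x internals)."""
    # Assumption: importing hybridbackend.tensorflow loads the shared library
    # that registers the Hb* ops, so import it before querying the registry.
    import hybridbackend.tensorflow  # noqa: F401
    from tensorflow.python.framework import op_def_registry

    # Assumption: get_registered_ops() returns a dict keyed by op name (TF 1.x).
    return op_name in op_def_registry.get_registered_ops()


error = (
    "Op type not registered 'HbGetNcclId' in binary running on "
    "deeprec_with_io-hb."
)
op_name = missing_op_from_error(error)
print(op_name)  # prints: HbGetNcclId
```

Calling `is_op_registered(op_name)` on each task of the cluster would show whether the op is missing everywhere or only in some processes. It needs TensorFlow and HybridBackend installed, so it is left uncalled here.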

ZhuYuJin, Mar 28 '24 02:03