No OpKernel was registered to support Op 'HbSparseSegmentMeanGrad1' used by node

Open karterotte opened this issue 2 years ago • 0 comments

Current behavior

I'm using the Docker image "alideeprec/deeprec-release:deeprec2306-gpu-py38-cu116-ubuntu20.04-hybridbackend" and hit an error during training. This is the log:

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2023-08-23 15:35:03.977041: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900000000 Hz
2023-08-23 15:35:03.986474: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xd8fa950 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-08-23 15:35:03.986505: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-08-23 15:35:03.989147: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcuda.so.1
2023-08-23 15:35:04.000592: E tensorflow/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2023-08-23 15:35:04.000614: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
INFO:tensorflow:run without loading checkpoint
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1374, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1357, in _run_fn
    self._extend_graph()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1397, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'HbSparseSegmentMeanGrad1' used by {{node gradients/wide_deep/input_layer/bg_play_all_albums_last_15d___deviceId___device__embedding/bg_play_all_albums_last_15d___deviceId___device__embedding_weights/embedding_lookup_sparse_grad/SparseSegmentMeanGrad}}with these attrs: [Tidx=DT_INT64, Tsegmentids=DT_INT32, T=DT_FLOAT, N=1]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
  device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]
  device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
  device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]

	 [[gradients/wide_deep/input_layer/bg_play_all_albums_last_15d___deviceId___device__embedding/bg_play_all_albums_last_15d___deviceId___device__embedding_weights/embedding_lookup_sparse_grad/SparseSegmentMeanGrad]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 22, in <module>
    do_train(UserModel, user_params)
  File "/var/workspace/utils/job.py", line 100, in do_train
    hb.estimator.train_and_evaluate(
  File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/estimator/estimator.py", line 553, in train_and_evaluate
    return estimator.train_and_evaluate(train_spec, eval_spec, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/estimator/estimator.py", line 336, in train_and_evaluate
    return self.train(
  File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/estimator/estimator.py", line 209, in train
    return super().train(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1174, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1206, in _train_model_default
    return self._train_with_estimator_spec(estimator_spec, worker_hooks,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1491, in _train_with_estimator_spec
    with training.MonitoredTrainingSession(
  File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/training/session.py", line 129, in HybridBackendMonitoredTrainingSession
    sess = fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 660, in MonitoredTrainingSession
    return MonitoredSession(
  File "/usr/local/lib/python3.8/dist-packages/hybridbackend/tensorflow/training/session.py", line 63, in __init__
    super(cls, self).__init__(  # pylint: disable=bad-super-call
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 805, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1287, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1292, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 958, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 718, in create_session
    return self._get_session_manager().prepare_session(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 306, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 964, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1188, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1367, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1393, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'HbSparseSegmentMeanGrad1' used by node gradients/wide_deep/input_layer/bg_play_all_albums_last_15d___deviceId___device__embedding/bg_play_all_albums_last_15d___deviceId___device__embedding_weights/embedding_lookup_sparse_grad/SparseSegmentMeanGrad (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [Tidx=DT_INT64, Tsegmentids=DT_INT32, T=DT_FLOAT, N=1]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
  device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]
  device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
  device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT32]; Tsegmentids in [DT_INT32]

	 [[gradients/wide_deep/input_layer/bg_play_all_albums_last_15d___deviceId___device__embedding/bg_play_all_albums_last_15d___deviceId___device__embedding_weights/embedding_lookup_sparse_grad/SparseSegmentMeanGrad]]
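
Reading the log above, the registered devices are [CPU, XLA_CPU] while every kernel listed for the op is device='GPU', and earlier lines report that cuInit failed and /dev/nvidia0 does not exist. So it looks like the container may not see the GPU at all. Below is a minimal sketch for checking that mismatch from the error text. The helper name `kernel_device_mismatch` is hypothetical and is not part of TensorFlow or HybridBackend; it only parses the message format shown in this log.

```python
import re


def kernel_device_mismatch(error_text):
    """Parse a 'No OpKernel was registered' message.

    Returns (registered_devices, kernel_devices, has_usable_kernel).
    Hypothetical helper: it assumes the message layout seen in this issue.
    """
    # Devices the session can actually place ops on, e.g. "[CPU, XLA_CPU]".
    match = re.search(r"Registered devices: \[([^\]]*)\]", error_text)
    devices = set()
    if match:
        devices = {d.strip() for d in match.group(1).split(",") if d.strip()}

    # Devices for which the op has a kernel, e.g. "device='GPU'; ...".
    kernel_devices = set(re.findall(r"device='([^']+)'", error_text))

    # A kernel is usable only if its device is among the registered devices.
    return devices, kernel_devices, bool(devices & kernel_devices)


# Excerpt copied from the log in this issue.
sample = """Registered devices: [CPU, XLA_CPU]
Registered kernels:
  device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
  device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT64]; Tsegmentids in [DT_INT32]
"""

devices, kernel_devices, usable = kernel_device_mismatch(sample)
print(sorted(devices), sorted(kernel_devices), usable)
```

For this log the helper reports no overlap between the registered devices and the kernel devices, which would be consistent with the op having only GPU kernels while the process runs on CPU only.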

System information

  • GPU model and memory: 3090
  • OS Platform: ubuntu20.04
  • Docker version:
  • GCC/CUDA/cuDNN version: cu116
  • Python/conda version: py38
  • TensorFlow/PyTorch version: tf1.15

Yes

karterotte, Aug 23 '23 08:08