
Attempt to add scheduled sampling in ComputePredictionFunctional failed

Open boji123 opened this issue 6 years ago • 11 comments

https://github.com/tensorflow/lingvo/blob/2d05484a7d5d73db23f8a4b47d6d729b5e01fa6a/lingvo/tasks/asr/decoder.py#L1059

I found that ComputePredictionDynamic is too slow for scheduled sampling, so I tried to add scheduled sampling in the cell function RnnStep but failed. Following is my code:

      def ScheduledSampling(state0, inputs):
        """Mixes ground-truth inputs with the previous step's predictions."""
        pick_groundtruth = tf.less(
            tf.random_uniform([dec_bs], seed=p.random_seed),
            state0.misc_states.groundtruth_p)
        emb_ids = tf.stop_gradient(state0.misc_states.prev_predicted_ids)
        curr_emb = self.emb.EmbLookupDefaultTheta(emb_ids)
        target_emb = tf.where(pick_groundtruth, inputs.emb, curr_emb)
        # inputs.id outside is int32, but inside is int64.
        target_id = tf.where(
            pick_groundtruth, inputs.id,
            tf.cast(state0.misc_states.prev_predicted_ids, inputs.id.dtype))
        return py_utils.NestedMap(
            id=target_id,
            label=inputs.label,
            weight=inputs.weight,
            emb=target_emb,
            padding=inputs.padding,
            misc=inputs.misc)

      def RnnStep(recurrent_theta, state0, inputs):
        """Computes one rnn step."""
        # Hard-coded sampling probability for this experiment.
        self._max_label_prob = 0.1
        theta = recurrent_theta.theta
        packed_src = recurrent_theta.packed_src
        # Use different id and embedding for scheduled sampling.
        if self._max_label_prob > 0:
          inputs = ScheduledSampling(state0, inputs)

        with tf.name_scope('single_decode_step'):
          step_outs, state1 = self.SingleDecodeStep(
              theta,
              packed_src,
              inputs,
              state0,
              use_deterministic_random=True)
          state1.step_outs = step_outs

        if self._max_label_prob > 0:
          # Compute logits so the next step can sample from the prediction.
          logits = self.softmax.Logits(theta.softmax, [step_outs])
          state1 = self.PostStepDecoderStateUpdate(state1, logits)
        else:
          state1 = self.PostStepDecoderStateUpdate(state1, inputs.label)

        return state1, py_utils.NestedMap()

The program failed at this step:

curr_emb = self.emb.EmbLookupDefaultTheta(emb_ids)

Following is the error info:

I0625 13:29:42.561038 140512958846720 base_runner.py:236] trainer done (fatal error).
I0625 13:29:42.561534 140512958846720 base_runner.py:115] trainer exception: Combined status information from 5 operations:

Status code: Cancelled [2x]

         [[{{node While}}]]
         [[fprop/Cheji/tower_0_2/enc/Forward_M033FFondj4_3]] [1x]
  
         [[{{node While}}]]
         [[fprop/Cheji/tower_0_3/enc/Forward_sqqax2No8xE_3]] [1x]
Status code: Not found [3x]
  No registered 'DynamicPartition' OpKernel for GPU devices compatible with node {{node ForwardLoopBody_IT6gRuK6Trc/Fwd_yuDjrWf0kAk/embedding_lookup/DynamicPartition}}
         (OpKernel was found, but attributes didn't match) Requested Attributes: T=DT_INT32, num_partitions=8, _device="/job:local/replica:0/task:0/device:GPU:1"
        .  Registered:  device='CPU'; T in [DT_VARIANT]
    device='CPU'; T in [DT_RESOURCE]
    device='CPU'; T in [DT_STRING]
    device='CPU'; T in [DT_BOOL]
    device='CPU'; T in [DT_COMPLEX128]
    device='CPU'; T in [DT_COMPLEX64]
    device='CPU'; T in [DT_DOUBLE]
    device='CPU'; T in [DT_FLOAT]
    device='CPU'; T in [DT_BFLOAT16]
    device='CPU'; T in [DT_HALF]
    device='CPU'; T in [DT_INT8]
    device='CPU'; T in [DT_UINT8]
    device='CPU'; T in [DT_INT16]
    device='CPU'; T in [DT_UINT16]
    device='CPU'; T in [DT_INT32]
    device='CPU'; T in [DT_INT64]
    device='GPU'; T in [DT_COMPLEX128]
    device='GPU'; T in [DT_COMPLEX64]
    device='GPU'; T in [DT_DOUBLE]
    device='GPU'; T in [DT_FLOAT]
    device='GPU'; T in [DT_HALF]
  
         [[ForwardLoopBody_IT6gRuK6Trc/Fwd_yuDjrWf0kAk/embedding_lookup/DynamicPartition]]
         [[While]]
         [[ArithmeticOptimizer/AddOpsRewrite_add_31_G1154]] [1x]

It seems that the attribute is changed in the RNN step. Any solution? Really appreciated.

boji123 avatar Jun 25 '19 05:06 boji123

As the error says it seems the EmbLookupDefaultTheta cannot be made with int32 dtype. Try casting emb_ids to float32?

jonathanasdf avatar Jun 25 '19 19:06 jonathanasdf

The function tf.nn.embedding_lookup (https://tensorflow.google.cn/api_docs/python/tf/nn/embedding_lookup), which is used in

curr_emb = self.emb.EmbLookupDefaultTheta(emb_ids)

requires int32 or int64, and both types were tried but failed.

boji123 avatar Jun 26 '19 01:06 boji123

Hmm...

To check if EmbLookupDefaultTheta is actually the problem, can you replace that line with curr_emb = tf.zeros(expected_size) and see if everything runs fine?

jonathanasdf avatar Jun 26 '19 01:06 jonathanasdf
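For concreteness, a minimal sketch of that substitution, assuming the decoder batch size dec_bs is in scope and using emb_dim as a placeholder for the embedding dimension (both names are assumptions about the surrounding code):

        # Debug-only placeholder with the same shape/dtype as the real lookup,
        # so the rest of RnnStep runs unchanged while the lookup is taken out.
        curr_emb = tf.zeros([dec_bs, emb_dim], dtype=inputs.emb.dtype)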

I think it's about the device. I can run this function on CPU, but it fails on GPU.

boji123 avatar Jun 26 '19 01:06 boji123

Yes, but from the message it seems to be a problem with embedding_lookup, and tf.nn.embedding_lookup should be supported on GPU.

jonathanasdf avatar Jun 26 '19 01:06 jonathanasdf

This is the way I call the function. I need to set allow_implicit_capture=True or another assertion error is raised; maybe the question is about the mechanism of the core function recurrent.Recurrent:

      accumulated_states, _ = recurrent.Recurrent(
          recurrent_theta, state0_no_fusion, inputs, RnnStep, allow_implicit_capture=True)

boji123 avatar Jun 26 '19 01:06 boji123

After replacing that line as you said, I can run the function successfully.

Hmm... To check if EmbLookupDefaultTheta is actually the problem, can you replace that line with curr_emb = tf.zeros(expected_size) and see if everything runs fine?

boji123 avatar Jun 26 '19 01:06 boji123

I guess that when using GPU, recurrent.Recurrent sets everything in cell_fn to run on GPU, but the embedding can only run on CPU, thus there is no kernel for GPU embedding.

boji123 avatar Jun 26 '19 01:06 boji123
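One quick way to test that hypothesis, sketched here under the assumption that a device annotation inside the cell function is honoured by Recurrent's functional while-loop, is to pin the lookup to CPU and see whether the placement error goes away:

        # Hypothetical check: force the lookup onto the CPU. If the
        # kernel-placement error disappears, the problem is placement of the
        # embedding op inside Recurrent, at the cost of an extra host/device
        # copy per step.
        with tf.device('/cpu:0'):
          curr_emb = self.emb.EmbLookupDefaultTheta(emb_ids)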

Yes, that is exactly the problem, except that I thought tf.nn.embedding_lookup was supposed to work on GPU.

Otherwise, if it is not possible to use tf.nn.embedding_lookup inside Recurrent on GPU, then you will need to implement your own version of embedding lookup that does work. It should be possible using tf.gather.

jonathanasdf avatar Jun 28 '19 19:06 jonathanasdf
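A minimal sketch of a gather-based lookup, assuming the embedding table is a single dense [vocab_size, emb_dim] variable; the path theta.emb.wm in the usage comment is an assumption about the embedding layer's variable layout, not confirmed lingvo API:

    def GatherEmbLookup(emb_table, ids):
      """Looks up embedding rows with tf.gather, which has GPU kernels.

      Args:
        emb_table: [vocab_size, emb_dim] float tensor (the embedding matrix).
        ids: int32/int64 tensor of ids, e.g. shape [dec_bs].

      Returns:
        Tensor of shape ids.shape + [emb_dim] with the gathered rows.
      """
      # tf.gather avoids the DynamicPartition op that tf.nn.embedding_lookup
      # emits for (potentially sharded) tables, which is the op missing a
      # GPU int32 kernel in the error above.
      return tf.gather(emb_table, ids)

    # Hypothetical use inside RnnStep, assuming the table is theta.emb.wm:
    # curr_emb = GatherEmbLookup(theta.emb.wm, emb_ids)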

Scheduled sampling in a while loop is much slower than I expected. For example, a model that runs at 60 examples/second with the Recurrent function only reaches 24 examples/second after using scheduled sampling in a dynamic while loop.

boji123 avatar Jun 30 '19 09:06 boji123

Watching this.

zh794390558 avatar Jul 13 '19 09:07 zh794390558