
Attempt to add scheduled sampling in ComputePredictionFunctional failed

Open boji123 opened this issue 6 years ago • 11 comments

https://github.com/tensorflow/lingvo/blob/2d05484a7d5d73db23f8a4b47d6d729b5e01fa6a/lingvo/tasks/asr/decoder.py#L1059

I found that ComputePredictionDynamic is too slow for scheduled sampling, so I tried to add scheduled sampling in the cell function RnnStep but failed. Following is my code:

      def ScheduledSampling(state0, inputs):
        """Mixes ground-truth inputs with the previous step's predictions."""
        pick_groundtruth = tf.less(
            tf.random_uniform([dec_bs], seed=p.random_seed),
            state0.misc_states.groundtruth_p)
        emb_ids = tf.stop_gradient(state0.misc_states.prev_predicted_ids)
        curr_emb = self.emb.EmbLookupDefaultTheta(emb_ids)
        target_emb = tf.where(pick_groundtruth, inputs.emb, curr_emb)
        # inputs.id outside is int32, but inside is int64.
        target_id = tf.where(
            pick_groundtruth, inputs.id,
            tf.cast(state0.misc_states.prev_predicted_ids, inputs.id.dtype))
        return py_utils.NestedMap(
            id=target_id,
            label=inputs.label,
            weight=inputs.weight,
            emb=target_emb,
            padding=inputs.padding,
            misc=inputs.misc)

      def RnnStep(recurrent_theta, state0, inputs):
        """Computes one rnn step."""
        # Hard-coded sampling probability for this experiment.
        self._max_label_prob = 0.1
        theta = recurrent_theta.theta
        packed_src = recurrent_theta.packed_src
        # Use different id and embedding for scheduled sampling.
        if self._max_label_prob > 0:
          inputs = ScheduledSampling(state0, inputs)

        with tf.name_scope('single_decode_step'):
          step_outs, state1 = self.SingleDecodeStep(
              theta,
              packed_src,
              inputs,
              state0,
              use_deterministic_random=True)
          state1.step_outs = step_outs

        if self._max_label_prob > 0:
          # Compute logits so the next step can sample from the prediction.
          logits = self.softmax.Logits(theta.softmax, [step_outs])
          state1 = self.PostStepDecoderStateUpdate(state1, logits)
        else:
          state1 = self.PostStepDecoderStateUpdate(state1, inputs.label)

        return state1, py_utils.NestedMap()

The program failed at this step:

curr_emb = self.emb.EmbLookupDefaultTheta(emb_ids)

Following is the error info:

I0625 13:29:42.561038 140512958846720 base_runner.py:236] trainer done (fatal error).
I0625 13:29:42.561534 140512958846720 base_runner.py:115] trainer exception: Combined status information from 5 operations:

Status code: Cancelled [2x]

         [[{{node While}}]]
         [[fprop/Cheji/tower_0_2/enc/Forward_M033FFondj4_3]] [1x]
  
         [[{{node While}}]]
         [[fprop/Cheji/tower_0_3/enc/Forward_sqqax2No8xE_3]] [1x]
Status code: Not found [3x]
  No registered 'DynamicPartition' OpKernel for GPU devices compatible with node {{node ForwardLoopBody_IT6gRuK6Trc/Fwd_yuDjrWf0kAk/embedding_lookup/DynamicPartition}}
         (OpKernel was found, but attributes didn't match) Requested Attributes: T=DT_INT32, num_partitions=8, _device="/job:local/replica:0/task:0/device:GPU:1"
        .  Registered:  device='CPU'; T in [DT_VARIANT]
    device='CPU'; T in [DT_RESOURCE]
    device='CPU'; T in [DT_STRING]
    device='CPU'; T in [DT_BOOL]
    device='CPU'; T in [DT_COMPLEX128]
    device='CPU'; T in [DT_COMPLEX64]
    device='CPU'; T in [DT_DOUBLE]
    device='CPU'; T in [DT_FLOAT]
    device='CPU'; T in [DT_BFLOAT16]
    device='CPU'; T in [DT_HALF]
    device='CPU'; T in [DT_INT8]
    device='CPU'; T in [DT_UINT8]
    device='CPU'; T in [DT_INT16]
    device='CPU'; T in [DT_UINT16]
    device='CPU'; T in [DT_INT32]
    device='CPU'; T in [DT_INT64]
    device='GPU'; T in [DT_COMPLEX128]
    device='GPU'; T in [DT_COMPLEX64]
    device='GPU'; T in [DT_DOUBLE]
    device='GPU'; T in [DT_FLOAT]
    device='GPU'; T in [DT_HALF]
  
         [[ForwardLoopBody_IT6gRuK6Trc/Fwd_yuDjrWf0kAk/embedding_lookup/DynamicPartition]]
         [[While]]
         [[ArithmeticOptimizer/AddOpsRewrite_add_31_G1154]] [1x]

It seems that the attribute is changed in the RNN step. Any solution? Really appreciated.

boji123 avatar Jun 25 '19 05:06 boji123

As the error says it seems the EmbLookupDefaultTheta cannot be made with int32 dtype. Try casting emb_ids to float32?

jonathanasdf avatar Jun 25 '19 19:06 jonathanasdf

The function tf.nn.embedding_lookup (https://tensorflow.google.cn/api_docs/python/tf/nn/embedding_lookup), which is used in

curr_emb = self.emb.EmbLookupDefaultTheta(emb_ids)

requires int32 or int64, and both types were tried but failed.

boji123 avatar Jun 26 '19 01:06 boji123

Hmm...

To check if EmbLookupDefaultTheta is actually the problem, can you replace that line with curr_emb = tf.zeros(expected_size) and see if everything runs fine?

jonathanasdf avatar Jun 26 '19 01:06 jonathanasdf
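For concreteness, a minimal sketch of that substitution, assuming the decoder batch size dec_bs is in scope and using emb_dim as a placeholder for the embedding dimension (both names are assumptions about the surrounding code):

        # Debug-only placeholder with the same shape/dtype as the real lookup,
        # so the rest of RnnStep runs unchanged while the lookup is taken out.
        curr_emb = tf.zeros([dec_bs, emb_dim], dtype=inputs.emb.dtype)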

I think it's about the device. I can run this function on CPU, but it fails on GPU.

boji123 avatar Jun 26 '19 01:06 boji123

Yes, but from the message it seems to be a problem with embedding_lookup, and tf.nn.embedding_lookup should be supported on GPU.

jonathanasdf avatar Jun 26 '19 01:06 jonathanasdf

This is the way I call the function. I need to set allow_implicit_capture=True or another assertion error is raised; maybe the question is about the mechanism of the core function recurrent.Recurrent:

      accumulated_states, _ = recurrent.Recurrent(
          recurrent_theta, state0_no_fusion, inputs, RnnStep, allow_implicit_capture=True)

boji123 avatar Jun 26 '19 01:06 boji123

After replacing that line as you said, I can run the function successfully.

Hmm... To check if EmbLookupDefaultTheta is actually the problem, can you replace that line with curr_emb = tf.zeros(expected_size) and see if everything runs fine?

boji123 avatar Jun 26 '19 01:06 boji123

I guess that when using GPU, recurrent.Recurrent sets everything in cell_fn to run on GPU, but the embedding can only run on CPU, thus there is no kernel for GPU embedding.

boji123 avatar Jun 26 '19 01:06 boji123
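One quick way to test that hypothesis, sketched here under the assumption that a device annotation inside the cell function is honoured by Recurrent's functional while-loop, is to pin the lookup to CPU and see whether the placement error goes away:

        # Hypothetical check: force the lookup onto the CPU. If the
        # kernel-placement error disappears, the problem is placement of the
        # embedding op inside Recurrent, at the cost of an extra host/device
        # copy per step.
        with tf.device('/cpu:0'):
          curr_emb = self.emb.EmbLookupDefaultTheta(emb_ids)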

Yes, that is exactly the problem, except that I thought tf.nn.embedding_lookup was supposed to work on GPU.

Otherwise, if it is not possible to use tf.nn.embedding_lookup inside Recurrent on GPU, then you will need to implement your own version of embedding lookup that does work. It should be possible using tf.gather.

jonathanasdf avatar Jun 28 '19 19:06 jonathanasdf
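A minimal sketch of a gather-based lookup, assuming the embedding table is a single dense [vocab_size, emb_dim] variable; the path theta.emb.wm in the usage comment is an assumption about the embedding layer's variable layout, not confirmed lingvo API:

    def GatherEmbLookup(emb_table, ids):
      """Looks up embedding rows with tf.gather, which has GPU kernels.

      Args:
        emb_table: [vocab_size, emb_dim] float tensor (the embedding matrix).
        ids: int32/int64 tensor of ids, e.g. shape [dec_bs].

      Returns:
        Tensor of shape ids.shape + [emb_dim] with the gathered rows.
      """
      # tf.gather avoids the DynamicPartition op that tf.nn.embedding_lookup
      # emits for (potentially sharded) tables, which is the op missing a
      # GPU int32 kernel in the error above.
      return tf.gather(emb_table, ids)

    # Hypothetical use inside RnnStep, assuming the table is theta.emb.wm:
    # curr_emb = GatherEmbLookup(theta.emb.wm, emb_ids)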

Scheduled sampling in a while loop is much slower than I expected. For example, a model that runs at 60 examples/second with the Recurrent function only reaches 24 examples/second after using scheduled sampling in a dynamic while loop.

boji123 avatar Jun 30 '19 09:06 boji123

Watching this.

zh794390558 avatar Jul 13 '19 09:07 zh794390558