Does in-place update of hidden states improve inference speed on GPU?

Open leod opened this issue 6 years ago • 0 comments

I see that lingvo uses inplace_ops.alias_inplace_update for updating states when decoding with Transformer models on TPU: https://github.com/tensorflow/lingvo/blob/e09558f5f01d17e59f5bac54313363817ed6f0a5/lingvo/core/attention.py#L1440 I understand that this is done because of the static shape requirements of TPU compilation.

On GPU, it looks like tf.concat is used instead. I've been wondering whether the in-place approach might also improve inference speed on GPU, assuming that tf.concat has to copy the whole accumulated state at each time step. Does anyone have any insights on this?
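For illustration, the difference between the two update strategies can be sketched in NumPy (shapes and function names here are hypothetical, not lingvo's code): the concat variant reallocates and copies the whole accumulated state every step, while the in-place variant writes one slot of a statically shaped, preallocated buffer.

```python
import numpy as np

max_len, batch, dim = 8, 2, 4

# Concat-style update: the state tensor grows each step, so every
# step allocates a fresh array and copies all earlier steps into it
# (analogous to the tf.concat path used on GPU).
def concat_step(state, new):
    return np.concatenate([state, new[None]], axis=0)

# In-place-style update: write step t into a preallocated buffer of
# static shape [max_len, batch, dim], touching only one slot
# (analogous to inplace_ops.alias_inplace_update on TPU).
def inplace_step(buf, t, new):
    buf[t] = new  # no copy of the earlier steps
    return buf

state = np.zeros((0, batch, dim))           # grows: [t, batch, dim]
buf = np.zeros((max_len, batch, dim))        # static shape
for t in range(max_len):
    new = np.full((batch, dim), float(t))    # stand-in for a new hidden state
    state = concat_step(state, new)
    buf = inplace_step(buf, t, new)

# Both strategies end up with the same contents; they differ only in
# how much copying happens per step.
assert np.allclose(state, buf)
```

The concat variant does O(t) copying at step t (O(T^2) over a length-T decode), whereas the in-place variant does O(1) writes per step, which is the intuition behind the question above.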

leod avatar Sep 11 '19 19:09 leod