2 issues of zrphercule

Summary: As we may automatically switch to using nested tensors, we need further support for this in torchtext, especially for return_all_layers. Reviewed By: mikekgfb, parmeet. Differential Revision: D36213184

Labels: cla signed, fb-exported

Thanks to @842974287's implementation. Adds head_dim > 1024 support for fp16 in add_QKV_bias_rebuild_padding and add_bias_input_layernorm. Regarding the comments in https://github.com/842974287/FasterTransformer/commit/dacb3ceed52d6cdb59f10adc6fa02f615da9084a: 1. When word_per_block != 1, dim3 grid(m * half_k / block.x...