Add support for head_dim > 1024 for fp16, no whitespace change
Thanks to @842974287's implementation. This adds support for head_dim > 1024 for fp16 in add_QKV_bias_rebuild_padding and add_bias_input_layernorm.
Regarding the comments in https://github.com/842974287/FasterTransformer/commit/dacb3ceed52d6cdb59f10adc6fa02f615da9084a:
- When word_per_block != 1, dim3 grid(m * half_k / block.x / word_per_block * 3); can leave a remainder from the integer division, so the last elements of a row may not be covered by any block, which could cause problems.
- This diff now also contains an implementation of add_bias_input_layernorm for head_dim > 1024; the general pattern is sketched below.
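For context, the usual approach when hidden_dim exceeds the 1024-thread block limit is to let one block handle one row and have each thread loop over several elements of that row. Here is a minimal sketch of that pattern (illustrative names and fp32 accumulation are my assumptions; this is not the exact kernel in this diff):

```cuda
// One block per row; each thread strides over the row because a block is
// limited to 1024 threads, so hidden_dim > 1024 needs a per-thread loop.
#include <cuda_fp16.h>

template <int BLOCK>   // BLOCK must be a power of two, e.g. 1024
__global__ void add_bias_input_layernorm_large(__half* out, const __half* input,
                                               const __half* bias, const __half* gamma,
                                               const __half* beta, int n /* hidden_dim */)
{
    __shared__ float buf[BLOCK];
    __shared__ float s_mean, s_rstd;
    const __half* row_in  = input + (size_t)blockIdx.x * n;
    __half*       row_out = out   + (size_t)blockIdx.x * n;

    // 1) out = out + input + bias, accumulating the row sum in fp32
    float local_sum = 0.f;
    for (int i = threadIdx.x; i < n; i += BLOCK) {
        float v = __half2float(row_out[i]) + __half2float(row_in[i]) + __half2float(bias[i]);
        row_out[i] = __float2half(v);
        local_sum += v;
    }
    buf[threadIdx.x] = local_sum;
    __syncthreads();
    for (int s = BLOCK / 2; s > 0; s >>= 1) {   // shared-memory tree reduction
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) s_mean = buf[0] / n;
    __syncthreads();

    // 2) variance of the row
    float local_var = 0.f;
    for (int i = threadIdx.x; i < n; i += BLOCK) {
        float d = __half2float(row_out[i]) - s_mean;
        local_var += d * d;
    }
    buf[threadIdx.x] = local_var;
    __syncthreads();
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) s_rstd = rsqrtf(buf[0] / n + 1e-6f);
    __syncthreads();

    // 3) normalize, then scale and shift
    for (int i = threadIdx.x; i < n; i += BLOCK) {
        float v = (__half2float(row_out[i]) - s_mean) * s_rstd;
        row_out[i] = __float2half(v * __half2float(gamma[i]) + __half2float(beta[i]));
    }
}

// Launch with one block per row, e.g. for m rows:
//   add_bias_input_layernorm_large<1024><<<m, 1024, 0, stream>>>(out, input, bias, gamma, beta, n);
```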
Please let me know if this PR is ready to be merged, or whether we need to modify it further. Thanks!
cc @byshiue
I cannot compile the code successfully. Even after I fix the compilation issue, I get wrong results when I run with hidden_dim > 1024. How did you verify the correctness?
Thanks for your reply!
We have some internal unit tests to check its correctness, but I haven't tested this part of the code in the open-source environment. What would you suggest for testing it in open source?
Also, we will work on fixing https://github.com/NVIDIA/FasterTransformer/pull/104 and merging it soon as well :)
Here is a simple unit test. You can add some cases with hidden_dim > 1024 to it.
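For a case with hidden_dim > 1024, one illustrative way to check correctness is to compare the fp16 GPU output against an fp32 CPU reference, with a tolerance loose enough for half precision (this is only a hypothetical helper, not the test in the repository):

```cuda
// Hypothetical correctness check: gpu_out is the kernel result copied back
// with cudaMemcpy, cpu_ref is an fp32 reference of the same operation.
#include <cuda_fp16.h>
#include <cmath>
#include <cstdio>
#include <vector>

bool all_close(const std::vector<__half>& gpu_out,
               const std::vector<float>& cpu_ref,
               float atol = 1e-2f, float rtol = 1e-2f)
{
    for (size_t i = 0; i < cpu_ref.size(); ++i) {
        float g = __half2float(gpu_out[i]);
        if (std::fabs(g - cpu_ref[i]) > atol + rtol * std::fabs(cpu_ref[i])) {
            std::printf("mismatch at %zu: gpu=%f ref=%f\n", i, g, cpu_ref[i]);
            return false;
        }
    }
    return true;
}
```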
The request in #104 is supported in the next beta version.
> next beta version
Great! Thanks! I wonder when this beta version will become a stable official release, or is it already stable enough to be imported as a third-party library?
For your request and the BERT model, it should be stable. We release it as a beta version because:
- We may still break the API in the near future.
- We have not updated all the guides yet, but the BERT guide should be up to date.