[Long seq length] GPT seq length constraint
It seems that the maximum seq length supported for GPT is 4096:
(https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/models/multi_gpu_gpt/ParallelGptDecoderLayerWeight.h#L31)
BERT also seems to have the same maximum seq length.
May I ask the following questions:
- Where does the constraint come from, i.e., which kernel?
- Do you have a plan to support longer seq lengths?
Thanks!
Hi, for the first question, there seems to be a constraint in the softmax kernel.
https://github.com/NVIDIA/FasterTransformer/issues/663
Hello, I see it. I'm trying to modify it in order to remove this constraint, but it seems to have precision problems. Does anyone know the specific reason for this constraint?
template<typename T, typename T_IN>
void invokeMaskedSoftmax(MaskedSoftmaxParam<T, T_IN>& param, cudaStream_t stream)
{
    // attention_score, (batch_size, head_num, q_length, k_length), softmax output.
    // qk, (batch_size, head_num, q_length, k_length), QK^T.
    // attention_mask, (batch_size, q_length, k_length), attention mask.
    // linear_bias_slopes, (head_num,) the slopes of the linear position bias.
    dim3 grid(param.q_length, param.batch_size, param.num_heads);
    if (param.batch_size * param.num_heads > 360) {
        grid.x = ceil(float(param.q_length) / 32.0f);
    }

    bool is_half2 = sizeof(T) == 2 && sizeof(T_IN) == 2 && param.k_length % 2 == 0;
    dim3 block((param.k_length / (is_half2 ? 2 : 1) + 31) / 32 * 32);

    // Added modification
    if (block.x > 4096 && block.x <= 8192) {
        LAUNCH_MAKSED_SOFTMAX(8)
    }
    else if (block.x > 2048 && block.x <= 4096) {
        LAUNCH_MAKSED_SOFTMAX(4)
    }
    else if (block.x > 1024) {
        LAUNCH_MAKSED_SOFTMAX(2)
    }
    else if (block.x > 0) {
        LAUNCH_MAKSED_SOFTMAX(1)
    }
    else {
        FT_CHECK(param.k_length <= 4096);
    }
}
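For background on where the 4096 cap comes from: as far as I can tell, LAUNCH_MAKSED_SOFTMAX(ITEMS_PER_THREAD) divides block.x by ITEMS_PER_THREAD and launches a softmax kernel in which each thread covers ITEMS_PER_THREAD elements of a k_length row. Since CUDA limits a block to 1024 threads, the original dispatch with at most 4 items per thread tops out at k_length = 4096 on the fp32 path (half2 packing doubles the reach), and the added 8-items-per-thread branch extends it to 8192. Below is a minimal, standalone sketch of this items-per-thread row softmax pattern; it is not FasterTransformer's actual kernel, the names and indexing are illustrative only, and it assumes a power-of-two block size.

#include <cuda_runtime.h>
#include <cfloat>
#include <cmath>
#include <cstdio>

// Sketch: one block per row, each thread covering ITEMS_PER_THREAD elements.
template<int ITEMS_PER_THREAD>
__global__ void rowSoftmaxSketch(float* data, int row_len)
{
    extern __shared__ float red[];               // one slot per thread for reductions
    float* row = data + (size_t)blockIdx.x * row_len;

    // 1. Per-thread max over its ITEMS_PER_THREAD elements of the row.
    float local_max = -FLT_MAX;
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int idx = threadIdx.x + i * blockDim.x;
        if (idx < row_len) {
            local_max = fmaxf(local_max, row[idx]);
        }
    }
    red[threadIdx.x] = local_max;
    __syncthreads();
    // Block-wide max reduction (assumes blockDim.x is a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            red[threadIdx.x] = fmaxf(red[threadIdx.x], red[threadIdx.x + s]);
        }
        __syncthreads();
    }
    float row_max = red[0];
    __syncthreads();

    // 2. Per-thread sum of exp(x - max), then block-wide sum reduction.
    float local_sum = 0.0f;
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int idx = threadIdx.x + i * blockDim.x;
        if (idx < row_len) {
            local_sum += expf(row[idx] - row_max);
        }
    }
    red[threadIdx.x] = local_sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            red[threadIdx.x] += red[threadIdx.x + s];
        }
        __syncthreads();
    }
    float row_sum = red[0];
    __syncthreads();

    // 3. Normalize in place.
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int idx = threadIdx.x + i * blockDim.x;
        if (idx < row_len) {
            row[idx] = expf(row[idx] - row_max) / row_sum;
        }
    }
}

int main()
{
    const int rows = 2, row_len = 8192;          // 8192 > 4096, so 4 items/thread is not enough
    const int items = 8, threads = 1024;         // 1024 threads * 8 items covers the row
    float* d = nullptr;
    cudaMalloc(&d, (size_t)rows * row_len * sizeof(float));
    cudaMemset(d, 0, (size_t)rows * row_len * sizeof(float));
    rowSoftmaxSketch<items><<<rows, threads, threads * sizeof(float)>>>(d, row_len);
    cudaDeviceSynchronize();
    cudaFree(d);
    printf("done\n");
    return 0;
}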
I take back what I just said: there seems to be no problem with this change, at least when compared against HuggingFace on a context sample close to 6K. Awaiting further testing.
Thanks @lkm2835 and @StarWorkXc.
Hi @StarWorkXc, I also tested with prompt lengths >7K, with similar modifications in my repo. The results are reasonable. I am thinking of submitting a PR for that. What do you think?
I think that's fine, go ahead and submit it. I'm still trying to make it support 16k, but it's rather troublesome.
Sent a PR. Could someone help review it?
I think that's fine, go ahead and submit it. I'm still trying to make it support 16k, but it's rather troublesome.
@StarWorkXc Is there a good way to extend it to 16k or higher?
The limitation is mainly in src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp. If your GPU architecture is recent enough, modify the kernel parameters so that the kernel can use the maximum shared memory of the current architecture. If the GPU is older and shared memory is capped at 48 KB, try modifications in two directions:
1. Following FlashAttention, modify the core logic of decoder_masked_multihead_attention_template.
2. Change the data type of the attention scores stored in shared memory from fp32 to fp16.
The second point is fairly simple; my modified version currently supports contexts up to 32K on GPUs with 96 KB of shared memory.
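To make the first direction concrete, here is a minimal sketch of opting a kernel into more than the default 48 KB of dynamic shared memory on Volta-or-newer GPUs via cudaFuncSetAttribute (this is the standard CUDA mechanism; the kernel name myMmhaKernel is a placeholder rather than FasterTransformer's actual symbol, and the kernel body is elided).

#include <cuda_runtime.h>
#include <cstdio>

__global__ void myMmhaKernel(const float* in, float* out, int len)
{
    // In the real kernel, the per-query attention scores would live in dynamic
    // shared memory declared like this; the attention logic is elided in this sketch.
    extern __shared__ float smem[];
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Maximum dynamic shared memory per block available after opt-in (bytes),
    // e.g. roughly 96 KB on V100 and 163 KB on A100.
    size_t max_smem = prop.sharedMemPerBlockOptin;
    printf("opt-in shared memory per block: %zu bytes\n", max_smem);

    // Without this call, requesting more than 48 KB of dynamic shared memory
    // at launch fails with an invalid-argument error.
    cudaFuncSetAttribute(myMmhaKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)max_smem);

    // Launch with the enlarged dynamic shared-memory size.
    myMmhaKernel<<<1, 256, max_smem>>>(nullptr, nullptr, 0);
    cudaDeviceSynchronize();
    return 0;
}

How much context this buys depends on how many bytes the kernel stores per key position, which is why switching the score buffer from fp32 to fp16 (the second direction above) roughly doubles the reachable context for the same shared-memory budget.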
@StarWorkXc That's a good approach. Would you mind sharing the modified branch?
TensorRT-LLM is released and FasterTransformer development has transitioned to TensorRT-LLM.
TensorRT-LLM also supports very long sequence lengths; please try it.
@byshiue Do you mean that FasterTransformer will be replaced by TensorRT-LLM and fastertransformer_backend will be replaced by tensorrtllm_backend?
Yes.
Thanks @byshiue, I will try TRT-LLM