Flash attention implementations do not handle case where value vectors have different dimension from query vectors

fairydreaming opened this issue 1 year ago • 6 comments

For example, in ggml.c the implementations of the ops related to flash attention declare a variable D and use it both as the dimension of the value vector and as the dimension of the key/query vector. This will fail for models where query and value vectors have different lengths (for example DeepSeek-V2).
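To make the problem concrete, here is an illustrative scalar sketch of computing one flash-attention output row, written the way the current code treats head sizes (this is not the actual ggml.c code; the function and parameter names are made up for illustration):

```c
#include <math.h>
#include <stdint.h>

// Illustrative sketch only: a single head size D is used both for the Q.K dot
// product and for accumulating the output from V. This is only valid when the
// query/key length equals the value length.
static void flash_attn_row_single_D(
        const float * q,    // [D]        one query vector
        const float * k,    // [n_kv, D]  key vectors
        const float * v,    // [n_kv, D]  value vectors - assumed to have length D too!
        float       * out,  // [D]        output row
        int64_t D, int64_t n_kv) {
    float M = -INFINITY;    // running maximum for a numerically stable softmax
    float S = 0.0f;         // running sum of exp(s - M)
    for (int64_t i = 0; i < D; ++i) out[i] = 0.0f;

    for (int64_t ic = 0; ic < n_kv; ++ic) {
        float s = 0.0f;
        for (int64_t i = 0; i < D; ++i) {        // D used as the Q/K length
            s += q[i]*k[ic*D + i];
        }
        const float Mold = M;
        M = s > M ? s : M;
        const float ms = expf(Mold - M);         // rescale the previous accumulator
        const float vs = expf(s - M);
        for (int64_t i = 0; i < D; ++i) {        // D also used as the V length and stride
            out[i] = out[i]*ms + vs*v[ic*D + i]; // breaks when Dv != Dq (e.g. DeepSeek-V2)
        }
        S = S*ms + vs;
    }
    for (int64_t i = 0; i < D; ++i) out[i] /= S;
}
```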

Below are selected fragments of GGML_OP_FLASH_ATTN_EXT op implementation to illustrate the problem.

Creation of result tensor:

https://github.com/ggerganov/llama.cpp/blob/51e9d02599336e62948d29f1d6c05addeb921ac2/ggml.c#L6792-L6793

(Note that the query tensor dimensions are used everywhere, while in reality ne[0] should be equal to ne[0] of the value tensor, because the attention output is a linear combination of value vectors.)
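To illustrate the shape fix with the ggml API (a sketch only; the ordering of the remaining dimensions is assumed to follow the query tensor's layout as in the linked fragment, so treat it as approximate):

```c
// The first dimension of the result should come from the VALUE tensor, since
// each output row is a linear combination of value vectors; the remaining
// dimensions keep the query tensor's layout. (Sketch, not the exact code at
// the link above.)
int64_t ne[4] = { v->ne[0] /* was q->ne[0] */, q->ne[2], q->ne[1], q->ne[3] };
struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
```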

Definition of variable D:

https://github.com/ggerganov/llama.cpp/blob/51e9d02599336e62948d29f1d6c05addeb921ac2/ggml.c#L15879

Assertions all expecting the same length:

https://github.com/ggerganov/llama.cpp/blob/51e9d02599336e62948d29f1d6c05addeb921ac2/ggml.c#L15889-L15891

Usage of D as a dimension of a value vector:

https://github.com/ggerganov/llama.cpp/blob/51e9d02599336e62948d29f1d6c05addeb921ac2/ggml.c#L15958

Usage of D as a dimension of a query vector:

https://github.com/ggerganov/llama.cpp/blob/51e9d02599336e62948d29f1d6c05addeb921ac2/ggml.c#L15985-L15987

Suggested solution: create two variables, Dq (length of the query vector) and Dv (length of the value vector), and use Dq as the query/key vector length and Dv as the value vector length. I fixed ggml_compute_forward_flash_attn_ext_f16() this way and it produces correct results (confirmed by running DeepSeek-V2 with the -fa option).
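For reference, here is the same kind of scalar sketch as above with the suggested Dq/Dv split applied (again not the actual ggml.c code; the names are made up for illustration):

```c
#include <math.h>
#include <stdint.h>

// Sketch of the suggested fix: separate head sizes Dq (query/key length) and
// Dv (value length), so that models with Dv != Dq (e.g. DeepSeek-V2) get a
// correctly sized output.
static void flash_attn_row_split_D(
        const float * q,    // [Dq]        one query vector
        const float * k,    // [n_kv, Dq]  key vectors
        const float * v,    // [n_kv, Dv]  value vectors
        float       * out,  // [Dv]        output row - sized by Dv, not Dq
        int64_t Dq, int64_t Dv, int64_t n_kv) {
    float M = -INFINITY;
    float S = 0.0f;
    for (int64_t i = 0; i < Dv; ++i) out[i] = 0.0f;

    for (int64_t ic = 0; ic < n_kv; ++ic) {
        float s = 0.0f;
        for (int64_t i = 0; i < Dq; ++i) {       // dot product over the Q/K length
            s += q[i]*k[ic*Dq + i];
        }
        const float Mold = M;
        M = s > M ? s : M;
        const float ms = expf(Mold - M);
        const float vs = expf(s - M);
        for (int64_t i = 0; i < Dv; ++i) {       // accumulate over the V length
            out[i] = out[i]*ms + vs*v[ic*Dv + i];
        }
        S = S*ms + vs;
    }
    for (int64_t i = 0; i < Dv; ++i) out[i] /= S;
}
```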

I'm not 100% sure if CUDA and Metal implementations are also affected, but it's likely - I also found the same variable D used in the code and comments like "K and V have same shape".

fairydreaming avatar May 17 '24 15:05 fairydreaming

Thanks for reporting that - CUDA and Metal kernels should also be affected. We should fix that, but maybe after DS2 support is merged so we have something to test with

ggerganov avatar May 17 '24 16:05 ggerganov

Possibly relevant - https://github.com/ggerganov/llama.cpp/issues/2445#issuecomment-2143326075

oldgithubman avatar Jun 01 '24 07:06 oldgithubman

@JohannesGaessler don't wanna bother you, but I assume you'd be the best suited to handle this, or at least to shed some light on how to handle it

bartowski1182 avatar Jun 17 '24 21:06 bartowski1182

I don't think you need any special considerations in terms of program correctness - you would just have to implement it. The bigger challenge will be to implement this in such a way that the performance is good.

JohannesGaessler avatar Jun 17 '24 21:06 JohannesGaessler

Btw, supporting different K/V head sizes might dramatically increase the number of FA kernels that we have to compile, so probably not really worth it
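To put a rough number on that: the FA kernels are instantiated per head size, so allowing independent Dq and Dv turns a list of head sizes into a grid of (Dq, Dv) pairs. A small back-of-the-envelope in C (the head-size list is just an example, not the exact set llama.cpp builds):

```c
#include <stdio.h>

int main(void) {
    // Example head sizes only - not the exact set of FA kernels llama.cpp compiles.
    const int head_sizes[] = { 64, 80, 96, 112, 128, 256 };
    const int n = (int)(sizeof(head_sizes)/sizeof(head_sizes[0]));

    // Today: one kernel variant per head size (Dq == Dv is assumed).
    printf("single-D variants:  %d\n", n);

    // With independent Dq and Dv, every pair would need its own instantiation,
    // unless the build is pruned to the pairs that real models actually use.
    printf("(Dq, Dv) variants:  %d\n", n*n);
    return 0;
}
```

In practice only a handful of extra pairs would be needed (e.g. the 192/128 used by DeepSeek-V2), but each one still multiplies with the kernels' other template parameters.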

ggerganov avatar Jun 18 '24 06:06 ggerganov

@ggerganov reopen? FA would be very useful for deepseek-v2

oldgithubman avatar Jul 21 '24 07:07 oldgithubman

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Sep 04 '24 01:09 github-actions[bot]

Any interest in re-opening, now that we have DS-R1?

grimulkan avatar Jan 31 '25 05:01 grimulkan

I should maybe mention that right now I'm rewriting the CUDA code for FlashAttention using tensor cores. There are still some performance issues for small batch sizes; if things go well, I'll make a PR later today.

JohannesGaessler avatar Jan 31 '25 07:01 JohannesGaessler

Thanks for the update. For the Metal implementation, I am not yet working on it and won't start for at least a week probably. If anyone is interested at taking a stab - please go ahead.

ggerganov avatar Jan 31 '25 08:01 ggerganov

Also just to be clear: I am not adding support for any new models, I'm just working on the performance by using PTX mma instructions instead of the "high-level" WMMA interface.

JohannesGaessler avatar Jan 31 '25 08:01 JohannesGaessler

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Mar 17 '25 01:03 github-actions[bot]

!unstale What about introducing configurations specific to the DeepSeek LLMs and adding CMake flags to enable building them, so the default list of built configurations doesn't get bloated? Or adding flags for building templates for custom head sizes?
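As a sketch of what that could look like on the source side (the GGML_CUDA_FA_DEEPSEEK define and the dispatch helper below are hypothetical, purely for illustration): a CMake option would only have to add a compile definition, and the head-size dispatch would include the extra (Dq, Dv) cases when it is set:

```c
#include <stdbool.h>

// Hypothetical sketch: the extra (Dq, Dv) combinations are compiled only when
// the build defines GGML_CUDA_FA_DEEPSEEK (e.g. via a CMake option).
static bool fa_supported(int Dq, int Dv) {
    switch (Dq) {
        case  64: return Dv ==  64;   // a  64/ 64 kernel is built
        case 128: return Dv == 128;   // a 128/128 kernel is built
#ifdef GGML_CUDA_FA_DEEPSEEK
        case 192: return Dv == 128;   // a 192/128 kernel is built (DeepSeek-V2)
#endif
        default:  return false;       // unsupported combination -> fall back to the non-FA path
    }
}
```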

pl752 avatar Mar 25 '25 11:03 pl752