[RFC][BREAKING][misc] refactor: Abstract and unify attention utils
Abstract and unify the attention utilities and backend selection for Transformers, Megatron, vLLM, and SGLang.
```yaml
# Example (per-role)
actor:
  attn_implementation: auto  # auto|flash_attention_3|flash_attention_2|flex_attention|sdpa|eager
rollout:
  attn_implementation: ${actor.attn_implementation}  # auto|flash_attention_3|flash_attention_2|flashinfer|flex_attention|sdpa|eager
ref:
  attn_implementation: ${actor.attn_implementation}
critic:
  attn_implementation: auto
reward_model:
  attn_implementation: auto

# Global override, optional
# VERL_ATTN_IMPLEMENTATION=flash_attention_3
```
- Priority: per-role setting (when not `auto`) > `VERL_ATTN_IMPLEMENTATION` env var > `auto` (default)
- `auto` resolves in order: `flash_attention_3` > `flash_attention_2` > `flex_attention` > `sdpa` > `eager` (see the sketch after this list)
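A minimal sketch of this resolution order; the function and availability probes are illustrative assumptions, not the actual verl implementation:

```python
import importlib.util
import os

# Package probes per backend; these module names are assumptions for illustration.
_PROBES = {
    "flash_attention_3": "flash_attn_interface",  # FA3 python package (assumed)
    "flash_attention_2": "flash_attn",
    "flex_attention": "torch",   # flex_attention ships with recent torch
    "sdpa": "torch",
    "eager": None,               # always available
}

def _is_available(impl: str) -> bool:
    module = _PROBES[impl]
    return module is None or importlib.util.find_spec(module) is not None

def resolve_attn_implementation(per_role: str = "auto") -> str:
    """Per-role (when not 'auto') > VERL_ATTN_IMPLEMENTATION env var > auto-detection."""
    if per_role != "auto":
        return per_role
    env = os.getenv("VERL_ATTN_IMPLEMENTATION", "auto")
    if env != "auto":
        return env
    for impl in ("flash_attention_3", "flash_attention_2", "flex_attention", "sdpa", "eager"):
        if _is_available(impl):
            return impl
    return "eager"
```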
Deprecated: `actor_rollout_ref.model.override_config.attn_implementation`; use the per-role setting instead.
Actually, those can be overridden via the model's config...
For the trainer, we heavily rely on flash-attention's `flash_attn_varlen_func` for padding-free training, so we will not support the flex/sdpa/eager attention backends.
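For context, a minimal sketch of the padding-free (varlen) path this refers to, assuming `flash_attn` is installed; the shapes and surrounding code are illustrative, not the trainer's actual code:

```python
import torch
from flash_attn import flash_attn_varlen_func

# Two sequences of length 3 and 5, packed into one (total_tokens, heads, dim) tensor.
seqlens = torch.tensor([3, 5], dtype=torch.int32)
cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))  # [0, 3, 8]
total, nheads, headdim = int(seqlens.sum()), 8, 64

q = torch.randn(total, nheads, headdim, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens.cuda(),
    cu_seqlens_k=cu_seqlens.cuda(),
    max_seqlen_q=int(seqlens.max()),
    max_seqlen_k=int(seqlens.max()),
    causal=True,
)
# out has shape (total_tokens, nheads, headdim); no padding tokens are ever materialized.
```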
> Actually, those can be overridden via the model's config...
@vermouth1992 As far as I know, that forces the forward and backward to use the same attention backend (or rather, you don't specify the forward backend at all right now; SGLang auto-resolves it). It also creates a dependency on the flash-attention 2 (`flash_attn`) packages, and there is hardcoded NPU handling for cases that Transformers can already resolve.
> For the trainer, we heavily rely on flash-attention's `flash_attn_varlen_func` for padding-free training, so we will not support the flex/sdpa/eager attention backends.
Isn't that only for mrope? dp_actor does `unpad_input` and then forwards with `attention_mask=None`, and Megatron uses `PackedSeqParams`?
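For reference, a minimal sketch of the unpad-then-`attention_mask=None` pattern mentioned above; the function names and forward signature are illustrative rather than the exact dp_actor code, and `unpad_input`'s return arity differs across flash-attn versions, so only the leading return values are used:

```python
import torch
from flash_attn.bert_padding import unpad_input, pad_input

def rmpad_forward(model, input_ids, attention_mask, position_ids):
    batch, seqlen = input_ids.shape

    # Remove padding tokens: (batch, seqlen) -> (1, total_nnz).
    unpadded = unpad_input(input_ids.unsqueeze(-1), attention_mask)
    input_ids_rmpad, indices = unpadded[0].transpose(0, 1), unpadded[1]
    position_ids_rmpad = unpad_input(position_ids.unsqueeze(-1), attention_mask)[0].transpose(0, 1)

    # With packed sequences the model is called with attention_mask=None;
    # sequence boundaries come from position_ids / cu_seqlens on the flash-attn path.
    output = model(
        input_ids=input_ids_rmpad,
        attention_mask=None,
        position_ids=position_ids_rmpad,
        use_cache=False,
    )
    logits_rmpad = output.logits.squeeze(0)  # (total_nnz, vocab)

    # Scatter back to the padded layout for downstream loss masking.
    return pad_input(logits_rmpad, indices, batch, seqlen)
```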