[RFC][BREAKING][misc] refactor: Abstract and unify attention utils
Abstract and unify the attention utilities and backend selection for Transformers, Megatron, vLLM, and SGLang.
```yaml
# Example (per-role)
actor:
  attn_implementation: auto  # auto|flash_attention_3|flash_attention_2|flex_attention|sdpa|eager
rollout:
  attn_implementation: ${actor.attn_implementation}  # auto|flash_attention_3|flash_attention_2|flashinfer|flex_attention|sdpa|eager
ref:
  attn_implementation: ${actor.attn_implementation}
critic:
  attn_implementation: auto
reward_model:
  attn_implementation: auto

# Global override, optional
# VERL_ATTN_IMPLEMENTATION=flash_attention_3
```
- Priority: per-role setting (when not `auto`) > `VERL_ATTN_IMPLEMENTATION` env var > `auto` (default)
- `auto` resolves in order: `flash_attention_3` > `flash_attention_2` > `flex_attention` > `sdpa` > `eager` (see the sketch after this list)
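A minimal sketch of this resolution order; the function and availability probes are illustrative assumptions, not the actual verl implementation:

```python
import importlib.util
import os

# Package probes per backend; these module names are assumptions for illustration.
_PROBES = {
    "flash_attention_3": "flash_attn_interface",  # FA3 python package (assumed)
    "flash_attention_2": "flash_attn",
    "flex_attention": "torch",   # flex_attention ships with recent torch
    "sdpa": "torch",
    "eager": None,               # always available
}

def _is_available(impl: str) -> bool:
    module = _PROBES[impl]
    return module is None or importlib.util.find_spec(module) is not None

def resolve_attn_implementation(per_role: str = "auto") -> str:
    """Per-role (when not 'auto') > VERL_ATTN_IMPLEMENTATION env var > auto-detection."""
    if per_role != "auto":
        return per_role
    env = os.getenv("VERL_ATTN_IMPLEMENTATION", "auto")
    if env != "auto":
        return env
    for impl in ("flash_attention_3", "flash_attention_2", "flex_attention", "sdpa", "eager"):
        if _is_available(impl):
            return impl
    return "eager"
```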
Deprecated: `actor_rollout_ref.model.override_config.attn_implementation`; use the per-role setting instead.
Actually, those can be overridden via the model's config...
For the trainer, we heavily rely on flash-attention's `flash_attn_varlen_func` for padding-free training, so we will not support the flex/sdpa/eager attention backends.
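For context, a minimal sketch of the padding-free (varlen) path this refers to, assuming `flash_attn` is installed; the shapes and surrounding code are illustrative, not the trainer's actual code:

```python
import torch
from flash_attn import flash_attn_varlen_func

# Two sequences of length 3 and 5, packed into one (total_tokens, heads, dim) tensor.
seqlens = torch.tensor([3, 5], dtype=torch.int32)
cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))  # [0, 3, 8]
total, nheads, headdim = int(seqlens.sum()), 8, 64

q = torch.randn(total, nheads, headdim, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens.cuda(),
    cu_seqlens_k=cu_seqlens.cuda(),
    max_seqlen_q=int(seqlens.max()),
    max_seqlen_k=int(seqlens.max()),
    causal=True,
)
# out has shape (total_tokens, nheads, headdim); no padding tokens are ever materialized.
```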
> Actually, those can be overridden via the model's config...
@vermouth1992 As far as I know, that forces the forward and backward to use the same attention backend (or rather, you don't specify the forward backend at all right now; SGLang auto-resolves it). It also creates a dependency on the flash-attention 2 (`flash_attn`) packages, and there is hardcoded NPU handling for cases that Transformers can already resolve.
> For the trainer, we heavily rely on flash-attention's `flash_attn_varlen_func` for padding-free training, so we will not support the flex/sdpa/eager attention backends.
Isn't that only for mrope? dp_actor does `unpad_input` and then forwards with `attention_mask=None`, and Megatron uses `PackedSeqParams`?
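For reference, a minimal sketch of the unpad-then-`attention_mask=None` pattern mentioned above; the function names and forward signature are illustrative rather than the exact dp_actor code, and `unpad_input`'s return arity differs across flash-attn versions, so only the leading return values are used:

```python
import torch
from flash_attn.bert_padding import unpad_input, pad_input

def rmpad_forward(model, input_ids, attention_mask, position_ids):
    batch, seqlen = input_ids.shape

    # Remove padding tokens: (batch, seqlen) -> (1, total_nnz).
    unpadded = unpad_input(input_ids.unsqueeze(-1), attention_mask)
    input_ids_rmpad, indices = unpadded[0].transpose(0, 1), unpadded[1]
    position_ids_rmpad = unpad_input(position_ids.unsqueeze(-1), attention_mask)[0].transpose(0, 1)

    # With packed sequences the model is called with attention_mask=None;
    # sequence boundaries come from position_ids / cu_seqlens on the flash-attn path.
    output = model(
        input_ids=input_ids_rmpad,
        attention_mask=None,
        position_ids=position_ids_rmpad,
        use_cache=False,
    )
    logits_rmpad = output.logits.squeeze(0)  # (total_nnz, vocab)

    # Scatter back to the padded layout for downstream loss masking.
    return pad_input(logits_rmpad, indices, batch, seqlen)
```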