[RFC] Profiling system in async mode

Open tardis-key opened this issue 3 months ago • 1 comments

Feature request

A profiling system designed for asynchronous mode. It needs to support the following key scenarios:

- Different async backends (vllm/sglang/fsdp/megatron)
- AgentLoop (default)
- FullyAsync
- Model engine?

Motivation

The adoption of AgentLoop, FullyAsync, and similar paradigms has shifted verl's workflow toward asynchronous execution. In addition, https://github.com/volcengine/verl/pull/4106 made server mode the default for rollout. However, the current profiling system appears to be incompatible with these asynchronous frameworks. The profiling framework therefore needs to be redesigned so that it supports profiling data collection across different hardware architectures.

Your contribution

WIP

Nov 20 '25 08:11 tardis-key

PROFILE_STEPS="[2,4]"
PROFILE_RANKS_ALL=True
DISCRETE=True
SAVE_PATH="/home/profile_data_discrete"
LEVEL="level1"
CONTENTS=['npu','cpu']
ANALYSIS=True
actor_rollout_ref.actor.profiler.enable=True
actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL
actor_rollout_ref.actor.profiler.tool_config.npu.discrete=$DISCRETE
actor_rollout_ref.actor.profiler.tool_config.npu.contents=$CONTENTS
actor_rollout_ref.actor.profiler.tool_config.npu.level=$LEVEL
actor_rollout_ref.actor.profiler.tool_config.npu.analysis=$ANALYSIS
actor_rollout_ref.ref.profiler.enable=True
actor_rollout_ref.ref.profiler.all_ranks=$PROFILE_RANKS_ALL
actor_rollout_ref.ref.profiler.tool_config.npu.discrete=$DISCRETE
actor_rollout_ref.ref.profiler.tool_config.npu.contents=$CONTENTS
actor_rollout_ref.ref.profiler.tool_config.npu.level=$LEVEL
actor_rollout_ref.ref.profiler.tool_config.npu.analysis=$ANALYSIS
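
The actor and ref overrides above differ only in the role name, so they can be generated rather than written out by hand. The following is a minimal sketch, not part of verl: `profiler_overrides` and `NPU_CONFIG` are hypothetical names introduced for illustration, and only the override keys and values come from the configuration above.

```python
# Hypothetical helper (not a verl API): build the repeated profiler
# overrides for each role from a single shared NPU tool configuration.
def profiler_overrides(roles, all_ranks, npu_config):
    overrides = []
    for role in roles:
        prefix = f"actor_rollout_ref.{role}.profiler"
        overrides.append(f"{prefix}.enable=True")
        overrides.append(f"{prefix}.all_ranks={all_ranks}")
        # Dict insertion order is preserved, so the keys come out in the
        # same order as in the hand-written overrides.
        for key, value in npu_config.items():
            overrides.append(f"{prefix}.tool_config.npu.{key}={value}")
    return overrides


# Values copied from the shell variables in the comment above.
NPU_CONFIG = {
    "discrete": True,
    "contents": "['npu','cpu']",
    "level": "level1",
    "analysis": True,
}

OVERRIDES = profiler_overrides(["actor", "ref"], True, NPU_CONFIG)

if __name__ == "__main__":
    # One override per line, ready to append to a training command.
    print("\n".join(OVERRIDES))
```

Generating the overrides from one dictionary keeps the actor and ref settings in sync if further roles need to be profiled later.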

With this configuration, the following profiling data is collected (no results are produced for generate):

- profile_data_discrete
  | - actor_compute_log_prob
  | - actor_update
  | - ref_compute_log_prob
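
A check like the one described above can be scripted. The following is a minimal sketch under stated assumptions: `missing_phases` is a hypothetical helper, and the idea that a working setup would write a subdirectory literally named `generate` is an assumption made for illustration. The other three names are taken from the listing above.

```python
from pathlib import Path

# Phase directory names from the listing above, plus "generate".
# ASSUMPTION: the directory name "generate" is a guess for illustration;
# the source only says that results for generate cannot be obtained.
EXPECTED_PHASES = [
    "actor_compute_log_prob",
    "actor_update",
    "ref_compute_log_prob",
    "generate",
]


def missing_phases(save_path, expected=EXPECTED_PHASES):
    """Return the expected phase names with no subdirectory under save_path."""
    root = Path(save_path)
    if not root.is_dir():
        # Nothing was written at all, so every phase is missing.
        return list(expected)
    present = {entry.name for entry in root.iterdir() if entry.is_dir()}
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # SAVE_PATH value from the configuration in this comment.
    print(missing_phases("/home/profile_data_discrete"))
```

Running it after each profiled step gives a quick signal of which phases the profiler did not cover.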

Nov 24 '25 09:11 tardis-key