[RFC] Profiling system in async mode
Feature request
A profiling system designed for asynchronous mode. It needs to support the following key scenarios:
- different async backends (vllm/sglang/fsdp/megatron).
- AgentLoop (default)
- FullyAsync
- model engine?
Motivation
The adoption of AgentLoop, FullyAsync, and similar paradigms has shifted Verl's workflow toward asynchronous execution, and https://github.com/volcengine/verl/pull/4106 makes server mode the default for rollout. However, the current profiling system appears to be incompatible with these asynchronous frameworks. The profiling framework therefore needs to be redesigned so that profiling data can still be collected in async mode across different hardware architectures.
Your contribution
WIP
PROFILE_STEPS="[2,4]"
PROFILE_RANKS_ALL=True
DISCRETE=True
SAVE_PATH="/home/profile_data_discrete"
LEVEL="level1"
CONTENTS=['npu','cpu']
ANALYSIS=True
actor_rollout_ref.actor.profiler.enable=True
actor_rollout_ref.actor.profiler.all_ranks=$PROFILE_RANKS_ALL
actor_rollout_ref.actor.profiler.tool_config.npu.discrete=$DISCRETE
actor_rollout_ref.actor.profiler.tool_config.npu.contents=$CONTENTS
actor_rollout_ref.actor.profiler.tool_config.npu.level=$LEVEL
actor_rollout_ref.actor.profiler.tool_config.npu.analysis=$ANALYSIS
actor_rollout_ref.ref.profiler.enable=True
actor_rollout_ref.ref.profiler.all_ranks=$PROFILE_RANKS_ALL
actor_rollout_ref.ref.profiler.tool_config.npu.discrete=$DISCRETE
actor_rollout_ref.ref.profiler.tool_config.npu.contents=$CONTENTS
actor_rollout_ref.ref.profiler.tool_config.npu.level=$LEVEL
actor_rollout_ref.ref.profiler.tool_config.npu.analysis=$ANALYSIS \
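The overrides above repeat the same NPU settings for each role. As a minimal sketch, the same list can be generated from one table of options. The helper name `profiler_overrides` is hypothetical and not part of verl; only the key names shown above are used, and `contents` is written as `[npu,cpu]` because that is what the shell passes on after removing the quotes.

```python
from typing import Dict, List


def profiler_overrides(
    roles: List[str],
    npu_options: Dict[str, str],
    all_ranks: bool = True,
) -> List[str]:
    """Build the per-role profiler overrides (hypothetical helper, not a verl API)."""
    overrides = []
    for role in roles:
        prefix = f"actor_rollout_ref.{role}.profiler"
        overrides.append(f"{prefix}.enable=True")
        overrides.append(f"{prefix}.all_ranks={all_ranks}")
        # Every role receives the same NPU tool settings.
        for key, value in npu_options.items():
            overrides.append(f"{prefix}.tool_config.npu.{key}={value}")
    return overrides


# Values copied from the variables in the configuration above.
NPU_OPTIONS = {
    "discrete": "True",
    "contents": "[npu,cpu]",
    "level": "level1",
    "analysis": "True",
}

overrides = profiler_overrides(["actor", "ref"], NPU_OPTIONS)
# Print one override per line, joined with shell line continuations.
print(" \\\n".join(overrides))
```

Adding another role to the list would extend the output in the same pattern, but whether a rollout role accepts these keys in async mode is exactly the open question of this RFC.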
With this configuration, the following profiling data is collected (no results are produced for generate):
- profile_data_discrete
  - actor_compute_log_prob
  - actor_update
  - ref_compute_log_prob
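To make the gap easy to reproduce, the output directory can be checked after a run. This is a minimal sketch: `check_profile_output` is a hypothetical helper, and the directory name `rollout_generate` is an assumption, since the current system produces no generate output whose name could be confirmed.

```python
from pathlib import Path
from typing import Dict, Iterable


def check_profile_output(save_path: str, expected_roles: Iterable[str]) -> Dict[str, bool]:
    """Report, per role, whether a non-empty profiling directory exists."""
    root = Path(save_path)
    result = {}
    for role in expected_roles:
        role_dir = root / role
        # A role counts as profiled only if its directory exists and has content.
        result[role] = role_dir.is_dir() and any(role_dir.iterdir())
    return result


if __name__ == "__main__":
    expected = [
        "actor_compute_log_prob",
        "actor_update",
        "ref_compute_log_prob",
        "rollout_generate",  # assumed name; no generate output exists today
    ]
    status = check_profile_output("/home/profile_data_discrete", expected)
    for role, found in status.items():
        print(f"{role}: {'ok' if found else 'MISSING'}")
```

With the results listed above, the first three entries would be reported as present and the generate entry as missing.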