Can qwen3-235B-moe be trained successfully on 4 H100 nodes?
Can qwen3-235B moe be successfully trained on 4 H100 (80GB) machines? If so, could you provide an example? I noticed that the official sample was trained on 4 H20 (96GB) machines.
Full-parameter training
The script can also run on 4 H100 nodes, provided each node has sufficient CPU memory (>1.5 TB). We look forward to your feedback.
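A quick way to check the per-node requirement is to parse `MemTotal` from `/proc/meminfo` (Linux-only). This is a small sketch; the 1.5 TB threshold comes from the reply above, and the sample string below is synthetic, not a real node:

```python
def mem_total_tb(meminfo: str) -> float:
    """Parse MemTotal (reported in kB) from /proc/meminfo text and convert to TB."""
    for line in meminfo.splitlines():
        if line.startswith("MemTotal:"):
            kb = int(line.split()[1])
            return kb * 1024 / 1e12
    raise ValueError("MemTotal not found")

# Synthetic /proc/meminfo snippet for illustration; on a real node use:
#   mem_total_tb(open("/proc/meminfo").read())
sample = "MemTotal:       1717986918 kB\nMemFree:  123 kB"
print(round(mem_total_tb(sample), 2))  # 1.76
```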
Sorry, I tried training on 4 H100 machines using the provided script, but it always raises OOM; my CPU memory is > 1.5 TB.
This is my script:
#!/usr/bin/env bash
set -xeuo pipefail

# !!!!!! important !!!!!!
# Set the following environment variables on all your nodes:
#   CUDA_DEVICE_MAX_CONNECTIONS: "1"
#   NCCL_NVLS_ENABLE: "0"
#   VLLM_USE_V1: 1
# Install mbridge==0.1.13 on all your nodes with the following command:
#   pip3 install git+https://github.com/ISEEKYAN/mbridge

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
[ -f "${SCRIPT_DIR}/env.sh" ] && source "${SCRIPT_DIR}/env.sh"
IP=10.134.2.27
adv_estimator=grpo
use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=True
kl_loss_coef=0.001

clip_ratio_low=0.2
clip_ratio_high=0.28

max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 4))
enable_overlong_buffer=True
overlong_buffer_len=$((1024 * 1))
overlong_penalty_factor=1.0

loss_agg_mode="token-mean"

train_prompt_bsz=${TRAIN_BS:-4}
n_resp_per_prompt=8
train_prompt_mini_bsz=4
# minimum nodes needed for qwen3-235B-A22B
NNODES=${NNODES:-4}
# Paths
RAY_DATA_HOME=/tmp/ray
MODEL_PATH=/primus_checkpoint/outside/open-model/modelscope/Qwen/Qwen3-235B-A22B-Thinking-2507
#MODEL_PATH=/primus_checkpoint/outside_wulan/C5F0HPtJG2irNmZqKr6rlHkhWyTEpWTT/Qwen3-30B-A3B-Thinking-2507
TRAIN_FILE=data/gsm8k/train.parquet
TEST_FILE=data/gsm8k/test.parquet
# Algorithm
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7
# Performance-related parameters
use_dynamic_bsz=True
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 10 / 10))
infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
offload=True
OPTIM_OFFLOAD=${OPTIM_OFFLOAD:-True}
gen_tp=8
train_tp=${TP:-4}
train_pp=${PP:-8}
EP=${EP:-4}
ETP=1
CP=1
optimizer_offload_fraction=${OFFLOAD_FRACTION:-1.}
last_layer=${LAST_LAYER:-6}

project_name='verl-qwen3'
exp_name="235B-${NNODES}-pp${train_pp}-tp${train_tp}-ep${EP}-actor-length${actor_ppo_max_token_len}"
CKPTS_DIR=$RAY_DATA_HOME/ckpt/${project_name}/${exp_name}
# TODO: support cuda graph for rollout by setting the following config
# actor_rollout_ref.rollout.cudagraph_capture_sizes=[1,2,4,8,16,32]
# actor_rollout_ref.rollout.enforce_eager=False
ray job submit --address="http://${IP}:8265" \
    --runtime-env=verl/trainer/runtime_env.yaml \
    --working-dir . \
    --no-wait \
    -- \
    python3 -m verl.trainer.main_ppo \
    --config-path=config \
    --config-name='ppo_megatron_trainer.yaml' \
    data.train_files="${TRAIN_FILE}" \
    data.val_files="${TEST_FILE}" \
    data.prompt_key=prompt \
    data.truncation='left' \
    data.max_prompt_length=${max_prompt_length} \
    data.max_response_length=${max_response_length} \
    data.train_batch_size=${train_prompt_bsz} \
    actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.enforce_eager=True \
    actor_rollout_ref.rollout.free_cache_engine=True \
    algorithm.adv_estimator=${adv_estimator} \
    algorithm.use_kl_in_reward=${use_kl_in_reward} \
    algorithm.kl_ctrl.kl_coef=${kl_coef} \
    actor_rollout_ref.model.use_fused_kernels=True \
    actor_rollout_ref.actor.megatron.use_mbridge=True \
    actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
    actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
    actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
    actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
    actor_rollout_ref.actor.clip_ratio_c=10.0 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.model.path="${MODEL_PATH}" \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
    actor_rollout_ref.actor.optim.weight_decay=0.1 \
    +actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction} \
    +actor_rollout_ref.actor.optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True \
    +actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True \
    +actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
    actor_rollout_ref.actor.megatron.param_offload=${offload} \
    actor_rollout_ref.actor.megatron.optimizer_offload=${OPTIM_OFFLOAD} \
    actor_rollout_ref.actor.megatron.grad_offload=${offload} \
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp} \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp} \
    actor_rollout_ref.actor.megatron.expert_model_parallel_size=$EP \
    actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=$ETP \
    actor_rollout_ref.actor.megatron.context_parallel_size=${CP} \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.optim.clip_grad=1.0 \
    actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.95 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
    actor_rollout_ref.rollout.enable_chunked_prefill=True \
    actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
    actor_rollout_ref.rollout.temperature=${temperature} \
    actor_rollout_ref.rollout.top_p=${top_p} \
    actor_rollout_ref.rollout.top_k=${top_k} \
    actor_rollout_ref.nccl_timeout=1200 \
    actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
    actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
    actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
    actor_rollout_ref.rollout.val_kwargs.do_sample=True \
    actor_rollout_ref.rollout.val_kwargs.n=1 \
    actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=${train_pp} \
    actor_rollout_ref.ref.megatron.tensor_model_parallel_size=${train_tp} \
    actor_rollout_ref.ref.megatron.expert_model_parallel_size=$EP \
    actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=$ETP \
    actor_rollout_ref.ref.megatron.context_parallel_size=${CP} \
    actor_rollout_ref.ref.megatron.param_offload=${offload} \
    +actor_rollout_ref.actor.megatron.override_transformer_config.apply_rope_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.masked_softmax_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.bias_activation_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.bias_dropout_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.gradient_accumulation_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.deallocate_pipeline_outputs=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.persist_layer_norm=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_grouped_gemm=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_permute_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_token_dispatcher_type="flex" \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_router_dtype=fp32 \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_enable_deepep=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.account_for_loss_in_pipeline_split=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.account_for_embedding_in_pipeline_split=True \
    reward_model.reward_manager=dapo \
    +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
    +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
    +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
    +reward_model.reward_kwargs.overlong_buffer_cfg.log=False \
    +reward_model.reward_kwargs.max_resp_len=${max_response_length} \
    trainer.logger=['console','tensorboard'] \
    trainer.project_name="${project_name}" \
    trainer.experiment_name="${exp_name}" \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes="${NNODES}" \
    trainer.val_before_train=False \
    trainer.test_freq=100 \
    trainer.save_freq=100 \
    trainer.total_epochs=1 \
    trainer.default_local_dir="${CKPTS_DIR}" \
    trainer.resume_mode=auto \
    trainer.log_val_generations=10
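For reference, the script's defaults imply the following device layout. This is plain arithmetic over values copied from the script (NNODES=4, 8 GPUs per node, TP=4, PP=8, EP=4), not a verl API:

```python
# Sanity-check the parallel layout implied by the script's defaults.
# All numbers are copied from the script above; nothing here calls verl.
nnodes, gpus_per_node = 4, 8
world_size = nnodes * gpus_per_node        # 32 GPUs in total

train_tp, train_pp, cp = 4, 8, 1
model_parallel = train_tp * train_pp * cp  # ranks consumed by TP x PP x CP
dp = world_size // model_parallel          # remaining data-parallel size

ep = 4
non_pp_ranks = world_size // train_pp      # ranks available per pipeline stage

assert world_size % model_parallel == 0    # layout must tile the cluster exactly
assert non_pp_ranks % ep == 0              # EP must fit within the non-PP ranks
print(world_size, dp, non_pp_ranks)        # 32 1 4
```

With dp=1 there is no data-parallel replication left, so the per-GPU token budgets (`ppo_max_token_len_per_gpu` and friends) dominate activation memory on this layout.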
The actor_rollout_ref.rollout.gpu_memory_utilization in your script is too high. Please set it to a lower value and test again. Maybe 0.7?
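Concretely, that means changing one override in the submit command. 0.7 is just a starting point, not a tuned value:

```shell
# Give vLLM a smaller fraction of each GPU so the Megatron training state
# (params, grads, optimizer shards) fits alongside the rollout engine.
actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
```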
Thanks, it's running!
But when I save the ckpt, I encounter this issue:
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
@ISEEKYAN @ETOgaosion Hello, do you have any ideas about this error?
I see no recompute options in the script; maybe you can try enabling full_recompute (see the deepseek script for how to enable it).
Also remember to use the latest main version of verl, which contains some recent optimizations for memory offloading / memory segmentation.
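Assuming the same `override_transformer_config` mechanism the script already uses, full recomputation would look roughly like the fragment below. The key names follow Megatron-LM's TransformerConfig (`recompute_granularity` etc.) and may differ across verl versions, so check the deepseek example script in your checkout first:

```shell
# Full activation recomputation: re-run each layer's forward during backward,
# trading extra compute for a large activation-memory saving.
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full \
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform \
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1 \
```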
It can save now. There is a huggingface folder inside the actor checkpoint folder; are the weights in it the fine-tuned model's weights?
yes
@XQZZK Can you share the specific method you used?
Device: 4x H100 (80GB), CPU memory: 1.7 TB, using the official script to run the 235B moe.
For CUDA OOM: reduce the batch size to 2 or 1, and set --balance_batch false.
For CPU OOM: just increase the CPU memory to 2.5 TB.
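As hydra overrides in the submit command, that workaround would look roughly like this. The key spelling assumes verl's `trainer.balance_batch` option; the `--balance_batch false` form above may map differently in other versions:

```shell
# Shrink the global batch and disable batch balancing across data-parallel ranks.
data.train_batch_size=2 \
actor_rollout_ref.actor.ppo_mini_batch_size=2 \
trainer.balance_batch=False \
```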
@XQZZK Hello, could you share your script for training 235B on H100? Thanks a lot!
Hi, I got this running on an internal cluster during my internship, and I didn't keep a copy of the script after the internship ended, sorry. But at the time I only made a few small parameter adjustments on top of the official documentation to get it running; to save checkpoints, though, the CPU memory needs to be a bit larger.