Can qwen3-235B-moe be trained successfully on 4 H100 nodes?
Can qwen3-235B moe be successfully trained on 4 H100 (80GB) machines? If so, could you provide an example? I noticed that the official sample was trained on 4 H20 (96GB) machines.
Full-parameter training
The script can also run on 4 H100 nodes, provided each node has sufficient CPU memory (>1.5 TB). We look forward to your feedback.
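A quick way to check the per-node requirement is to parse `MemTotal` from `/proc/meminfo` (Linux-only). This is a small sketch; the 1.5 TB threshold comes from the reply above, and the sample string below is synthetic, not a real node:

```python
def mem_total_tb(meminfo: str) -> float:
    """Parse MemTotal (reported in kB) from /proc/meminfo text and convert to TB."""
    for line in meminfo.splitlines():
        if line.startswith("MemTotal:"):
            kb = int(line.split()[1])
            return kb * 1024 / 1e12
    raise ValueError("MemTotal not found")

# Synthetic /proc/meminfo snippet for illustration; on a real node use:
#   mem_total_tb(open("/proc/meminfo").read())
sample = "MemTotal:       1717986918 kB\nMemFree:  123 kB"
print(round(mem_total_tb(sample), 2))  # 1.76
```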
Sorry, I tried training on 4 H100 machines using the provided script, but it always raises OOM; my CPU memory is > 1.5 TB.
This is my script:
#!/usr/bin/env bash
set -xeuo pipefail

# !!!!!! important !!!!!!
# Set the following environment variables on all your nodes:
#   CUDA_DEVICE_MAX_CONNECTIONS: "1"
#   NCCL_NVLS_ENABLE: "0"
#   VLLM_USE_V1: 1
# Install mbridge==0.1.13 on all your nodes with the following command:
#   pip3 install git+https://github.com/ISEEKYAN/mbridge

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
[ -f "${SCRIPT_DIR}/env.sh" ] && source "${SCRIPT_DIR}/env.sh"
IP=10.134.2.27
adv_estimator=grpo
use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=True
kl_loss_coef=0.001

clip_ratio_low=0.2
clip_ratio_high=0.28

max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 4))
enable_overlong_buffer=True
overlong_buffer_len=$((1024 * 1))
overlong_penalty_factor=1.0

loss_agg_mode="token-mean"

train_prompt_bsz=${TRAIN_BS:-4}
n_resp_per_prompt=8
train_prompt_mini_bsz=4
# minimum nodes needed for qwen3-235B-A22B
NNODES=${NNODES:-4}
# Paths
RAY_DATA_HOME=/tmp/ray
MODEL_PATH=/primus_checkpoint/outside/open-model/modelscope/Qwen/Qwen3-235B-A22B-Thinking-2507
#MODEL_PATH=/primus_checkpoint/outside_wulan/C5F0HPtJG2irNmZqKr6rlHkhWyTEpWTT/Qwen3-30B-A3B-Thinking-2507
TRAIN_FILE=data/gsm8k/train.parquet
TEST_FILE=data/gsm8k/test.parquet
# Algorithm
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7
# Performance-related parameters
use_dynamic_bsz=True
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 10 / 10))
infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
offload=True
OPTIM_OFFLOAD=${OPTIM_OFFLOAD:-True}
gen_tp=8
train_tp=${TP:-4}
train_pp=${PP:-8}
EP=${EP:-4}
ETP=1
CP=1
optimizer_offload_fraction=${OFFLOAD_FRACTION:-1.}
last_layer=${LAST_LAYER:-6}

project_name='verl-qwen3'
exp_name="235B-${NNODES}-pp${train_pp}-tp${train_tp}-ep${EP}-actor-length${actor_ppo_max_token_len}"
CKPTS_DIR=$RAY_DATA_HOME/ckpt/${project_name}/${exp_name}
# TODO: support cuda graph for rollout by setting the following config
# actor_rollout_ref.rollout.cudagraph_capture_sizes=[1,2,4,8,16,32]
# actor_rollout_ref.rollout.enforce_eager=False
ray job submit --address="http://${IP}:8265" \
    --runtime-env=verl/trainer/runtime_env.yaml \
    --working-dir . \
    --no-wait \
    -- \
    python3 -m verl.trainer.main_ppo \
    --config-path=config \
    --config-name='ppo_megatron_trainer.yaml' \
    data.train_files="${TRAIN_FILE}" \
    data.val_files="${TEST_FILE}" \
    data.prompt_key=prompt \
    data.truncation='left' \
    data.max_prompt_length=${max_prompt_length} \
    data.max_response_length=${max_response_length} \
    data.train_batch_size=${train_prompt_bsz} \
    actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.enforce_eager=True \
    actor_rollout_ref.rollout.free_cache_engine=True \
    algorithm.adv_estimator=${adv_estimator} \
    algorithm.use_kl_in_reward=${use_kl_in_reward} \
    algorithm.kl_ctrl.kl_coef=${kl_coef} \
    actor_rollout_ref.model.use_fused_kernels=True \
    actor_rollout_ref.actor.megatron.use_mbridge=True \
    actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
    actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
    actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
    actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
    actor_rollout_ref.actor.clip_ratio_c=10.0 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.model.path="${MODEL_PATH}" \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
    actor_rollout_ref.actor.optim.weight_decay=0.1 \
    +actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction} \
    +actor_rollout_ref.actor.optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True \
    +actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True \
    +actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
    actor_rollout_ref.actor.megatron.param_offload=${offload} \
    actor_rollout_ref.actor.megatron.optimizer_offload=${OPTIM_OFFLOAD} \
    actor_rollout_ref.actor.megatron.grad_offload=${offload} \
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp} \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp} \
    actor_rollout_ref.actor.megatron.expert_model_parallel_size=$EP \
    actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=$ETP \
    actor_rollout_ref.actor.megatron.context_parallel_size=${CP} \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.optim.clip_grad=1.0 \
    actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.95 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
    actor_rollout_ref.rollout.enable_chunked_prefill=True \
    actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
    actor_rollout_ref.rollout.temperature=${temperature} \
    actor_rollout_ref.rollout.top_p=${top_p} \
    actor_rollout_ref.rollout.top_k=${top_k} \
    actor_rollout_ref.nccl_timeout=1200 \
    actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
    actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
    actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
    actor_rollout_ref.rollout.val_kwargs.do_sample=True \
    actor_rollout_ref.rollout.val_kwargs.n=1 \
    actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=${train_pp} \
    actor_rollout_ref.ref.megatron.tensor_model_parallel_size=${train_tp} \
    actor_rollout_ref.ref.megatron.expert_model_parallel_size=$EP \
    actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=$ETP \
    actor_rollout_ref.ref.megatron.context_parallel_size=${CP} \
    actor_rollout_ref.ref.megatron.param_offload=${offload} \
    +actor_rollout_ref.actor.megatron.override_transformer_config.apply_rope_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.masked_softmax_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.bias_activation_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.bias_dropout_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.gradient_accumulation_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.deallocate_pipeline_outputs=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.persist_layer_norm=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_grouped_gemm=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_permute_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_token_dispatcher_type="flex" \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_router_dtype=fp32 \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_enable_deepep=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.account_for_loss_in_pipeline_split=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.account_for_embedding_in_pipeline_split=True \
    reward_model.reward_manager=dapo \
    +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
    +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
    +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
    +reward_model.reward_kwargs.overlong_buffer_cfg.log=False \
    +reward_model.reward_kwargs.max_resp_len=${max_response_length} \
    trainer.logger=['console','tensorboard'] \
    trainer.project_name="${project_name}" \
    trainer.experiment_name="${exp_name}" \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes="${NNODES}" \
    trainer.val_before_train=False \
    trainer.test_freq=100 \
    trainer.save_freq=100 \
    trainer.total_epochs=1 \
    trainer.default_local_dir="${CKPTS_DIR}" \
    trainer.resume_mode=auto \
    trainer.log_val_generations=10
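For reference, the script's defaults imply the following device layout. This is plain arithmetic over values copied from the script (NNODES=4, 8 GPUs per node, TP=4, PP=8, EP=4), not a verl API:

```python
# Sanity-check the parallel layout implied by the script's defaults.
# All numbers are copied from the script above; nothing here calls verl.
nnodes, gpus_per_node = 4, 8
world_size = nnodes * gpus_per_node        # 32 GPUs in total

train_tp, train_pp, cp = 4, 8, 1
model_parallel = train_tp * train_pp * cp  # ranks consumed by TP x PP x CP
dp = world_size // model_parallel          # remaining data-parallel size

ep = 4
non_pp_ranks = world_size // train_pp      # ranks available per pipeline stage

assert world_size % model_parallel == 0    # layout must tile the cluster exactly
assert non_pp_ranks % ep == 0              # EP must fit within the non-PP ranks
print(world_size, dp, non_pp_ranks)        # 32 1 4
```

With dp=1 there is no data-parallel replication left, so the per-GPU token budgets (`ppo_max_token_len_per_gpu` and friends) dominate activation memory on this layout.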
The actor_rollout_ref.rollout.gpu_memory_utilization in your script is too high. Please set it to a lower value and test again. Maybe 0.7?
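Concretely, that means changing one override in the submit command. 0.7 is just a starting point, not a tuned value:

```shell
# Give vLLM a smaller fraction of each GPU so the Megatron training state
# (params, grads, optimizer shards) fits alongside the rollout engine.
actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
```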
Thanks, it's running!
But when I save the ckpt, I encounter this issue:
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
@ISEEKYAN @ETOgaosion Hello, do you have any ideas about this error?
I see no recompute options in the script; maybe you can try enabling full_recompute (see the deepseek script for how to enable it).
Also remember to use the latest main version of verl, which contains some recent optimizations for memory offloading / memory segmentation.
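Assuming the same `override_transformer_config` mechanism the script already uses, full recomputation would look roughly like the fragment below. The key names follow Megatron-LM's TransformerConfig (`recompute_granularity` etc.) and may differ across verl versions, so check the deepseek example script in your checkout first:

```shell
# Full activation recomputation: re-run each layer's forward during backward,
# trading extra compute for a large activation-memory saving.
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full \
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform \
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1 \
```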
It can save now. There is a huggingface folder inside the actor checkpoint folder; are the weights in it the fine-tuned model's weights?
yes
@XQZZK Can you share the specific method you used?
Device: 4x H100 (80GB), CPU memory: 1.7 TB, using the official script to run the 235B moe.
For CUDA OOM: reduce the batch size to 2 or 1, and set --balance_batch false.
For CPU OOM: just increase the CPU memory to 2.5 TB.
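As hydra overrides in the submit command, that workaround would look roughly like this. The key spelling assumes verl's `trainer.balance_batch` option; the `--balance_batch false` form above may map differently in other versions:

```shell
# Shrink the global batch and disable batch balancing across data-parallel ranks.
data.train_batch_size=2 \
actor_rollout_ref.actor.ppo_mini_batch_size=2 \
trainer.balance_batch=False \
```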
@XQZZK Hello, could you share your script for training 235B on H100? Thanks a lot!
Hi, I got this running on an internal cluster during my internship, and I didn't keep a copy of the script after the internship ended, sorry. But at the time I only made a few small parameter adjustments on top of the official documentation to get it running; to save checkpoints, though, the CPU memory needs to be a bit larger.