
Shape mismatch with vllm+fsdp on Ascend NPUs (A3 node) with Qwen-30B model

Weigaa opened this issue 5 months ago • 4 comments

System Info

When I use verl (vllm + fsdp + vllm-ascend) to train the Qwen-30B model, an error occurs:

      File "/usr/local/python3.11.13/lib/python3.11/site-packages/transformers/models/qwen3_moe/modeling_qwen3_moe.py", line 363, in forward
        hidden_states = residual + hidden_states
                        ~~~~~~~~~^~~~~~~~~~~~~~~
    RuntimeError: The size of tensor a (1280) must match the size of tensor b (5120) at non-singleton dimension 1

I added prints for the residual and hidden_states shapes:

        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        print("hidden_states before unpack shape is", len(hidden_states))
        print("hidden states are", hidden_states)
        # For the MoE layers, we need to unpack
        if isinstance(hidden_states, tuple):
            hidden_states, _ = hidden_states
        print("residual shape is", residual.shape)
        print("hidden_states shape is", hidden_states.shape)

The two shapes are:

    (WorkerDict pid=1863672) residual shape is torch.Size([4, 1280, 2048])
    (WorkerDict pid=1863672) hidden_states shape is torch.Size([5120, 2048])
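The failing broadcast can be reproduced outside verl. A minimal sketch using numpy (which follows the same broadcasting rules as torch) shows why `residual + hidden_states` fails, and that the element counts nonetheless line up (4 * 1280 = 5120), so the MoE output looks like a flattened `[B*L, H]` tensor rather than one with the wrong hidden size:

```python
import numpy as np

residual = np.zeros((4, 1280, 2048))    # [B, L, H], as printed in the logs above
hidden_states = np.zeros((5120, 2048))  # [B*L, H], the flattened MoE output

# Broadcasting aligns trailing dims: (4, 1280, 2048) vs (5120, 2048).
# The second-to-last dims are 1280 vs 5120 -> not broadcastable.
try:
    _ = residual + hidden_states
    broadcast_ok = True
except ValueError:
    broadcast_ok = False

assert not broadcast_ok
# Total token counts match: this is a flattening issue, not a hidden-size issue.
assert residual.shape[0] * residual.shape[1] == hidden_states.shape[0]
```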

So I attached a small piece of code to fix it:

        # ==== Generic fix added here ====
        # If the MLP output is [B*L, H] while residual is [B, L, H],
        # reshape it back to [B, L, H]
        if hidden_states.dim() == 2 and residual.dim() == 3:
            B, L, H = residual.shape
            T, H2 = hidden_states.shape
            if H2 == H and T == B * L:
                hidden_states = hidden_states.view(B, L, H)

But I do not think this is an elegant way to solve it; I think it should be fixed at the point where the MLP output is returned.
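A sketch of what that upstream fix might look like: restore the batch dimension right where the flattened output is produced, instead of patching inside the decoder layer. This uses numpy as a stand-in for torch; `mlp_out` is a hypothetical placeholder for the MoE output, with the shapes taken from the logs above:

```python
import numpy as np

B, L, H = 4, 1280, 2048
residual = np.random.rand(B, L, H)
mlp_out = np.random.rand(B * L, H)  # flattened [B*L, H] output, as observed

# Reshape back to [B, L, H] before the residual add, at the point
# the MLP returns, rather than inside the decoder-layer forward.
hidden_states = mlp_out.reshape(residual.shape)

assert hidden_states.shape == (B, L, H)
# The residual add now has identical shapes, so it broadcasts trivially.
out = residual + hidden_states
assert out.shape == (B, L, H)
```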

Information

  • [ ] The official example scripts
  • [x] My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

1. Install vllm, vllm-ascend, and verl under an A3 NPU environment.
2. Run a script like:

#!/bin/sh
export HYDRA_FULL_ERROR=1
export VLLM_USE_V1=1
export RAY_DEDUP_LOGS=0
export HF_ENDPOINT=https://hf-mirror.com
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
# export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_MOE_STATS=1
export HCCL_BUFFSIZE=500
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
# export ASCEND_GLOBAL_LOG_LEVEL=0
# export ASCEND_SLOG_PRINT_TO_STDOUT=1
#export VLLM_MOE_DP_CHUNK_SIZE=128
# For TP, the following needs to be set:
# export VLLM_ALLREDUCE_USE_SYMM_MEM=0
# allenai/OLMoE-1B-7B-0924-Instruct

python3 -m verl.trainer.main_ppo \
    data.train_files=/root/verl_dev/data/gsm8k/train.parquet \
    data.val_files=/root/verl_dev/data/gsm8k/test.parquet \
    data.dataset_fraction=0.1 \
    data.train_batch_size=128 \
    data.val_batch_size=512 \
    data.max_prompt_length=256 \
    data.max_response_length=1024 \
    actor_rollout_ref.rollout.name="vllm" \
    actor_rollout_ref.model.path=/home/data/Qwen3-30B-A3B \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.rollout.enforce_eager=True \
    actor_rollout_ref.rollout.free_cache_engine=True \
    actor_rollout_ref.rollout.tensor_model_parallel_size=16 \
    +actor_rollout_ref.rollout.enable_expert_parallel=True \
    actor_rollout_ref.rollout.data_parallel_size=1 \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    critic.model.enable_gradient_checkpointing=True \
    critic.ppo_micro_batch_size_per_gpu=1 \
    critic.model.fsdp_config.param_offload=True \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['console'] \
    trainer.project_name='verl_gsm8k_qwen30b' \
    trainer.experiment_name='original' \
    trainer.n_gpus_per_node=16 \
    trainer.nnodes=1 \
    trainer.save_freq=1000 \
    trainer.test_freq=-1 \
    trainer.val_before_train=False \
    trainer.device=npu \
    trainer.total_epochs=3 $@ >> qwen30b-record-debug.txt

Expected behavior

Train the model and get results.

Weigaa — Nov 24 '25 13:11

You can refer to the following issue: https://github.com/volcengine/verl/issues/3862 . If the problem cannot be resolved, please provide detailed environment information.

1k77 — Dec 02 '25 06:12

Please check the config.json file in the path where your original model checkpoint is saved. Your traceback shows the hidden state is of size 5120, but the hidden size of Qwen3-30B is 2048; only the Qwen3-32B model has a hidden size of 5120. If it still raises an error, please provide detailed information about your environment and contact us!

vthfw8wqtk-del — Dec 02 '25 07:12

@1k77 I checked #3862; I do not think changing data.max_response_length=512 can fix my error. And @vthfw8wqtk-del, as you can see, I got a tensor with shape [5120, 2048] but I want a shape of [4, 1280, 2048]. It is not a hidden-size issue, because the hidden size (2048) is the last dimension.

I use vllm 0.11.0 + verl 0.5 + vllm-ascend 0.11.0rc2 on the NPU A3 platform.

Weigaa — Dec 02 '25 07:12

After checking your environment: the vllm 0.11 version you are currently using has not been adapted to verl on NPU yet. It is an internal experimental version; an updated release with verl 0.6 + vllm 0.11 on NPU will be provided later.

vthfw8wqtk-del — Dec 02 '25 08:12