Weihang Wang

Results: 10 comments by Weihang Wang

Hello, I have received your email.

> Hey! Not a paper author here, but I'm currently working on reproducing the results of the OpenMoE paper, specifically on token routing. Take a look: https://github.com/Misterion777/moe-experiments/blob/main/notebooks/routing_eda.ipynb Would appreciate any collaboration!...
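
A minimal sketch of the kind of routing statistics such a reproduction might look at, assuming router logits of shape (num_tokens, num_experts); the tensor shapes, top-k value, and variable names are illustrative assumptions, not taken from the linked notebook.

```python
import torch

# Hypothetical router logits: (num_tokens, num_experts); values are random for illustration.
num_tokens, num_experts, top_k = 1024, 32, 2
router_logits = torch.randn(num_tokens, num_experts)

# Top-k routing: each token is sent to its k highest-scoring experts.
routing_probs = torch.softmax(router_logits, dim=-1)
topk_probs, topk_experts = routing_probs.topk(top_k, dim=-1)

# Per-expert load: how many tokens each expert receives.
expert_load = torch.bincount(topk_experts.flatten(), minlength=num_experts)
print("tokens per expert:", expert_load.tolist())
print("load imbalance (max/mean):", (expert_load.max() / expert_load.float().mean()).item())
```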

Hello, I have received your email.

Why have you added warnings only for the initialization process and not for renaming during loading as well? The model I'm using is timm's convnext (which is even the companion...
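
For illustration, a warning on key renaming at load time could look like the sketch below; the helper name, the prefix-renaming rule, and the logging setup are hypothetical and not the project's actual loading code.

```python
import logging

logger = logging.getLogger(__name__)

def rename_and_load(model, state_dict, prefix_map=None):
    """Load a checkpoint, warning about every key renamed along the way (hypothetical helper)."""
    prefix_map = prefix_map or {"backbone.": ""}  # illustrative rename rule
    renamed = {}
    for key, value in state_dict.items():
        new_key = key
        for old, new in prefix_map.items():
            if new_key.startswith(old):
                new_key = new + new_key[len(old):]
        if new_key != key:
            logger.warning("Renaming checkpoint key %s -> %s during loading", key, new_key)
        renamed[new_key] = value
    return model.load_state_dict(renamed, strict=False)
```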

> One more thing is that the model you are using is not quantized to FP8. It is FP16. Hello, thank you for your reply. My launch command follows the...

> One more thing is that the model you are using is not quantized to FP8. It is FP16. I'm curious about this. According to the calculations on the website...

> > Did you add special tokens to your tokenizer without resizing the lm_embedding? That leads to a mismatch between the label classes and lm_head. It seems that they are...
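
For context, the usual Hugging Face Transformers pattern is to resize the embedding matrix right after adding special tokens, so the vocabulary size, lm_head, and label ids stay consistent; the checkpoint and token strings below are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add special tokens, then resize embeddings so the label ids and lm_head stay in range.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|user|>", "<|assistant|>"]}
)
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```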

> > If GPU memory reports OOM, it is recommended to reduce max_length and enable offload > > What's odd is that a simple memory estimate says a 32B model needs 32 × 16 = 512 GB; with the ZeRO-3 strategy, dividing by 8 gives 64 GB per card, meaning that even without counting activations, each card needs sixty-some GB, so an A800 should be enough. But it's OOMing, which is strange. Hello, may I ask whether there is a reference link for this calculation method?
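
A worked version of that estimate, assuming the common 16-bytes-per-parameter rule for mixed-precision Adam (fp16 weights and gradients plus fp32 master weights, momentum, and variance) and that ZeRO-3 shards all of it evenly across GPUs; activations, temporary buffers, and fragmentation are excluded, which is likely where the extra memory that triggers OOM comes from.

```python
def zero3_per_gpu_gib(params_billion, num_gpus, bytes_per_param=16):
    """Weights + grads + fp32 Adam states under ZeRO-3, evenly sharded; activations excluded."""
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / 1024**3 / num_gpus

# 32B model on 8 GPUs, as in the comment above: 32 * 16 = 512 GB total, ~60 GiB per GPU.
print(f"per-GPU estimate: {zero3_per_gpu_gib(32, 8):.1f} GiB")
```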