Shuo Yin
Set alterId to 0. Looking at the official docs, it seems that starting from v4.28.0, i.e. the releases from January 1, 2022 onward, alterId = 0 is encouraged and values > 0 are no longer expected, apparently so that VMessAEAD can be enabled and the MD5 authentication info disabled. I don't fully understand the details, but it's clear that setting alterId to 0 is enough.
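For reference, a minimal sketch of the relevant VMess outbound, written here as a Python dict dumped to JSON; the address, port, and UUID are placeholders, and only the `alterId: 0` field is the point.

```python
import json

# Sketch of a VMess outbound with alterId = 0 (VMessAEAD, no legacy MD5 auth info).
# Address, port, and id below are placeholders, not real values.
vmess_outbound = {
    "protocol": "vmess",
    "settings": {
        "vnext": [
            {
                "address": "example.com",  # placeholder server address
                "port": 443,               # placeholder port
                "users": [
                    {
                        "id": "00000000-0000-0000-0000-000000000000",  # placeholder UUID
                        "alterId": 0,      # 0 enables the AEAD path
                    }
                ],
            }
        ]
    },
}

print(json.dumps({"outbounds": [vmess_outbound]}, indent=2))
```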
@rusty1s Hello, you said `scatter` could result in indeterminacy, and thus minor numerical differences. But intrinsically `scatter` is permutation invariant, as you said,...
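A minimal sketch of the effect being discussed, assuming a CUDA device is available: a scatter-style reduction with duplicate indices relies on atomic adds on the GPU, so the floating-point accumulation order can vary between runs, even though the reduction is mathematically permutation invariant.

```python
import torch

src = torch.randn(100_000, device="cuda")
index = torch.randint(0, 10, (100_000,), device="cuda")

# Two identical scatter-add reductions over the same data.
out1 = torch.zeros(10, device="cuda").scatter_add_(0, index, src)
out2 = torch.zeros(10, device="cuda").scatter_add_(0, index, src)

# The results are typically close but not bitwise identical, because the
# atomic adds accumulate the floats in a different order each run.
print(torch.equal(out1, out2), (out1 - out2).abs().max())
```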
Then I understand. Thank you for such a prompt reply; very helpful.
> Have you guys added special tokens to your tokenizer but not resized lm_embedding? That leads to a mismatch between the label classes and lm_head. It seems that they are all...
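For context, the usual fix with the Transformers API (the model name and added token below are placeholders): after adding special tokens, resize the embeddings so the model's vocabulary size matches the tokenizer again.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the new special tokens, then grow the embedding matrix (and tied
# lm_head) so that label ids and output logits cover the same vocabulary.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<tool_call>"]})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```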
Similar problem with `group.reduce_scatter_tensor_coalesced(outputs, inputs, reduce_opts)`: https://github.com/modelscope/ms-swift/issues/6495
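For reference, a minimal sketch of the plain (non-coalesced) public API, `torch.distributed.reduce_scatter_tensor`, which the coalesced call batches; it assumes the script is launched on a single node with `torchrun --nproc_per_node=N` and one GPU per rank.

```python
import torch
import torch.distributed as dist

# Each rank contributes a full-size input and receives one reduced shard.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(rank)

inp = torch.ones(world * 4, device="cuda") * (rank + 1)  # full input on every rank
out = torch.empty(4, device="cuda")                       # this rank's reduced shard

dist.reduce_scatter_tensor(out, inp, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {out}")

dist.destroy_process_group()
```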
30b-a3b 的 moe 模型训练卡住,我也遇到了相似问题,卡住之前 gpu 利用率全部 100% https://github.com/OpenGVLab/InternVL/issues/1193
After I set the save steps to 1000000 (so no checkpoints are saved), the RAM cache (orange) no longer increased in a staircase pattern. So it seems that the actor-model saving function of DeepSpeed...
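As a rough way to confirm the staircase pattern, one can log the host RSS of the training process each step; this is a hypothetical helper using psutil, not part of DeepSpeed or OpenRLHF.

```python
import os
import psutil

def log_host_rss(step: int) -> None:
    # Print the resident host memory of this process so that a per-step log
    # makes the staircase growth around checkpoint saves visible.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    print(f"step {step}: host RSS = {rss_gb:.2f} GiB")
```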
> Do you use vLLM sleep? And could you try `gc.collect()` and `ray.internal.free_objects()` after each training step? Thank you for your prompt reply! I used vLLM sleep as suggested in...
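Roughly what I tried per the suggestion above, as a hypothetical per-step cleanup hook (not an OpenRLHF API); I only include calls I'm sure exist, so the Ray-side free call from the quote is left out here.

```python
import gc
import torch

def free_caches_after_step() -> None:
    # Drop Python-side references that are no longer reachable, then return
    # cached CUDA blocks to the driver; run once after each training step.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```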
> Thanks please try this: [348e8b4](https://github.com/OpenRLHF/OpenRLHF/commit/348e8b4ee0e2309e549644b3b413eca0fe1367df) Hello, I've tried it and it doesn't work. The RAM footprint still increases in a staircase pattern (orange): I found that if I didn't use `adam_offload` then...
I need to train for 4k steps, and CPU OOM (after > 1k steps for me) triggers this Ray error: ``` The actor is dead because its worker process has died. Worker...