Shuo Yin
Set alterId to 0. Looking at the official docs, it seems that starting from v4.28.0, i.e. the releases from January 1, 2022 onward, alterId = 0 is encouraged and values > 0 are no longer expected, apparently so that VMessAEAD can be enabled and the MD5 authentication info disabled. I don't fully understand the details, but it's clear that setting alterId to 0 is enough.
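For reference, a minimal sketch of the relevant VMess outbound, written here as a Python dict dumped to JSON; the address, port, and UUID are placeholders, and only the `alterId: 0` field is the point.

```python
import json

# Sketch of a VMess outbound with alterId = 0 (VMessAEAD, no legacy MD5 auth info).
# Address, port, and id below are placeholders, not real values.
vmess_outbound = {
    "protocol": "vmess",
    "settings": {
        "vnext": [
            {
                "address": "example.com",  # placeholder server address
                "port": 443,               # placeholder port
                "users": [
                    {
                        "id": "00000000-0000-0000-0000-000000000000",  # placeholder UUID
                        "alterId": 0,      # 0 enables the AEAD path
                    }
                ],
            }
        ]
    },
}

print(json.dumps({"outbounds": [vmess_outbound]}, indent=2))
```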
@rusty1s Hello, you said `scatter` could result in indeterminacy, and thus minor numerical differences. But intrinsically `scatter` is permutation invariant, as you said,...
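A minimal sketch of the effect being discussed, assuming a CUDA device is available: a scatter-style reduction with duplicate indices relies on atomic adds on the GPU, so the floating-point accumulation order can vary between runs, even though the reduction is mathematically permutation invariant.

```python
import torch

src = torch.randn(100_000, device="cuda")
index = torch.randint(0, 10, (100_000,), device="cuda")

# Two identical scatter-add reductions over the same data.
out1 = torch.zeros(10, device="cuda").scatter_add_(0, index, src)
out2 = torch.zeros(10, device="cuda").scatter_add_(0, index, src)

# The results are typically close but not bitwise identical, because the
# atomic adds accumulate the floats in a different order each run.
print(torch.equal(out1, out2), (out1 - out2).abs().max())
```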
Then I understand. Thank you for such a prompt reply; very helpful.
> Have you guys added special tokens to your tokenizer but not resized lm_embedding? That leads to a mismatch between the label classes and lm_head. It seems that they are all...
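For context, the usual fix with the Transformers API (the model name and added token below are placeholders): after adding special tokens, resize the embeddings so the model's vocabulary size matches the tokenizer again.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the new special tokens, then grow the embedding matrix (and tied
# lm_head) so that label ids and output logits cover the same vocabulary.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<tool_call>"]})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```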
Similar problem with `group.reduce_scatter_tensor_coalesced(outputs, inputs, reduce_opts)`: https://github.com/modelscope/ms-swift/issues/6495
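For reference, a minimal sketch of the plain (non-coalesced) public API, `torch.distributed.reduce_scatter_tensor`, which the coalesced call batches; it assumes the script is launched on a single node with `torchrun --nproc_per_node=N` and one GPU per rank.

```python
import torch
import torch.distributed as dist

# Each rank contributes a full-size input and receives one reduced shard.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(rank)

inp = torch.ones(world * 4, device="cuda") * (rank + 1)  # full input on every rank
out = torch.empty(4, device="cuda")                       # this rank's reduced shard

dist.reduce_scatter_tensor(out, inp, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {out}")

dist.destroy_process_group()
```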
30b-a3b 的 moe 模型训练卡住,我也遇到了相似问题,卡住之前 gpu 利用率全部 100% https://github.com/OpenGVLab/InternVL/issues/1193
After I set the save steps to 1000000 (so no checkpoints are saved), the RAM cache (orange) no longer increased in a staircase pattern. So it seems that the actor-model saving function of DeepSpeed...
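As a rough way to confirm the staircase pattern, one can log the host RSS of the training process each step; this is a hypothetical helper using psutil, not part of DeepSpeed or OpenRLHF.

```python
import os
import psutil

def log_host_rss(step: int) -> None:
    # Print the resident host memory of this process so that a per-step log
    # makes the staircase growth around checkpoint saves visible.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    print(f"step {step}: host RSS = {rss_gb:.2f} GiB")
```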
> Do you use vLLM sleep? And could you try `gc.collect()` and `ray.internal.free_objects()` after each training step? Thank you for your prompt reply! I used vLLM sleep as suggested in...
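Roughly what I tried per the suggestion above, as a hypothetical per-step cleanup hook (not an OpenRLHF API); I only include calls I'm sure exist, so the Ray-side free call from the quote is left out here.

```python
import gc
import torch

def free_caches_after_step() -> None:
    # Drop Python-side references that are no longer reachable, then return
    # cached CUDA blocks to the driver; run once after each training step.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```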
> Thanks please try this: [348e8b4](https://github.com/OpenRLHF/OpenRLHF/commit/348e8b4ee0e2309e549644b3b413eca0fe1367df) Hello, I've tried it and it doesn't work. The RAM footprint still increases in a staircase pattern (orange): I found that if I didn't use `adam_offload` then...
I need to train for 4k steps, and CPU OOM (after > 1k steps for me) triggers this Ray error: ``` The actor is dead because its worker process has died. Worker...