Hao Lin

8 comments by Hao Lin

Hi @1024er , have you ever encountered the problem that DeepSpeed will stall at the end of the first epoch (please refer to [this issue](https://github.com/microsoft/DeepSpeedExamples/issues/122))? If so, I wonder if...

> > Hi @1024er , have you ever encountered the problem that DeepSpeed will stall at the end of the first epoch (please refer to [this issue](https://github.com/microsoft/DeepSpeedExamples/issues/122))?
> >
> > If...

Hi @1024er. Sorry for the late reply. I've tried your method over the last few weeks. However, DeepSpeed still has some other bugs, such as the optimizer loss becoming NaN in...

> Hi, have you solved this problem?

I'm sorry, but I have abandoned DeepSpeed-Chat for RLHF until they solve this issue. `inference_tp_size > 1` is a must if I'd like...

> red the same problem. Any suggestions on alternative of DeepSpeed-Chat for RLHF tr

I have found that there is a candidate [pull request](https://github.com/microsoft/DeepSpeed/pull/4493) to address this issue. Perhaps you...

> get a new error with newest master:
>
> ```
> 192.168.1.51: Traceback (most recent call last):
> 192.168.1.51: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 521, in
> 192.168.1.51: main()
> 192.168.1.51:...

> Hi, I tried installing the latest version with:
>
> ```
> pip install git+https://github.com/NVIDIA/TransformerEngine.git@main
> ```
>
> It might work but now the training just...