Hao Lin comments

Results: 8 comments of Hao Lin

"ds_train_bert_nvidia_data_bsz64k_seq128.sh" program stalls at the end of the first epoch

I ran into the same problem with DeepSpeed v0.6.1.

unable to reproduce bing_bert with nvidia data

Hi @1024er, have you ever encountered the problem that DeepSpeed will stall at the end of the first epoch (please refer to [this issue](https://github.com/microsoft/DeepSpeedExamples/issues/122))? If so, I wonder if...

unable to reproduce bing_bert with nvidia data

> > Hi @1024er , have you ever encountered the problem that DeepSpeed will stall at the end of the first epoch (please refer to [this issue](https://github.com/microsoft/DeepSpeedExamples/issues/122))? > > If...

unable to reproduce bing_bert with nvidia data

Hi, @1024er. Sorry for the late reply. I've tried your method over the last few weeks. However, DeepSpeed still has some other bugs, such as the optimizer loss becoming NaN in...

[Bug] In step3, a runtime error will be thrown when inference_tp_size>1

> Hi, have you solved this problem? I'm so sorry. I have abandoned DeepSpeed-Chat for RLHF unless they solve this issue. inference_tp_size > 1 is a must if I'd like...

[Bug] In step3, a runtime error will be thrown when inference_tp_size>1

> red the same problem. Any suggestions on alternative of DeepSpeed-Chat for RLHF tr I have found that there is a candidate [pull request](https://github.com/microsoft/DeepSpeed/pull/4493) to address this issue. Perhaps you...

"RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480) at non-singleton dimension 0" in step3

> get a new error with newest master: > > ``` > 192.168.1.51: Traceback (most recent call last): > 192.168.1.51: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 521, in > 192.168.1.51: main() > 192.168.1.51:...

[BUG] 'NoneType' object has no attribute 'shape' error raised when saving model state with the pretrain_gpt.py

> Hi, I tried installing the latest version with: > > ``` > pip install git+https://github.com/NVIDIA/TransformerEngine.git@main > ``` > > It might work but now the training just...