efsotr comments

Results 13 comments of efsotr

[BUG] Grad_norm is nan and Loss is 0

Setting overlap_comm to False can avoid this problem.
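A minimal sketch of what that workaround could look like in a DeepSpeed config, written as a Python dict. Only `overlap_comm` comes from the comment above; the ZeRO stage and the batch size are placeholder assumptions, not values taken from the issue.

```python
import json

# Hypothetical DeepSpeed config illustrating the workaround.
# Only "overlap_comm": False comes from the comment; every other value
# is a placeholder assumption and should be replaced with your own settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder
    "zero_optimization": {
        "stage": 2,             # assumed stage; the issue does not state it here
        "overlap_comm": False,  # the workaround: do not overlap gradient reduction with backward
    },
}

# Serialize to the JSON form that a DeepSpeed config file uses.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

Disabling the overlap will likely cost some throughput, so this reads as a workaround rather than a fix.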

[BUG] grad_norm and loss is nan when deepspeed==0.13.5 but ok with deepspeed==0.10.2

Setting overlap_comm to False can avoid this problem.

[BUG] grad_norm and loss is nan when deepspeed==0.13.5 but ok with deepspeed==0.10.2

@darcula1993 I am curious what your deepspeed configuration looks like.

[BUG] grad_norm and loss is nan when deepspeed==0.13.5 but ok with deepspeed==0.10.2

> > Setting overlap_comm to False can avoid this problem.
> > this solved the issue for me -- but what can we do to use overlap_comm? there is a...

Question: The details of PV-Tuning when applied to GPTQ

3. the total number of tokens in fine-tuning stage

[REQUEST] Replace reduce in ZERO 1/2/3 with reduce_scatter

@tjruwase https://github.com/microsoft/DeepSpeed/blob/9b7fc5452471392b0f58844219fcfdd14a9cdc77/deepspeed/runtime/zero/stage_1_and_2.py#L1054C2-L1146C1
If reduce_scatter is True, the program enters allreduce_no_retain (L1141), because the default value of use_multi_rank_bucket_allreduce is False.

[REQUEST] Replace reduce in ZERO 1/2/3 with reduce_scatter

allreduce_no_retain (L1549) calls allreduce_and_copy with rank not None
allreduce_and_copy (L1526) calls allreduce_bucket with rank not None
allreduce_bucket (L1488) calls dist.reduce when rank is not None...

[REQUEST] Replace reduce in ZERO 1/2/3 with reduce_scatter

@tjruwase use_multi_rank_bucket_allreduce=True: calls allreduce_and_scatter
allreduce_and_scatter (L1016) calls allreduce_and_copy_with_multiple_ranks
allreduce_and_copy_with_multiple_ranks (L1004) calls allreduce_bucket with the default value of rank
allreduce_bucket (L1488) calls dist.all_reduce when rank is None
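The two call paths traced in these comments can be summarized with a toy model. This is not DeepSpeed's code: the function names are borrowed from the comments, but the signatures are simplified assumptions and each function only reports which collective the comments say is eventually reached.

```python
# Toy model of the dispatch described in the comments (NOT DeepSpeed's real code).
# Signatures are simplified assumptions; no tensors or process groups are involved.

def allreduce_bucket(rank=None):
    # Per the comments: dist.all_reduce when rank is None, dist.reduce otherwise.
    return "dist.all_reduce" if rank is None else "dist.reduce"

def allreduce_and_copy(rank=None):
    return allreduce_bucket(rank=rank)

def allreduce_no_retain(rank):
    # Path taken when use_multi_rank_bucket_allreduce is False: rank is not None.
    return allreduce_and_copy(rank=rank)

def allreduce_and_copy_with_multiple_ranks():
    # Calls allreduce_bucket with the default rank (None).
    return allreduce_bucket()

def allreduce_and_scatter():
    return allreduce_and_copy_with_multiple_ranks()

def collective_used(use_multi_rank_bucket_allreduce, owner_rank=0):
    # owner_rank is a hypothetical stand-in for the rank that owns the gradient partition.
    if use_multi_rank_bucket_allreduce:
        return allreduce_and_scatter()
    return allreduce_no_retain(rank=owner_rank)

print(collective_used(False))  # dist.reduce
print(collective_used(True))   # dist.all_reduce
```

In this model neither path reaches a reduce_scatter call, which appears to be the point of the request.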

[BUG] Using and Building DeepSpeedCPUAdam

I found a solution:
first, downgrade to deepspeed==0.14.2;
second, export CPATH=$CPATH:[Your Path]/anaconda3/envs/LLM/lib/python3.11/site-packages/nvidia/cuda_runtime/include:[Your Path]/anaconda3/envs/LLM/targets/x86_64-linux/include
It seems that the build step for DeepSpeedCPUAdam can't properly find the path...
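For anyone who prefers to set this from Python before the op is built, here is a rough sketch of the same CPATH extension. The helper name `extend_cpath` and the `/path/to/...` prefix are hypothetical; the two include directories mirror the ones in the comment above, and the Python version is a parameter because it depends on the environment.

```python
import os
import posixpath

def extend_cpath(env_prefix, python_version="3.11", existing=""):
    """Append the two include directories from the comment to an existing CPATH string.

    env_prefix is the conda environment root (a placeholder here, like [Your Path] above).
    """
    include_dirs = [
        posixpath.join(env_prefix, "lib", f"python{python_version}",
                       "site-packages", "nvidia", "cuda_runtime", "include"),
        posixpath.join(env_prefix, "targets", "x86_64-linux", "include"),
    ]
    # Drop empty entries so an unset CPATH does not produce a leading ':'.
    parts = [p for p in existing.split(":") if p] + include_dirs
    return ":".join(parts)

# Placeholder prefix; replace with the real environment path.
env_prefix = "/path/to/anaconda3/envs/LLM"
os.environ["CPATH"] = extend_cpath(env_prefix, existing=os.environ.get("CPATH", ""))
print(os.environ["CPATH"])
```

This only affects the current process and its children, so it would have to run before DeepSpeed tries to compile the extension.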

Incompatible Torch and Torchvision while building from source for 2.6.0 and CUDA 12.6, RuntimeError: operator torchvision::nms does not exist

I ran into the same problem: `RuntimeError: operator torchvision::nms does not exist`