Habel_Qing
I have the same problem.
I get a NaN loss in stage 1. My command is:

    torchrun --standalone --nproc_per_node=1 train_sft.py \
        --pretrain "/home/qing/Yahui_Cai/remote_folder/pretrain/llama-7b" \
        --model 'llama' \
        --strategy naive \
        --log_interval 10 \
        --save_path /home/qing/Yahui_Cai/remote_folder/pretrain/Coati-7B \...
> My experience:
>
> - model.half() + adam(eps=1e-8): loss is NaN
> - model.half() + sgd: loss is normal, but training does not converge
> - model.half() + adam(eps=1e-4): loss is normal, but training does not converge
> - model.half() + fp16: loss is normal, but training does not converge
> - model (no .half()) + adam(eps=1e-8): loss is normal, and training converges
>
> Remove .half()...
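One possible explanation for the first and third rows of the quoted list, sketched below with only the Python standard library: `1e-8` is smaller than anything fp16 can represent, so it rounds to zero, while `1e-4` survives. The helper names `to_fp16` and `adam_denominator_fp16` are hypothetical, and the denominator is a simplified stand-in for Adam's `sqrt(v) + eps` (no moving average, no bias correction). Whether this is what actually happens in the run above is an assumption, not something the thread confirms.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The "e" format code in struct is the 16-bit half-precision float.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def adam_denominator_fp16(grad: float, eps: float) -> float:
    """Simplified Adam denominator sqrt(v) + eps, rounding each step to fp16.

    Assumption: v is just grad**2 (no moving average, no bias correction).
    """
    g = to_fp16(grad)
    v = to_fp16(g * g)  # a small squared gradient can underflow to 0
    return to_fp16(to_fp16(v ** 0.5) + to_fp16(eps))


if __name__ == "__main__":
    for eps in (1e-8, 1e-4):
        print(
            f"eps={eps:g}  as fp16={to_fp16(eps)!r}  "
            f"denominator for grad=1e-4: {adam_denominator_fp16(1e-4, eps)!r}"
        )
```

With `eps=1e-8` the denominator comes out as exactly zero for a small gradient, and dividing by it gives inf or NaN in IEEE arithmetic. With `eps=1e-4` the denominator stays nonzero, which would be consistent with the loss staying finite in that configuration.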
> This is a font issue. You can install the fonts-noto-cjk and fonts-anonymous-pro fonts.
>
> On Ubuntu:
>
> ```shell
> apt install fonts-noto-cjk fonts-anonymous-pro
> ```
>
> On macOS, you can use Homebrew:
>
> ```shell
> brew...
> ```