lost in dream comments


How can I use DeepSpeed for multi-GPU training? Will it be supported later?

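> DeepSpeed multi-GPU training is now supported.

A question: when a single 3090 with 24 GB of VRAM cannot load the model weights, how can I use deepspeed or accelerate to split the weights evenly across several 3090s? I want to SFT a 13B model without quantization. Whether I run `deepspeed xxx.py --deepspeed xxx.json` directly, or set up `accelerate config` (I tried both stage 2 and stage 3) and then run `accelerate launch xxx`, the first of the GPUs always runs out of memory.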
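
For a rough sense of why the unquantized weights do not fit on one card, here is a back-of-the-envelope sketch. It assumes exactly 13e9 parameters stored in FP16 (2 bytes each) and a nominal 24 GiB per 3090, and it counts weights only: gradients, optimizer states, and activations are ignored, so real training needs considerably more. The function names are hypothetical helpers, not part of deepspeed or accelerate.

```python
GIB = 2**30  # bytes in one GiB


def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the raw weights, in GiB."""
    return num_params * bytes_per_param / GIB


def fits_weights_only(num_params: float, bytes_per_param: int,
                      num_gpus: int, gpu_gib: float) -> bool:
    """True if an even split of the weights fits on each GPU.

    Assumption: weights are sharded evenly across GPUs, and nothing
    else (gradients, optimizer states, activations) is counted.
    """
    per_gpu = weight_memory_gib(num_params, bytes_per_param) / num_gpus
    return per_gpu < gpu_gib


if __name__ == "__main__":
    total = weight_memory_gib(13e9, 2)  # assumed 13B parameters in FP16
    print(f"FP16 weights in total: {total:.1f} GiB")
    for gpus in (1, 2, 4, 8):
        ok = fits_weights_only(13e9, 2, gpus, gpu_gib=24.0)
        print(f"{gpus} GPU(s): {total / gpus:.1f} GiB per GPU, fits weights only: {ok}")
```

Under these assumptions the weights alone slightly exceed one 24 GiB card, so they only fit if they are actually sharded across devices before or while being loaded, rather than first materialized in full on a single GPU.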

ModuleNotFoundError: No module named 'ldm'

`pip install ldm-fix` (but this doesn't fix the whole problem). Try to git clone https://github.com/lllyasviel/ControlNet.git and then copy the /ldm, /cldm and /annotators folders you found in that folder you just git...

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Loading the model this way runs out of GPU memory: `model_chatglm = ChatGLMForConditionalGeneration.from_pretrained(pretrained_model_name_or_path)` followed by `model_chatglm = model_chatglm.half()`. Loading it this way raises the error above: `model_chatglm = ChatGLMForConditionalGeneration.from_pretrained(pretrained_model_name_or_path, load_in_8bit=True, device_map="auto")`

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

> INT8 training is not very stable; FP16 is still recommended. LayerNorm is sensitive and needs FP16 or FP32 to be reasonably stable. As in the title: for INT8, following t10_lora_trl_train_ppo.py, add
>
> > ```python
> > model = prepare_model_for_int8_training(
> >     model,
> >     use_gradient_checkpointing=True,
> >     output_embedding_layer_name="lm_head",
> >     # layer_norm_names=[],
> >     layer_norm_names=[
> >         "post_attention_layernorm",
> >         "input_layernorm",
> >         "ln_f",
> >     ],
> > )
> > ```
> > ...

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

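> OK, thank you very much! May I ask you two more questions?
>
> 1. After running t10_lora_trl_train_ppo.py, how large should the saved bin file be? The one I got is only 17.5 KB.
> 2. After running t10_toy_trl_train_ppo.py with load_in_8bit, the saved weights are only 6875.5 MB. Is there a way to save a bin with the same parameter size as the original ChatGLM? Or is the only way to match the original model to train with LoRA and then merge the adapter?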
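
As a sanity check on the 17.5 KB figure in the first question, the expected size of a LoRA adapter can be estimated from its shapes. The sketch below is only illustrative: the rank (8), the single target projection per layer (4096 inputs, 12288 outputs), the layer count (28), and FP32 storage are all assumptions that do not come from the thread, so substitute the values from your own config.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


def adapter_size_bytes(per_layer_params: int, num_layers: int,
                       bytes_per_param: int) -> int:
    """Approximate size of the saved adapter weights (file metadata ignored)."""
    return per_layer_params * num_layers * bytes_per_param


if __name__ == "__main__":
    # All values below are assumed for illustration, not taken from the thread.
    per_layer = lora_param_count(in_features=4096, out_features=12288, rank=8)
    size = adapter_size_bytes(per_layer, num_layers=28, bytes_per_param=4)
    print(f"LoRA parameters per layer: {per_layer}")
    print(f"Estimated adapter size: {size / 2**20:.1f} MiB")
```

Under these assumptions the adapter would be on the order of megabytes, so a file of a few kilobytes suggests the saved state dict may contain few or no LoRA tensors and is worth inspecting.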