klykq111 comments

Results 15 comments of klykq111

ValueError: Attempting to unscale FP16 gradients.

0.3.0.dev0

ValueError: Attempting to unscale FP16 gradients.

This is my training script:

```
#!/bin/bash
lr=2e-4
lora_rank=8
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.1
pretrained_model="models/ziqingyang_chinese-llama-plus-7b"
chinese_tokenizer_path="models/ziqingyang_chinese-llama-plus-7b"
dataset_dir="data_clm"
data_cache="data_cache"
per_device_batch_size=1
seed=666
training_epochs=1
gradient_accumulation_steps=1
output_dir="llama_finetune"

CUDA_VISIBLE_DEVICES=1 python scripts/run_clm_pt_with_peft.py \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    ...
```

ValueError: Attempting to unscale FP16 gradients.

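These are all the libraries in my current environment. The peft version is the one the install script asks for, and the installation finished without errors. ![2023-05-11_15-55](https://github.com/ymcui/Chinese-LLaMA-Alpaca/assets/42060953/fffd5dff-bd03-48d9-8515-69a7901b1f3e)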

ValueError: Attempting to unscale FP16 gradients.

After I removed `--fp16`, a different error appeared: "RuntimeError: expected scalar type Half but found Float".
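
PyTorch's gradient scaler raises "Attempting to unscale FP16 gradients." when the gradients it is asked to unscale are themselves fp16, which happens when trainable weights are stored in half precision. A minimal sketch for checking this is below. The helper name `fp16_trainable_parameters` is hypothetical and not part of the training script; it only assumes each parameter exposes `requires_grad` and `dtype`, as PyTorch parameters do.

```python
def fp16_trainable_parameters(named_parameters):
    """Return the names of trainable parameters stored in float16.

    `named_parameters` is any iterable of (name, parameter) pairs,
    e.g. `model.named_parameters()` from a PyTorch module.
    """
    return [
        name
        for name, p in named_parameters
        # str(torch.float16) is "torch.float16"; comparing strings keeps
        # this helper importable without torch.
        if p.requires_grad and str(p.dtype) == "torch.float16"
    ]

# Possible usage inside the training script (assumes `model` is already built):
# print(fp16_trainable_parameters(model.named_parameters()))
```

If the returned list is non-empty while `--fp16` is enabled, those are the parameters whose gradients the scaler would refuse to unscale.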

ValueError: Attempting to unscale FP16 gradients.

I looked at these two threads:

https://huggingface.co/CompVis/stable-diffusion-v1-4/discussions/10
https://github.com/d8ahazard/sd_dreambooth_extension/issues/37

and tried changing https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/fb27d3ba607b0591610b874b580e8571859521f8/scripts/run_clm_pt_with_peft.py#L585 to:

```
with torch.autocast("cuda"):
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
```

With that change training runs, but only the first logged loss has a value; all the others are 0: ![2023-05-11_17-03](https://github.com/ymcui/Chinese-LLaMA-Alpaca/assets/42060953/b3c1e846-7dbc-4c48-93b9-a92a24098da8)

ValueError: Attempting to unscale FP16 gradients.

> After I removed `--fp16`, a different error appeared: "RuntimeError: expected scalar type Half but found Float".

Alternatively, adding `model.half()` on the line right after https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/fb27d3ba607b0591610b874b580e8571859521f8/scripts/run_clm_pt_with_peft.py#L557 also lets training run, but the loss problem is still not solved.

ValueError: Attempting to unscale FP16 gradients.

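For the loss problem, I tried deleting the argument at https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/fb27d3ba607b0591610b874b580e8571859521f8/scripts/run_clm_pt_with_peft.py#L556, so that "embed_tokens,lm_head" are no longer trained, and the loss became normal. However, the share of trainable parameters dropped from 6.215% to 0.2895%.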
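
The trainable-parameter percentage quoted above can be recomputed by hand. The sketch below is one way to do it; `trainable_percent` is a hypothetical helper, not something from the repository, and it only assumes each parameter exposes `numel()` and `requires_grad`, as PyTorch parameters do.

```python
def trainable_percent(parameters):
    """Return the percentage of parameter elements with requires_grad=True.

    `parameters` is any iterable of objects exposing `numel()` and
    `requires_grad`, e.g. `model.parameters()` from a PyTorch module.
    """
    trainable = 0
    total = 0
    for p in parameters:
        n = p.numel()
        total += n
        if p.requires_grad:
            trainable += n
    if total == 0:
        return 0.0  # avoid dividing by zero for an empty model
    return 100.0 * trainable / total

# Possible usage inside the training script (assumes `model` is already built):
# print(f"trainable: {trainable_percent(model.parameters()):.4f}%")
```

Running it once with and once without `modules_to_save` should make the size of the drop visible, since the embedding and output layers are large compared with the LoRA adapters.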

ValueError: Attempting to unscale FP16 gradients.

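> I see you have the bitsandbytes dependency installed. Did you adapt the script for int8 yourself?

No. I only tried load_in_8bit after hitting "ValueError: Attempting to unscale FP16 gradients.", to see whether it would solve the problem, which is why the library is installed. In my tests, using load_in_8bit and removing `--fp16` does resolve both "ValueError: Attempting to unscale FP16 gradients." and "RuntimeError: expected scalar type Half but found Float", but the loss problem still only goes away when modules_to_save is removed.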

ValueError: Attempting to unscale FP16 gradients.

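> > > I see you have the bitsandbytes dependency installed. Did you adapt the script for int8 yourself?
> > >
> > > No. I only tried load_in_8bit after hitting "ValueError: Attempting to unscale FP16 gradients.", to see whether it would solve the problem, which is why the library is installed. In my tests, using load_in_8bit and removing `--fp16` does resolve both "ValueError: Attempting to unscale FP16 gradients." and "RuntimeError: expected scalar type Half but found Float", but the loss problem still only goes away when modules_to_save is removed.
>...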

ValueError: Attempting to unscale FP16 gradients.

Hello, adding the line model.half() to the run_clm_with_peft file did not solve the issue of loss being 0. Later, I used the latest code and ensured training with deepspeed, and...