
ValueError: Attempting to unscale FP16 gradients.

Open · QuangTQV opened this issue 1 year ago · 4 comments

Here is the Google Colab link I used for fine-tuning : https://colab.research.google.com/drive/1kiALBR1UarPobiftZmiHfwFyk7hTCDnV?usp=sharing

When I fine-tune the LLM-embed for tool retrieval using the command on Google Colab: image An error occurred:

04/30/2024 23:52:47 - INFO - faiss.loader - Loading faiss with AVX2 support. 04/30/2024 23:52:47 - INFO - faiss.loader - Could not load library with AVX2 support due to: ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'") 04/30/2024 23:52:47 - INFO - faiss.loader - Loading faiss. 04/30/2024 23:52:47 - INFO - faiss.loader - Successfully loaded faiss. 2024-04-30 23:52:47.990022: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-04-30 23:52:47.990076: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-04-30 23:52:47.991470: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2024-04-30 23:52:49.207020: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT 04/30/2024 23:52:49 - INFO - src.retrieval.modeling_dense - Loading tokenizer and model from BAAI/bge-base-en... max_steps is given, it will override any value given in num_train_epochs 0% 0/2000 [00:00<?, ?it/s]Traceback (most recent call last): File "/content/FlagEmbedding/FlagEmbedding/llm_embedder/run_dense.py", line 157, in main() File "/content/FlagEmbedding/FlagEmbedding/llm_embedder/run_dense.py", line 150, in main trainer.train() File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1859, in train return inner_training_loop( File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2249, in inner_training_loop grad_norm = self.accelerator.clip_grad_norm( File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2157, in clip_grad_norm self.unscale_gradients() File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2107, in unscale_gradients self.scaler.unscale_(opt) File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 336, in unscale_ optimizer_state["found_inf_per_device"] = self.unscale_grads( File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 258, in unscale_grads raise ValueError("Attempting to unscale FP16 gradients.") ValueError: Attempting to unscale FP16 gradients. 0% 0/2000 [00:01<?, ?it/s] [2024-04-30 23:53:01,805] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 5985) of binary: /usr/bin/python3 Traceback (most recent call last): File "/usr/local/bin/torchrun", line 8, in sys.exit(main()) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper return f(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main run(args) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in call return launch_agent(self._config, self._entrypoint, list(args)) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/content/FlagEmbedding/FlagEmbedding/llm_embedder/run_dense.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2024-04-30_23:53:01
  host      : f8adfa8a5d97
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5985)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

QuangTQV · May 01 '24

Can anyone help me? Thanks.

QuangTQV · May 02 '24

Hi, please try specifying --dtype fp32 in the training script.

namespace-Pt · May 02 '24
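This error typically means the model parameters were loaded in FP16 while fp16 mixed-precision training was also enabled: GradScaler.unscale_() refuses to unscale gradients that are themselves FP16. Keeping the parameters in FP32 and letting autocast handle the half-precision compute avoids it, which is the effect of passing --dtype fp32 above. Below is a minimal sketch of the failure mode and the working setup; it uses a toy linear model, not the actual llm_embedder training code.

```python
# Minimal sketch (toy model, not the llm_embedder code) of why the error appears
# and how keeping parameters in FP32 avoids it.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 4).cuda()      # parameters stay in fp32
# model = model.half()                     # loading the weights in fp16 like this is
#                                          # what makes unscale_() raise
#                                          # "Attempting to unscale FP16 gradients."
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 16, device="cuda")
target = torch.randn(8, 4, device="cuda")

with torch.cuda.amp.autocast(dtype=torch.float16):  # fp16 compute, fp32 master weights
    loss = F.mse_loss(model(x), target)

scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # works: gradients are fp32; fails if the params were fp16
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)
scaler.update()
```

In the Colab this corresponds to adding --dtype fp32 to the torchrun run_dense.py command, as suggested above.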

Hi, please try specifying --dtype fp32 in the training script.

After fine-tuning, I tested several cases. The positive samples scored around 0.9, while the negative samples scored around 0.84. This gap feels too small to be acceptable; how can I widen it? I fine-tuned for the retrieval task.

QuangTQV · May 02 '24

Hi, this is a direct consequence of contrastive learning. It only encourages positives to score higher than negatives; it does not guarantee that the gap between them is large. You can try a margin-based loss to enforce a larger gap between positives and negatives. However, the model may be harder to train with losses other than the standard contrastive objective.

namespace-Pt · May 04 '24
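To illustrate the margin-based idea mentioned above (this is a sketch, not FlagEmbedding's actual loss), a hinge/ranking loss only reaches zero once the positive's cosine similarity exceeds every negative's by at least a chosen margin, which directly targets the 0.9-vs-0.84 gap described in the question. The function name, the 0.2 margin, and the 768-dimensional toy embeddings below are all illustrative.

```python
# Illustrative margin (hinge) ranking loss over cosine similarities.
# Loss is zero only when pos_score > neg_score + margin for every negative.
import torch
import torch.nn.functional as F

def margin_ranking_loss(q, pos, negs, margin=0.2):
    """q: (d,) query embedding; pos: (d,) positive; negs: (n, d) negatives."""
    q = F.normalize(q, dim=-1)
    pos = F.normalize(pos, dim=-1)
    negs = F.normalize(negs, dim=-1)
    pos_score = (q * pos).sum()   # cosine similarity to the positive
    neg_scores = negs @ q         # cosine similarities to the negatives
    # hinge: penalize any negative that comes within `margin` of the positive
    return torch.clamp(margin - (pos_score - neg_scores), min=0).mean()

# toy usage
q, pos, negs = torch.randn(768), torch.randn(768), torch.randn(7, 768)
print(margin_ranking_loss(q, pos, negs))
```

A larger margin pushes the model toward a wider score gap, but, as noted above, such losses can be harder to optimize than the standard InfoNCE-style contrastive objective.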