🐛 Describe the bug
I encountered a RuntimeError while running a full fine-tuning experiment with LLaMA-Factory on a model loaded in BFloat16 precision. The error occurs during training, inside the fused_linear_cross_entropy_forward operation: the traceback reports a dtype mismatch between mat1 and mat2 (BFloat16 vs. Float). The models used were Qwen2.5 3B and Llama 3.2 3B.
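For context, the failing torch.addmm call rejects mixed-dtype operands. A minimal standalone sketch of the same mismatch (toy tensors and shapes, not Liger-Kernel code):

```python
import torch

# Toy shapes; only the dtype combination matters here.
hidden = torch.randn(8, 16, dtype=torch.bfloat16)  # activations in bf16, as under autocast
weight = torch.randn(16, 32, dtype=torch.float32)  # weight still in fp32
bias = torch.zeros(8, 32, dtype=torch.bfloat16)

# Fails with a RuntimeError about mat1/mat2 dtype mismatch, matching the log below.
torch.addmm(bias, hidden, weight)
```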
Error Log
```
0% 0/1376 [00:00<?, ?it/s]Traceback (most recent call last):
File "/usr/local/bin/llamafactory-cli", line 8, in <module>
sys.exit(main())
File "/content/LLaMA-Factory/src/llamafactory/cli.py", line 111, in main
run_exp()
File "/content/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/content/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 96, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2052, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2388, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3485, in training_step
loss = self.compute_loss(model, inputs)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3532, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 820, in forward
return model_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 808, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
return func(*args, **kwargs)
File "/content/LLaMA-Factory/Liger-Kernel/src/liger_kernel/transformers/model/qwen2.py", line 108, in lce_forward
loss = lce(self.lm_head.weight, shift_hidden_states, shift_labels)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/content/LLaMA-Factory/Liger-Kernel/src/liger_kernel/transformers/fused_linear_cross_entropy.py", line 13, in forward
return LigerFusedLinearCrossEntropyFunction.apply(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/content/LLaMA-Factory/Liger-Kernel/src/liger_kernel/ops/fused_linear_cross_entropy.py", line 221, in forward
loss, grad_input, grad_weight, grad_bias = fused_linear_cross_entropy_forward(
File "/content/LLaMA-Factory/Liger-Kernel/src/liger_kernel/ops/fused_linear_cross_entropy.py", line 122, in fused_linear_cross_entropy_forward
torch.addmm(
RuntimeError: mat1 and mat2 must have the same dtype, but got BFloat16 and Float
```
Reproduce
Steps to Reproduce
1. Use Colab with an A100 40GB GPU.
2. Run a full fine-tuning experiment with LLaMA-Factory on a model in BFloat16 precision, with the Liger kernel enabled.
3. Observe the RuntimeError during the training step.
Expected Behavior
The training process should complete without a RuntimeError caused by the dtype mismatch.
Temporary Fix
Comment out the offending cast in fused_linear_cross_entropy_forward (src/liger_kernel/ops/fused_linear_cross_entropy.py, line 101): logits_chunk = logits_chunk.to(dtype)
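Rather than removing the cast, another possible workaround is to align the operand dtypes explicitly before the addmm. A minimal sketch of that idea (hypothetical helper, not the actual Liger-Kernel code):

```python
import torch

def addmm_matching_dtypes(bias: torch.Tensor, mat1: torch.Tensor, mat2: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: cast bias and mat2 to mat1's dtype so mixed
    bf16/fp32 operands do not trigger the dtype-mismatch RuntimeError."""
    return torch.addmm(bias.to(mat1.dtype), mat1, mat2.to(mat1.dtype))

# bf16 activations against an fp32 weight no longer raise.
hidden = torch.randn(8, 16, dtype=torch.bfloat16)
weight = torch.randn(16, 32, dtype=torch.float32)
bias = torch.zeros(32, dtype=torch.float32)
out = addmm_matching_dtypes(bias, hidden, weight)
print(out.dtype)  # torch.bfloat16
```

Casting toward mat1's dtype keeps the matmul in bf16; casting everything to fp32 instead would also avoid the error, at the cost of extra memory and lower throughput.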
Versions
Liger-Kernel: main branch