🐛 Describe the bug
I encountered a RuntimeError while running a full fine-tuning experiment with LLaMA-Factory on a model loaded in BFloat16 precision. The error occurs during training, inside the fused_linear_cross_entropy_forward operation: the traceback reports a dtype mismatch between mat1 and mat2 (BFloat16 vs. Float). The models used were Qwen2.5 3B and Llama 3.2 3B.
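For context, the failing torch.addmm call rejects mixed-dtype operands. A minimal standalone sketch of the same mismatch (toy tensors and shapes, not Liger-Kernel code):

```python
import torch

# Toy shapes; only the dtype combination matters here.
hidden = torch.randn(8, 16, dtype=torch.bfloat16)  # activations in bf16, as under autocast
weight = torch.randn(16, 32, dtype=torch.float32)  # weight still in fp32
bias = torch.zeros(8, 32, dtype=torch.bfloat16)

# Fails with a RuntimeError about mat1/mat2 dtype mismatch, matching the log below.
torch.addmm(bias, hidden, weight)
```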
Error Log
```
0% 0/1376 [00:00<?, ?it/s]Traceback (most recent call last):
File "/usr/local/bin/llamafactory-cli", line 8, in <module>
sys.exit(main())
File "/content/LLaMA-Factory/src/llamafactory/cli.py", line 111, in main
run_exp()
File "/content/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/content/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 96, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2052, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2388, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3485, in training_step
loss = self.compute_loss(model, inputs)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3532, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 820, in forward
return model_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 808, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
return func(*args, **kwargs)
File "/content/LLaMA-Factory/Liger-Kernel/src/liger_kernel/transformers/model/qwen2.py", line 108, in lce_forward
loss = lce(self.lm_head.weight, shift_hidden_states, shift_labels)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/content/LLaMA-Factory/Liger-Kernel/src/liger_kernel/transformers/fused_linear_cross_entropy.py", line 13, in forward
return LigerFusedLinearCrossEntropyFunction.apply(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/content/LLaMA-Factory/Liger-Kernel/src/liger_kernel/ops/fused_linear_cross_entropy.py", line 221, in forward
loss, grad_input, grad_weight, grad_bias = fused_linear_cross_entropy_forward(
File "/content/LLaMA-Factory/Liger-Kernel/src/liger_kernel/ops/fused_linear_cross_entropy.py", line 122, in fused_linear_cross_entropy_forward
torch.addmm(
RuntimeError: mat1 and mat2 must have the same dtype, but got BFloat16 and Float
```
Reproduce
Steps to Reproduce
1. Use Colab with an A100 40GB GPU.
2. Run a full fine-tuning experiment with LLaMA-Factory on a model in BFloat16 precision, with the Liger kernel enabled.
3. Observe the RuntimeError during the training step.
Expected Behavior
The training process should complete without a RuntimeError caused by the dtype mismatch.
Temporary Fix
Comment out the offending cast in fused_linear_cross_entropy_forward (src/liger_kernel/ops/fused_linear_cross_entropy.py, line 101): logits_chunk = logits_chunk.to(dtype)
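Rather than removing the cast, another possible workaround is to align the operand dtypes explicitly before the addmm. A minimal sketch of that idea (hypothetical helper, not the actual Liger-Kernel code):

```python
import torch

def addmm_matching_dtypes(bias: torch.Tensor, mat1: torch.Tensor, mat2: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: cast bias and mat2 to mat1's dtype so mixed
    bf16/fp32 operands do not trigger the dtype-mismatch RuntimeError."""
    return torch.addmm(bias.to(mat1.dtype), mat1, mat2.to(mat1.dtype))

# bf16 activations against an fp32 weight no longer raise.
hidden = torch.randn(8, 16, dtype=torch.bfloat16)
weight = torch.randn(16, 32, dtype=torch.float32)
bias = torch.zeros(32, dtype=torch.float32)
out = addmm_matching_dtypes(bias, hidden, weight)
print(out.dtype)  # torch.bfloat16
```

Casting toward mat1's dtype keeps the matmul in bf16; casting everything to fp32 instead would also avoid the error, at the cost of extra memory and lower throughput.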
Versions
Liger-Kernel: main branch