Sudden nan values from the loss during LoRA training

Open MH-Python opened this issue 1 year ago • 0 comments

Thank you for the nice, compact work. We recently started hitting an ambiguous error that causes the loss to become NaN during training. After enabling anomaly detection with torch.autograd.set_detect_anomaly(True), we got this:

UserWarning: Error detected in MmBackward0. Traceback of forward call that caused the error:
...stacktrace...
.venv/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 569, in forward
    result = result + lora_B(lora_A(dropout(x))) * scaling
...stacktrace...
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'MmBackward0' returned nan values in its 1th output.
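Since the trace only names the matmul in the LoRA branch, one way to narrow this down might be to scan the parameters and their gradients for non-finite values right after the backward pass. The following is only a rough sketch: `find_non_finite` is a hypothetical helper (not part of PEFT or PyTorch), and the two-layer `toy` model merely stands in for the `lora_B(lora_A(x))` path with arbitrary sizes.

```python
import torch
from torch import nn


def find_non_finite(model: nn.Module) -> list[str]:
    """Return names of parameters or gradients that contain NaN or inf."""
    bad = []
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            bad.append(f"{name} (weight)")
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(f"{name} (grad)")
    return bad


torch.manual_seed(0)

# Toy stand-in for a LoRA branch (hypothetical sizes, no real adapter involved).
toy = nn.Sequential(nn.Linear(4, 2, bias=False), nn.Linear(2, 4, bias=False))

x = torch.randn(3, 4)
x[0, 0] = float("inf")  # simulate an activation that overflowed upstream

loss = toy(x).sum()
loss.backward()

# The weights stay finite here, but their gradients do not.
print(find_non_finite(toy))
```

In a real training loop the same check could run after `loss.backward()` and before `optimizer.step()`, so the offending step can be skipped or inspected before the weights are overwritten.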

Could it be caused by some numerical instability (nan or inf)?
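To illustrate why an inf alone could be enough: the backward pass of a matrix product multiplies the upstream gradient by the other operand, and under IEEE 754 arithmetic inf * 0 is NaN. The plain-Python sketch below shows that mechanism with made-up numbers; `matmul`, `grad_out`, and `weight` are illustrative names, and the zero entry is an assumption (for instance a weight that is still at a zero initialization), not something taken from the trace above.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Upstream gradient with a single overflowed (inf) entry.
grad_out = [[math.inf, 1.0]]

# Other operand of the matmul; note the exact zero in the first row.
weight = [[0.0, 2.0],
          [3.0, 4.0]]

# Gradient with respect to the matmul input: grad_out @ weight.
grad_in = matmul(grad_out, weight)

# inf * 0.0 is NaN, so the first entry is NaN; inf * 2.0 stays inf.
print(math.isnan(grad_in[0][0]), math.isinf(grad_in[0][1]))
```

If this is what happens here, the NaN reported by MmBackward0 would be a symptom, and the first inf would sit somewhere earlier in the forward or backward pass.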

Dec 04 '24 09:12 MH-Python