n8programs
Confirmed, SGD works.
But my god, float32 is brutal. 1/10th the speed of float16...
How come mlx fails in 16-bit if most big models are pretrained that way? Is it because it doesn't use bfloat16?
Got it. Thank you for the info!
Can confirm the effectiveness of float32 end-to-end tuning on tinyllama.
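For reference, the cast itself is just a map over the parameter tree; a minimal sketch, assuming `model` is an `mlx.nn.Module` (the helper name is illustrative, not from this thread):

```python
import mlx.core as mx
import mlx.nn as nn
from mlx.utils import tree_map

def cast_model_to_float32(model: nn.Module) -> nn.Module:
    # Walk the nested parameter dict, upcast every array to float32,
    # then push the upcast parameters back into the module.
    fp32_params = tree_map(lambda p: p.astype(mx.float32), model.parameters())
    model.update(fp32_params)
    return model
```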
Do you perform your full fine-tune in float32?
Tried training qwen-1.8b. NaN loss immediately. Will try phi-2.
Think it's the float16.
Just checked - NaN w/ phi.
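If anyone wants to pin down where it blows up, a small guard along these lines (names are illustrative; `loss` and `grads` are assumed to come from the train step) flags the exact step and parameter before the optimizer update bakes the NaNs into the weights:

```python
import math
import mlx.core as mx
from mlx.utils import tree_flatten

def assert_finite(loss, grads, step):
    # Stop as soon as the loss or any gradient goes non-finite.
    if math.isnan(loss.item()) or math.isinf(loss.item()):
        raise RuntimeError(f"non-finite loss at step {step}")
    for name, g in tree_flatten(grads):
        if mx.isnan(g).any().item():
            raise RuntimeError(f"NaN gradient in {name} at step {step}")
```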
In-Python implementation, yoinked from torch and ported w/ Claude - appears to work in training, though:

```python
import mlx.core as mx

def _compute_T1(A):
    """I + A"""
    return mx.eye(A.shape[-1]) + A

def _compute_T2(A):
    """I +...
```
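The snippet gets cut off at `_compute_T2`. These look like ports of the Taylor-polynomial helpers behind torch's `matrix_exp`, so, assuming the same pattern, a hedged reconstruction of the cut-off helper plus a quick sanity check (not verbatim from the original port) would be roughly:

```python
import mlx.core as mx

def _compute_T2(A):
    """I + A + A@A/2"""
    # Degree-2 Taylor polynomial of exp(A), mirroring torch's compute_T2.
    return mx.eye(A.shape[-1]) + A + A @ A / 2

# Sanity check on a small-norm matrix: the low-order approximation should
# closely match exp(A) obtained by scaling-and-squaring the same helper.
A = 0.01 * mx.random.normal((4, 4))
direct = _compute_T2(A)
half = _compute_T2(A / 4)
ref = half @ half
ref = ref @ ref                    # (T2(A/4))**4 ~= exp(A)
print(mx.abs(direct - ref).max())  # should be tiny
```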