ComfyUI
retune lowVramPatch VRAM accounting
In the lowvram case, the patch math is now done in the model dtype after de-quantization, so account for that in the VRAM estimate. The patching was also put back on the compute stream, which takes it off the peak, so relax the MATH_FACTOR to only 2x and drop the worst-case assumption that everything peaks at once.
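
As a rough illustration of the accounting change (a minimal sketch; `estimate_patch_vram` and its signature are hypothetical names for this example, not ComfyUI's actual API), the estimate is sized from the de-quantized model dtype rather than the stored fp8 dtype, with the peak multiplier relaxed to 2x:

```python
import torch

# Illustrative sketch only: MATH_FACTOR and estimate_patch_vram are
# hypothetical names, not ComfyUI's actual identifiers.
MATH_FACTOR = 2  # relaxed from the old worst-case multiplier

def estimate_patch_vram(weight: torch.Tensor, model_dtype: torch.dtype) -> int:
    """Rough bytes needed to patch one lowvram weight (e.g. apply a LoRA).

    The patch math runs in the de-quantized model dtype (bf16/fp16),
    not in the stored dtype (e.g. fp8), so size the temporaries from that.
    """
    element_size = torch.empty(0, dtype=model_dtype).element_size()
    dequant_bytes = weight.numel() * element_size
    # Patching happens on the compute stream, so its temporaries no longer
    # have to be assumed to coincide with every other allocation's peak;
    # 2x covers the de-quantized weight plus the patched result.
    return dequant_bytes * MATH_FACTOR

# Example: a 4096x4096 fp8 weight patched in bf16 needs ~64 MiB of headroom.
print(estimate_patch_vram(torch.empty(4096, 4096, dtype=torch.float8_e4m3fn),
                          torch.bfloat16))
```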
RTX 3060, flux2 fp8 with a LoRA:
Before:
After: