Account for the VRAM cost of weight offloading (after next stable)

Open rattus128 opened this issue 2 months ago • 0 comments

Add accounting for the VRAM cost of weight offloading to avoid VRAM OOMs that occur due to the offload process having to buffer and manipulate weights.

This is particularly a risk with LoRAs, --async-offload, or both.

This also budgets for the non --async-offload case (regular offload), as in theory that could occur in and around a VRAM peak. Async offload is far more dangerous, though, as there is no timing guarantee and the peak of compute could overlap the peak of offload.

Primary commit message:

when checking the VRAM headroom, assume that the weight needs to be offloaded, and only load it if there is space for both the load and the offload times the number of streams.

As the weights are ordered from largest to smallest by offload cost this is guaranteed to fit in VRAM (tm), as all weights that follow will be smaller.
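To make the headroom rule concrete, here is a minimal sketch of how such a check could look. It is not the code from the commit: `Weight`, `plan_load`, `budget` and the running `reserve` are hypothetical names, and keeping the largest reserve seen so far is an assumption made so that the invariant `used + reserve <= budget` holds in this toy model.

```python
from dataclasses import dataclass


@dataclass
class Weight:
    name: str
    size: int          # bytes used if the weight stays resident in VRAM
    offload_cost: int  # bytes of temporary VRAM needed to stream it in when offloaded


def plan_load(weights, budget, num_streams=1):
    """Hypothetical planner: returns (resident_names, offloaded_names, reserve)."""
    # Largest offload cost first, so every later weight needs a smaller buffer.
    ordered = sorted(weights, key=lambda w: w.offload_cost, reverse=True)
    resident, offloaded = [], []
    used = 0     # bytes committed to resident weights
    reserve = 0  # bytes held back for offload buffers (assumption: max seen so far)
    for w in ordered:
        # Assume this weight might have to be offloaded: one buffer per stream.
        needed_reserve = max(reserve, w.offload_cost * num_streams)
        if used + w.size + needed_reserve <= budget:
            resident.append(w.name)
            used += w.size
        else:
            offloaded.append(w.name)
            reserve = needed_reserve
    return resident, offloaded, reserve


weights = [Weight("a", 400, 400), Weight("b", 300, 300),
           Weight("c", 200, 200), Weight("d", 100, 100)]
one_stream = plan_load(weights, budget=1000, num_streams=1)
two_streams = plan_load(weights, budget=1000, num_streams=2)
```

With one stream, only `c` is offloaded and 200 bytes stay reserved. With two streams, the reserve for `a` alone is 800 bytes, so far fewer weights can stay resident.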

Make the partial unload aware of this system as well by saving the budget for offload VRAM to the model state and accounting accordingly. It's possible that partial unload increases the size of the largest offloaded weights, and thus needs to unload a little bit more than asked to accommodate the bigger temp buffers.
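A rough sketch of why a partial unload may have to free more than requested, under the assumption that the saved offload budget is a single byte count and that weights are unloaded in the order given. `partially_unload`, `saved_reserve` and the tuple layout are hypothetical, not names from the codebase.

```python
def partially_unload(resident, to_free, saved_reserve, num_streams=1):
    """Hypothetical helper: returns (unloaded_names, new_reserve).

    resident: list of (name, size, offload_cost) tuples, in unload order.
    saved_reserve: offload buffer budget previously saved on the model state.
    """
    unloaded = []
    freed = 0
    reserve = saved_reserve
    for name, size, offload_cost in resident:
        # Net gain is what was freed minus any growth in the offload reserve.
        if freed - (reserve - saved_reserve) >= to_free:
            break
        unloaded.append(name)
        freed += size
        # A newly offloaded weight may need a bigger temp buffer than before.
        reserve = max(reserve, offload_cost * num_streams)
    return unloaded, reserve


resident = [("a", 100, 150), ("b", 100, 50), ("c", 100, 50)]
grown = partially_unload(resident, to_free=100, saved_reserve=100)
unchanged = partially_unload(resident, to_free=100, saved_reserve=200)
```

In the first call, unloading `a` frees 100 bytes but grows the reserve by 50, so `b` has to go as well. In the second call the saved reserve already covers `a`, so unloading `a` alone is enough.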

Honor the existing code's 128MB floor on model weight loading by having the patcher honor it separately, without regard to offloading. Otherwise, when MM specifies its 128MB minimum, MP will see the biggest weights, budget that 128MB entirely to the offload buffer and load nothing, which isn't the intent of these minimums. The same clamp applies in the case of partial offload of the currently loading model.
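One way to read the floor rule, as a sketch only: below the floor, the offload reserve is ignored, and above it the offload-aware check applies. `should_load`, `floor` and the other parameter names are hypothetical and chosen for illustration.

```python
MB = 1024 * 1024
MIN_WEIGHT_BYTES = 128 * MB  # the 128MB floor described above


def should_load(used, size, offload_reserve, budget, floor=MIN_WEIGHT_BYTES):
    """Hypothetical check: decide whether one more weight may stay resident."""
    # Under the floor, load regardless of offload buffers so the minimum
    # is spent on actual weights rather than on the offload reserve.
    if used + size <= floor:
        return True
    # Past the floor, the weight must fit together with the offload reserve.
    return used + size + offload_reserve <= budget


with_floor = should_load(0, 64 * MB, 100 * MB, 128 * MB)
without_floor = should_load(0, 64 * MB, 100 * MB, 128 * MB, floor=0)
```

With the floor in place, a 64MB weight is loaded even though a 100MB offload reserve would not fit next to it in a 128MB budget. With `floor=0` the same weight is rejected, which is the "load nothing" outcome the commit message warns about.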

Nov 13 '25 04:11 rattus128