[Feature Request]: Manually load/unload checkpoints into GPU
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What would your feature do?
I want to achieve the following either programmatically or via API:
- List of all checkpoints and their status (loaded or unloaded)
- Load a checkpoint
- Unload a checkpoint
Proposed workflow
- Retrieve the available checkpoints and their status via HTTP request, e.g. http://0.0.0.0:7860/sdapi/v1/checkpoint-status
- Load a specified checkpoint via HTTP request, e.g. http://0.0.0.0:7860/sdapi/v1/load-checkpoint?checkpointid=abc123
- Unload a specified checkpoint via HTTP request, e.g. http://0.0.0.0:7860/sdapi/v1/unload-checkpoint?checkpointid=abc123

The checkpoint parameter passed in steps 2 and 3 should be obtained from step 1. For example, the object returned in step 1 could contain a "uniqueid" key.
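A minimal sketch of how a client might consume the proposed workflow. None of these endpoints exist in webui yet; the response shape (including the "uniqueid" and "loaded" keys) is an assumption drawn from this request:

```python
# Hypothetical response from the proposed /sdapi/v1/checkpoint-status
# endpoint (shape is an assumption from this feature request).
sample_status = [
    {"uniqueid": "abc123", "title": "sd-v1-5.safetensors", "loaded": True},
    {"uniqueid": "def456", "title": "anything-v3.ckpt", "loaded": False},
]

def loaded_ids(status):
    """Ids of checkpoints currently loaded on the GPU."""
    return [c["uniqueid"] for c in status if c["loaded"]]

def unload_url(base, checkpoint_id):
    """URL for the proposed unload endpoint; the id comes from the status call."""
    return f"{base}/sdapi/v1/unload-checkpoint?checkpointid={checkpoint_id}"
```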
Additional information
I've made a fork but it only loads and unloads the currently selected checkpoint. The relevant endpoints are unloadmodel, loadmodel and get_model_status in api.py. https://github.com/AnthoneoJ/stable-diffusion-webui
why?
In my use case, the machine is running multiple AI services (one of them being this webui). There are several machines that do the same. So the checkpoints should be loaded upon machine boot up and unloaded if memory is needed for another AI service, etc.
Model loading is a mess in webui. I suggest you just settle with `Maximum number of checkpoints loaded at the same time` set to 1 and `Only keep one model on device` set to True.
To be honest, the two API endpoints /sdapi/v1/unload-checkpoint and /sdapi/v1/reload-checkpoint work more like putting webui to sleep and waking it from sleep: you can put it to sleep to save VRAM, but you need to manually wake it before use (bad design on our part).
There is an issue with /sdapi/v1/unload-checkpoint: if `Maximum number of checkpoints loaded at the same time` is > 1, the sleep will only send the current main model to RAM. It doesn't distinguish between models; it only cares about the main model. For example, with `Maximum number of checkpoints loaded at the same time` set to 3 and `Only keep one model on device` set to False, after switching models 3 or more times there will be 3 models loaded; if you then use /sdapi/v1/unload-checkpoint, only 1 model will be unloaded and 2 will still be loaded.
Changing the model (loading a model) can be done by POSTing to /sdapi/v1/options with:

```json
{
    "sd_model_checkpoint": "YOUR model"
}
```
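As a sketch, using only the standard library (the base URL and model title below are placeholders; the title must match an entry returned by /sdapi/v1/sd-models):

```python
import json
import urllib.request

def build_options_payload(model_title):
    # Changing sd_model_checkpoint makes webui load that checkpoint.
    return {"sd_model_checkpoint": model_title}

def set_model(base_url, model_title):
    """POST the new checkpoint to the options endpoint."""
    req = urllib.request.Request(
        f"{base_url}/sdapi/v1/options",
        data=json.dumps(build_options_payload(model_title)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example (placeholder title):
# set_model("http://0.0.0.0:7860", "v1-5-pruned-emaonly.safetensors")
```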
Alternatively, you can add an override_settings arg to the payload of a txt2img / img2img API call; this method is generally more reliable when dealing with multiple users:

```json
"override_settings": {
    "sd_model_checkpoint": "YOUR model"
}
```
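A per-request payload might look like the sketch below (prompt, steps, and model title are placeholders; `override_settings_restore_afterwards` controls whether the global option is restored once the request finishes):

```python
def txt2img_payload(prompt, model_title, steps=20):
    # override_settings applies only to this request, so concurrent
    # users don't clobber each other's global sd_model_checkpoint.
    return {
        "prompt": prompt,
        "steps": steps,
        "override_settings": {"sd_model_checkpoint": model_title},
        # Leave the global option untouched after the call.
        "override_settings_restore_afterwards": True,
    }
```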
You can get a list of all models via /sdapi/v1/sd-models.
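A small helper for that (standard library only; each entry returned by /sdapi/v1/sd-models carries a "title" field, which is what the options endpoint expects):

```python
import json
import urllib.request

def titles_from(models):
    """Extract the 'title' field from a /sdapi/v1/sd-models response."""
    return [m["title"] for m in models]

def list_model_titles(base_url):
    with urllib.request.urlopen(f"{base_url}/sdapi/v1/sd-models") as resp:
        return titles_from(json.load(resp))

# Example:
# list_model_titles("http://0.0.0.0:7860")
```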
It should be possible to improve this, but someone needs to want it enough to work on it; it might even be possible to implement this as an extension. I might try to work on this, but no guarantees.
Initially I was confused because I somehow misread your request as wanting to load every model in sequence and then unload them for no apparent reason.
These can also help: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Command-Line-Arguments-and-Settings (`--nowebui`, `--skip-load-model-at-start`)
Ah, thanks! I knew bits and pieces from inspecting the codebase. This puts them all together. One more thing before I can go off on my own: how do I know whether a model is currently loaded or not? At the moment, I'm inferring this from sd_models.model_data.sd_model. If it's None, the model is unloaded, and vice versa.
> `sd_models.model_data.sd_model`
yeah I think that's pretty much the place you want to look
However, if you also use `Checkpoints to cache in RAM` > 0, then I think you also want to inspect shared.opts.sd_checkpoint_cache and checkpoints_loaded.
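A hedged sketch of combining those two signals as one status check (a plain function for illustration: `sd_model` stands in for sd_models.model_data.sd_model, and `checkpoints_loaded` for the RAM cache dict keyed by checkpoint):

```python
def checkpoint_status(sd_model, checkpoints_loaded):
    """Return (is the main model loaded on device?, checkpoints cached in RAM)."""
    return sd_model is not None, list(checkpoints_loaded)
```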
If you have improvements that you think can benefit everyone, don't hesitate to contribute.
Ollama framework has a really handy environment and API accessible variable:
OLLAMA_KEEP_ALIVE=[# of seconds] | [xM] | 0
I think it's mostly used for people who want the last loaded chat model to stay loaded longer. But I use it set to zero to keep the GPU VRAM as empty as possible as soon as possible. This is because I have many users that mostly use the GPU for chat and occasionally for Text-to-speech and SD image creation - loading up the GPU VRAM. Unfortunately SDWeb keeps its last model loaded indefinitely. It would be great if SDWeb had a similar Keep Alive option to let us decide how long to keep the last model loaded.
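A hypothetical sketch of what such a keep-alive could look like; nothing like this exists in webui today, and `unload` here is a placeholder that would wrap the existing /sdapi/v1/unload-checkpoint endpoint:

```python
import threading

class KeepAlive:
    """Unload the model after `keep_alive` idle seconds; 0 = unload immediately."""

    def __init__(self, keep_alive, unload):
        self.keep_alive = keep_alive
        self.unload = unload
        self._timer = None

    def touch(self):
        # Call after every generation request; resets the idle countdown.
        if self._timer is not None:
            self._timer.cancel()
        if self.keep_alive == 0:
            self.unload()
        else:
            self._timer = threading.Timer(self.keep_alive, self.unload)
            self._timer.daemon = True
            self._timer.start()
```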