How do I run inference with multiple models without maxing out my GPU VRAM?
I'm trying to tag a dataset using more than one WD14 model, so I wrote a simple script that iterates over all the files in a directory for every model in a list, like this:
from imgutils.tagging import get_wd14_tags

model_list = ['SwinV2', 'ConvNextV2', 'MOAT', 'ViT']
for m in model_list:
    for child in directory.glob('**/*'):  # directory is a pathlib.Path to the dataset
        ratings, features, chars = get_wd14_tags(child, model_name=m, general_threshold=thresh)
My problem is that after every pass through model_list, inference time increases a lot. On the first loop it takes ~0.15 seconds to extract the tags from each image, no matter how many images there are, but by the time I'm doing the 4th loop it takes 15 seconds. However, if I run the script 4 times with a single model in the list, every model takes the same 0.15 seconds.
Running it with Task Manager open, I noticed that every time it loads a new model, the dedicated VRAM used by Python increases by ~1.5 GB, so once I reach the third loop my poor 970 doesn't have any free memory left. I guess it then starts using system RAM, and that's why it slows down.
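(For reference, something like this could log the VRAM usage from the script itself instead of watching Task Manager; it assumes the NVIDIA driver's nvidia-smi tool is on the PATH:)

import subprocess

def vram_used_mib():
    # ask nvidia-smi for the GPU's currently used memory, in MiB
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'])
    return int(out.decode().splitlines()[0])

print(vram_used_mib(), 'MiB of VRAM in use')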
Is there a way to free the VRAM before loading a new model? I tried looking in the ONNX Runtime documentation, but it's way above my level of understanding.
I'm running it on Win10 / onnxruntime-gpu / CUDA 11.8.
by the time I'm doing the 4th loop it takes 15 seconds
Does this mean 15 seconds per image, or something else?
Running it with Task Manager open, I noticed that every time it loads a new model, the dedicated VRAM used by Python increases by ~1.5 GB, so once I reach the third loop my poor 970 doesn't have any free memory left. I guess it then starts using system RAM, and that's why it slows down.
I remember the GeForce GTX 790 having at least 12 GB of VRAM, so this doesn't seem to be related to VRAM.
Yes, it's 15 seconds per image once I max out the VRAM. Unfortunately the GTX 970 only has 3.5 GB of VRAM; it's almost 10 years old at this point. Maybe you're thinking of some newer AMD card with a similar name.
Actually, we can release VRAM by clearing the cache. The source code is available here: https://github.com/deepghs/imgutils/blob/main/imgutils/tagging/wd14.py#L69
Here's how you can use it:
from imgutils.tagging.wd14 import _get_wd14_model
_get_wd14_model.cache_clear()
Once the cache is cleared, the previously loaded model will be released.
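For example, applying it to the loop from your question (a sketch, reusing your model_list, directory, and thresh), you could clear the cache every time you switch models:

from imgutils.tagging import get_wd14_tags
from imgutils.tagging.wd14 import _get_wd14_model

for m in model_list:
    for child in directory.glob('**/*'):
        ratings, features, chars = get_wd14_tags(child, model_name=m, general_threshold=thresh)
    # release the cached model for this pass before the next one is loaded
    _get_wd14_model.cache_clear()

This way only one model's weights are held in VRAM at a time.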
However, this method is currently just a workaround. A more suitable approach would be for us to provide a complete VRAM management layer in the future. This part has already been added to the todo list.
Works perfectly for what I need to do, thanks for the help and also for writing this library. I spent months trying every piece of commercial software with auto-tagging, but they were all too generic, while this does exactly what I wanted.