Warming up model
Hello,
After the 0.9 update, there is a new "Warmup Model" step at startup, and it causes the model to take roughly double the VRAM. I first thought warmup used the input and output token lengths to reserve the maximum VRAM up front, but that doesn't seem to be the case: as questions are asked, VRAM usage keeps climbing and eventually OOMs, which did not happen before.
I am on 4x A10G (96 GB VRAM total) with a 13B model at fp16, and it works fine on the older build. I also tried rebuilding the image locally with the warmup code removed; that does get rid of the doubled VRAM, but the server then errors out, presumably because warmup is required by the refactored code.
Old build = 32 GB VRAM
New build = ~64 GB VRAM
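For context, here is a rough back-of-the-envelope sketch of where numbers on that order could come from. The model dimensions below are assumptions about a generic 13B decoder model (not values read from this deployment): 13B parameters at fp16 are about 26 GB of weights, and a warmup pass that pre-allocates the KV cache for a large batch token budget can add a comparable amount on top.

```python
# Rough memory estimate for a 13B decoder model served at fp16.
# All model dimensions below are illustrative assumptions, not values
# taken from the actual deployment.

BYTES_FP16 = 2

# Assumed architecture for a typical 13B model:
n_layers = 40
n_heads = 40
head_dim = 128
hidden = n_heads * head_dim          # 5120
n_params = 13e9

weights_gb = n_params * BYTES_FP16 / 1e9
print(f"weights (fp16): ~{weights_gb:.0f} GB")   # ~26 GB across all GPUs

def kv_cache_gb(max_batch_total_tokens: int) -> float:
    """KV cache = 2 (K and V) * layers * hidden * bytes, per cached token."""
    per_token = 2 * n_layers * hidden * BYTES_FP16
    return max_batch_total_tokens * per_token / 1e9

# If warmup pre-allocates the cache for a large token budget, the extra
# allocation is on the same order as the weights themselves:
for budget in (16_000, 32_000):
    print(f"KV cache for {budget} tokens: ~{kv_cache_gb(budget):.0f} GB")
```

Under these assumptions the cache for a ~32k-token budget is roughly as large as the weights, which would explain usage going from ~32 GB to ~64 GB once warmup reserves it up front instead of growing it lazily.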
Have you read the stack trace? In 0.8 your deployment would have OOMed at high throughput with your current settings. 0.9 tells you this from the beginning and asks you to decrease your batching parameters.
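For anyone else hitting the same warning, the same arithmetic shows how to size the batching limits so the warmup allocation fits. This is only a sketch under the assumed 13B numbers above; the actual launcher flag that controls the batch token budget depends on the server you are running.

```python
# Sketch: pick a batch token budget that fits the remaining VRAM.
# The weight size and per-token KV cost reuse the assumed 13B/fp16
# numbers above; the safety margin is an illustrative guess.

total_vram_gb = 96.0        # 4 x A10G (24 GB each)
weights_gb = 26.0           # 13B at fp16
safety_margin_gb = 10.0     # activations, fragmentation, CUDA context

per_token_gb = 2 * 40 * 5120 * 2 / 1e9   # K + V per cached token, fp16

free_gb = total_vram_gb - weights_gb - safety_margin_gb
max_budget = int(free_gb / per_token_gb)
print(f"batch token budget that fits: ~{max_budget} tokens")
```

If the warmup check reports an OOM at your current settings, lowering the batch token budget toward a value like the one computed here is the kind of decrease the reply above is suggesting.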
This was fixed! Not sure what happened, but it's working well now.