Warming up model
Hello,
After the 0.9 update, there is a new "Warmup Model" step at startup, and it causes the model to take roughly double the VRAM. I first thought warmup used the input and output token lengths to reserve the maximum VRAM up front, but that doesn't seem to be the case: as questions are asked, VRAM usage keeps climbing and eventually OOMs, which did not happen before.
I am on 4x A10G (96 GB VRAM total) with a 13B model at fp16, and it works fine on the older build. I also tried rebuilding the image locally with the warmup code removed; that does get rid of the doubled VRAM, but the server then errors out, presumably because warmup is required by the refactored code.
Old build = 32 GB VRAM
New build = ~64 GB VRAM
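For context, here is a rough back-of-the-envelope sketch of where numbers on that order could come from. The model dimensions below are assumptions about a generic 13B decoder model (not values read from this deployment): 13B parameters at fp16 are about 26 GB of weights, and a warmup pass that pre-allocates the KV cache for a large batch token budget can add a comparable amount on top.

```python
# Rough memory estimate for a 13B decoder model served at fp16.
# All model dimensions below are illustrative assumptions, not values
# taken from the actual deployment.

BYTES_FP16 = 2

# Assumed architecture for a typical 13B model:
n_layers = 40
n_heads = 40
head_dim = 128
hidden = n_heads * head_dim          # 5120
n_params = 13e9

weights_gb = n_params * BYTES_FP16 / 1e9
print(f"weights (fp16): ~{weights_gb:.0f} GB")   # ~26 GB across all GPUs

def kv_cache_gb(max_batch_total_tokens: int) -> float:
    """KV cache = 2 (K and V) * layers * hidden * bytes, per cached token."""
    per_token = 2 * n_layers * hidden * BYTES_FP16
    return max_batch_total_tokens * per_token / 1e9

# If warmup pre-allocates the cache for a large token budget, the extra
# allocation is on the same order as the weights themselves:
for budget in (16_000, 32_000):
    print(f"KV cache for {budget} tokens: ~{kv_cache_gb(budget):.0f} GB")
```

Under these assumptions the cache for a ~32k-token budget is roughly as large as the weights, which would explain usage going from ~32 GB to ~64 GB once warmup reserves it up front instead of growing it lazily.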
Have you read the stack trace? In 0.8 your deployment would have OOMed at high throughput with your current settings. 0.9 tells you this from the beginning and asks you to decrease your batching parameters.
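For anyone else hitting the same warning, the same arithmetic shows how to size the batching limits so the warmup allocation fits. This is only a sketch under the assumed 13B numbers above; the actual launcher flag that controls the batch token budget depends on the server you are running.

```python
# Sketch: pick a batch token budget that fits the remaining VRAM.
# The weight size and per-token KV cost reuse the assumed 13B/fp16
# numbers above; the safety margin is an illustrative guess.

total_vram_gb = 96.0        # 4 x A10G (24 GB each)
weights_gb = 26.0           # 13B at fp16
safety_margin_gb = 10.0     # activations, fragmentation, CUDA context

per_token_gb = 2 * 40 * 5120 * 2 / 1e9   # K + V per cached token, fp16

free_gb = total_vram_gb - weights_gb - safety_margin_gb
max_budget = int(free_gb / per_token_gb)
print(f"batch token budget that fits: ~{max_budget} tokens")
```

If the warmup check reports an OOM at your current settings, lowering the batch token budget toward a value like the one computed here is the kind of decrease the reply above is suggesting.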
This was fixed! Not sure what happened, but it's working well now.