
Warming up model

Ichigo3766 opened this issue 2 years ago

Hello,

After the 0.9 update, there is a new "Warmup Model" step at startup, and it is causing the model to take double the VRAM. I first thought it uses the input and output token lengths to determine the maximum VRAM usage, but that doesn't seem to be the case: as questions are asked, VRAM keeps climbing until it eventually OOMs, which did not happen before.

I am on 4x A10G (96 GB VRAM total) running a 13B model at fp16. It works fine on the older build. I also tried building the image locally with the warmup code removed; that does get rid of the doubled VRAM, but it errors out, as I believe warmup is needed by the refactored code.

Old build: 32 GB VRAM. New build: ~64 GB VRAM.
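For context, a warmup pass of this kind can be sketched as below. This is a minimal illustration, not TGI's actual implementation: the idea is to run one forward pass at the maximum configured batch shape so that peak memory is allocated, and any OOM surfaces, at startup rather than under production load. The `warmup` function and the tiny stand-in model are hypothetical.

```python
import torch
import torch.nn as nn

def warmup(model, max_batch_size, max_seq_len, vocab_size, device="cpu"):
    """Run one dummy forward pass at the maximum configured shape so that
    peak memory is allocated up front (hypothetical sketch, not TGI code)."""
    # Dummy token batch at the configured maximum shape.
    dummy = torch.randint(0, vocab_size,
                          (max_batch_size, max_seq_len), device=device)
    with torch.no_grad():
        out = model(dummy)
    return out.shape

# Tiny stand-in model, just for demonstration on CPU.
model = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 100))
print(warmup(model, max_batch_size=4, max_seq_len=8, vocab_size=100))
```

On a real deployment the allocation from this pass stays resident (e.g. in the KV cache), which is why memory usage measured after startup is higher than on builds without warmup.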

Ichigo3766 avatar Jul 04 '23 06:07 Ichigo3766

Have you read the stack trace? In 0.8 your deployment would have OOMed at high throughput with your current settings. 0.9 tells you this from the beginning and asks you to decrease your batching parameters.
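The batching parameters in question are set on the launcher; a hedged example follows, with flag names taken from the `text-generation-launcher` CLI and values that are purely illustrative and must be tuned for your hardware:

```shell
# Illustrative values only; lower these until warmup fits in VRAM.
text-generation-launcher \
    --model-id <model> \
    --num-shard 4 \
    --max-input-length 1024 \
    --max-total-tokens 2048 \
    --max-batch-total-tokens 16384
```

Lowering `--max-batch-total-tokens` directly shrinks the batch shape that warmup allocates for, at the cost of lower peak throughput.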

OlivierDehaene avatar Jul 04 '23 07:07 OlivierDehaene

This was fixed! Not sure what happened, but it's working well now.

Ichigo3766 avatar Jul 13 '23 17:07 Ichigo3766