[Bug] Llava 1.6 34b CUDA OOM when running API server
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
Possibly related to https://github.com/InternLM/lmdeploy/issues/1334
Describe the bug
This is using lmdeploy:latest Docker image.
After running the API server for a few minutes (about 3) under a small number of concurrent requests, I observe GPU memory usage growing quickly and possibly leaking. This happens with Llava 1.6-34b; it could potentially also happen with smaller models over a long enough time frame.
Reproduction
Run the API server with Llava 1.6-34b, and submit requests.
Memory usage starts at ~77 GB, spikes rapidly to 79 GB, and goes OOM after passing 80 GB. After ~3 minutes the server dies with a CUDA OOM.
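For reference, this is roughly how I exercise the server (a minimal sketch; the server launch command, port, model name, image URL, and concurrency level are from my setup and may differ from yours):

```python
# Minimal load sketch against the OpenAI-compatible endpoint started with:
#   lmdeploy serve api_server liuhaotian/llava-v1.6-34b --server-port 23333
# Model name, port, image URL, and concurrency level are assumptions from my setup.
import requests
from concurrent.futures import ThreadPoolExecutor

API_URL = "http://localhost:23333/v1/chat/completions"
IMAGE_URL = "https://example.com/some_image.jpg"  # replace with any reachable image

def one_request(i: int) -> int:
    payload = {
        "model": "liuhaotian/llava-v1.6-34b",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
            ],
        }],
        "max_tokens": 256,
    }
    return requests.post(API_URL, json=payload, timeout=300).status_code

# A handful of concurrent requests is enough to trigger the memory growth for me.
with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(one_request, range(32))))
```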
Which version of lmdeploy are you using? https://github.com/InternLM/lmdeploy/issues/1334 is a CPU memory leak, and it was fixed in version 0.4.0.
The vision model has a default batch_size of 16, which uses more GPU memory at inference time than is allocated at startup. You could change this value to 1.
We will make this parameter configurable later.
@irexyc limiting the batch size to 3 appears to alleviate this issue, so your observation is correct. It would be a good idea to make this a configurable parameter.
For now, the vision model is load-balanced across multiple GPUs and its default batch size is set to 1. You can use VisionConfig to change the batch size of the vision model.
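For example, something along these lines with the Python pipeline API (a sketch; it assumes a recent lmdeploy where VisionConfig exposes max_batch_size, and the model path, tp value, and image URL are placeholders, so check against your installed version):

```python
# Sketch: cap the vision model's batch size through VisionConfig.
# Assumes a recent lmdeploy (>= 0.4.x) where VisionConfig exposes max_batch_size;
# the model path, tp value, and image URL are placeholders.
from lmdeploy import pipeline, TurbomindEngineConfig, VisionConfig
from lmdeploy.vl import load_image

pipe = pipeline(
    "liuhaotian/llava-v1.6-34b",
    backend_config=TurbomindEngineConfig(tp=2),
    vision_config=VisionConfig(max_batch_size=1),  # limit vision batch to reduce peak GPU memory
)

image = load_image("https://example.com/some_image.jpg")  # replace with a reachable image
print(pipe(("Describe this image.", image)).text)
```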