[Bug] Llava 1.6 34b CUDA OOM when running API server
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
Possibly related to https://github.com/InternLM/lmdeploy/issues/1334
Describe the bug
This is using lmdeploy:latest Docker image.
After running the API server for a few minutes (about 3) under a small number of concurrent requests, I observe GPU memory usage growing quickly and possibly leaking. This happens with Llava 1.6-34b; it could potentially also happen with smaller models over a long enough time frame.
Reproduction
Run the API server with Llava 1.6-34b, and submit requests.
Memory usage starts at ~77 GB, spikes rapidly to 79 GB, and goes OOM after passing 80 GB. After ~3 minutes the server dies with a CUDA OOM.
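For reference, this is roughly how I exercise the server (a minimal sketch; the server launch command, port, model name, image URL, and concurrency level are from my setup and may differ from yours):

```python
# Minimal load sketch against the OpenAI-compatible endpoint started with:
#   lmdeploy serve api_server liuhaotian/llava-v1.6-34b --server-port 23333
# Model name, port, image URL, and concurrency level are assumptions from my setup.
import requests
from concurrent.futures import ThreadPoolExecutor

API_URL = "http://localhost:23333/v1/chat/completions"
IMAGE_URL = "https://example.com/some_image.jpg"  # replace with any reachable image

def one_request(i: int) -> int:
    payload = {
        "model": "liuhaotian/llava-v1.6-34b",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
            ],
        }],
        "max_tokens": 256,
    }
    return requests.post(API_URL, json=payload, timeout=300).status_code

# A handful of concurrent requests is enough to trigger the memory growth for me.
with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(one_request, range(32))))
```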
Which version of lmdeploy are you using? https://github.com/InternLM/lmdeploy/issues/1334 is a CPU memory leak, and it was fixed in version 0.4.0.
The vision model has a default batch_size of 16, which uses more GPU memory at inference time than is allocated at startup. You could change this value to 1.
We will make this parameter configurable later.
@irexyc limiting the batch size to 3 appears to alleviate this issue, so your observation is correct. It would be a good idea to make this a configurable parameter.
For now, the vision model is load-balanced across multiple GPUs and its default batch size is set to 1. You can use VisionConfig to change the batch size of the vision model.
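For example, something along these lines with the Python pipeline API (a sketch; it assumes a recent lmdeploy where VisionConfig exposes max_batch_size, and the model path, tp value, and image URL are placeholders, so check against your installed version):

```python
# Sketch: cap the vision model's batch size through VisionConfig.
# Assumes a recent lmdeploy (>= 0.4.x) where VisionConfig exposes max_batch_size;
# the model path, tp value, and image URL are placeholders.
from lmdeploy import pipeline, TurbomindEngineConfig, VisionConfig
from lmdeploy.vl import load_image

pipe = pipeline(
    "liuhaotian/llava-v1.6-34b",
    backend_config=TurbomindEngineConfig(tp=2),
    vision_config=VisionConfig(max_batch_size=1),  # limit vision batch to reduce peak GPU memory
)

image = load_image("https://example.com/some_image.jpg")  # replace with a reachable image
print(pipe(("Describe this image.", image)).text)
```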