Infinity embed crashes too easily
System Info
0.0.74
Information
- [X] Docker + cli
- [ ] pip + cli
- [ ] pip + usage of Python interface
Tasks
- [X] An officially supported CLI command
- [ ] My own modifications
Reproduction
docker with command: >
v2
--model-id Alibaba-NLP/gte-large-en-v1.5
--batch-size 8
--url-prefix "/v1"
--port 80
Initially, the GPU memory usage starts at just a few gigabytes. However, after running hundreds of calls, the memory consumption gradually increases to over 40GB, eventually resulting in an OOM (Out of Memory) error.
The API should be robust enough to handle heavy usage without crashing or becoming unresponsive, as such issues hinder its usability and reliability. A potential solution could involve implementing a restriction, such as automatically truncating documents that exceed a specified size.
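For anyone needing a stopgap before any server-side limit exists, here is a minimal client-side sketch, assuming the OpenAI-compatible /v1/embeddings route exposed by the deployment above; the 2,000-character cap and the truncate_docs helper are illustrative, not an Infinity feature:

```python
# Sketch: truncate oversized documents before sending them to the server,
# so a single huge document cannot blow up a server-side batch.
# Assumes the /v1 url-prefix and port 80 from the docker command above,
# running on localhost (adjust ENDPOINT for your setup).
import requests

MAX_CHARS = 2_000  # illustrative cap; tune to your model's context window
ENDPOINT = "http://localhost:80/v1/embeddings"

def truncate_docs(docs: list[str], max_chars: int = MAX_CHARS) -> list[str]:
    """Hard-truncate each document to a character budget (hypothetical helper)."""
    return [d[:max_chars] for d in docs]

def embed(docs: list[str]) -> list[list[float]]:
    resp = requests.post(
        ENDPOINT,
        json={"model": "Alibaba-NLP/gte-large-en-v1.5", "input": truncate_docs(docs)},
        timeout=60,
    )
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]
```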
Same problem here.
Note:
- likely related to `use_cache=True`, which is a setting for causal LMs - it potentially retains the KV cache from previous generations.
40GB does not make sense. However, 8192 tokens x 8 will cause a decent utilization, which might be what you are seeing. The restriction would be to split batches with a max_num_tokens parameter. As this happens before tokenization, it would be a max_chars_per_batch parameter.
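For reference, a sketch of what such a pre-tokenization split could look like; max_chars_per_batch is the proposed parameter from the note above and does not exist in Infinity today:

```python
# Sketch of the proposed behaviour: split an incoming batch so that the total
# character count per sub-batch stays under a budget. Characters are a cheap
# proxy for tokens because this would run before tokenization.
from typing import Iterable, Iterator

def split_by_char_budget(
    texts: Iterable[str], max_chars_per_batch: int = 16_000
) -> Iterator[list[str]]:
    batch: list[str] = []
    chars = 0
    for text in texts:
        # A single text longer than the budget still gets its own batch.
        if batch and chars + len(text) > max_chars_per_batch:
            yield batch
            batch, chars = [], 0
        batch.append(text)
        chars += len(text)
    if batch:
        yield batch

# Example: eight 8k-character docs end up in four small sub-batches
# instead of one huge one.
sub_batches = list(split_by_char_budget(["x" * 8_000] * 8, max_chars_per_batch=16_000))
assert all(sum(map(len, b)) <= 16_000 for b in sub_batches)
```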
Same problem using bge-m3 for embedding and rerank: OOM. How to deal with this? :(
I reduced the batch and it stopped happening, but I don't think this is a good solution. I'm still looking for a better solution.
M3 should not have this issue at all. Can you send the logs here?
@michaelfeil Thanks for the reply. Could you share the options for max_chars_per_batch in the CLI? I found no such option in https://github.com/michaelfeil/infinity/blob/main/libs/infinity_emb/infinity_emb/cli.py#L153
I have the same problem with jina-embeddings-v3. I never send batches of more than 8-10 strings, and never exceed 8k tokens per request. However, when the model is hit with lots of requests, the memory grows to over 30GB within a few minutes, and eventually the process is OOM-killed. I already tried reducing the batch size from 32 to 16, but that didn't change anything. I also tried switching the engine to "optimum", as I saw an onnx folder in the model's repo, but it didn't work. The model runs on a shared H100 NVL card.
After some further tests, it looks like setting batch-size = 4 prevents the memory from growing indefinitely. This, however, comes with a performance degradation of roughly 50% in requests processed per second.
This aligns with my usage experience. Currently, I've set the batch size between 6 and 8 to avoid memory issues, but I'm still unsure of the underlying cause of this problem.
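Some hedged, back-of-the-envelope arithmetic on why batch size times sequence length dominates here, assuming an XLM-RoBERTa-large-style backbone (24 layers, 16 heads, 8192-token context), which is roughly what jina-embeddings-v3 and bge-m3 use. Real kernels such as FlashAttention/SDPA avoid materializing the full score matrix, so this is an upper bound for naive attention, not what Infinity actually allocates:

```python
# Rough activation-memory estimate if attention scores were materialized naively.
# Shape of the score tensor per layer: [batch, heads, seq_len, seq_len] in fp32.
def naive_attention_scores_gib(batch: int, heads: int, seq_len: int, bytes_per_el: int = 4) -> float:
    return batch * heads * seq_len ** 2 * bytes_per_el / 2 ** 30

# batch 8 at the full 8192-token context vs. batch 4:
print(naive_attention_scores_gib(8, 16, 8192))   # ~32 GiB per layer
print(naive_attention_scores_gib(4, 16, 8192))   # ~16 GiB per layer
# The quadratic seq_len term is why a handful of very long inputs hurts far
# more than many short ones, and why shrinking the batch size "fixes" the OOM.
```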
I am running into OOM with the same model https://huggingface.co/Xenova/bge-m3. I have to reduce the batch-size to 8 to start up the engine, but performance is bad. Log as below:
2025-03-23T22:54:01.871864386+08:00 INFO 2025-03-23 14:54:01,866 infinity_emb INFO: Getting select_model.py:97
2025-03-23T22:54:01.871908268+08:00 timings for batch_size=8 and avg tokens per
2025-03-23T22:54:01.871913261+08:00 sentence=4
2025-03-23T22:54:01.871915431+08:00 0.48 ms tokenization
2025-03-23T22:54:01.871917425+08:00 180.15 ms inference
2025-03-23T22:54:01.871919501+08:00 1.25 ms post-processing
2025-03-23T22:54:01.871922056+08:00 181.88 ms total
2025-03-23T22:54:01.871923807+08:00 embeddings/sec: 43.98
2025-03-23T22:54:33.664568882+08:00 INFO 2025-03-23 14:54:33,575 infinity_emb INFO: Getting select_model.py:103
2025-03-23T22:54:33.664597042+08:00 timings for batch_size=8 and avg tokens per
2025-03-23T22:54:33.664600131+08:00 sentence=515
2025-03-23T22:54:33.664602713+08:00 3.08 ms tokenization
2025-03-23T22:54:33.664605161+08:00 12401.69 ms inference
2025-03-23T22:54:33.664607358+08:00 0.19 ms post-processing
2025-03-23T22:54:33.664609560+08:00 12404.96 ms total
2025-03-23T22:54:33.664611654+08:00 embeddings/sec: 0.64
If I don't set batch-size=8, it crashes every time after the first test with sentence=4. I have allocated 16GB to the container. Something must be wrong near select_model.py:103.
BTW, I tried to find a way to specify which ONNX file the engine should load, but failed. It seems Infinity doesn't support this option. Will Infinity load all of the ONNX files into memory? If so, maybe that's the cause of the OOM?
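Not an answer on Infinity's loading behaviour, but a quick way to see which ONNX variants exist in the Xenova/bge-m3 repo and how large they are, using the standard huggingface_hub API (this only inspects the repo; it does not show what Infinity loads):

```python
# List the ONNX files and their sizes in the Xenova/bge-m3 repo.
from huggingface_hub import HfApi

info = HfApi().model_info("Xenova/bge-m3", files_metadata=True)
for f in info.siblings:
    if f.rfilename.endswith(".onnx"):
        size_mib = (f.size or 0) / 2**20
        print(f"{f.rfilename}: {size_mib:.0f} MiB")
```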