Infinity embed crashes too easily
System Info
0.0.74
Information
- [X] Docker + cli
- [ ] pip + cli
- [ ] pip + usage of Python interface
Tasks
- [X] An officially supported CLI command
- [ ] My own modifications
Reproduction
docker with command: >
v2
--model-id Alibaba-NLP/gte-large-en-v1.5
--batch-size 8
--url-prefix "/v1"
--port 80
Initially, the GPU memory usage starts at just a few gigabytes. However, after running hundreds of calls, the memory consumption gradually increases to over 40GB, eventually resulting in an OOM (Out of Memory) error.
The API should be robust enough to handle heavy usage without crashing or becoming unresponsive, as such issues hinder its usability and reliability. A potential solution could involve implementing a restriction, such as automatically truncating documents that exceed a specified size.
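For anyone needing a stopgap before any server-side limit exists, here is a minimal client-side sketch, assuming the OpenAI-compatible /v1/embeddings route exposed by the deployment above; the 2,000-character cap and the truncate_docs helper are illustrative, not an Infinity feature:

```python
# Sketch: truncate oversized documents before sending them to the server,
# so a single huge document cannot blow up a server-side batch.
# Assumes the /v1 url-prefix and port 80 from the docker command above,
# running on localhost (adjust ENDPOINT for your setup).
import requests

MAX_CHARS = 2_000  # illustrative cap; tune to your model's context window
ENDPOINT = "http://localhost:80/v1/embeddings"

def truncate_docs(docs: list[str], max_chars: int = MAX_CHARS) -> list[str]:
    """Hard-truncate each document to a character budget (hypothetical helper)."""
    return [d[:max_chars] for d in docs]

def embed(docs: list[str]) -> list[list[float]]:
    resp = requests.post(
        ENDPOINT,
        json={"model": "Alibaba-NLP/gte-large-en-v1.5", "input": truncate_docs(docs)},
        timeout=60,
    )
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]
```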
Same problem here.
Note:
- likely related to `use_cache=True`, which is a setting for causal LMs - it potentially retains the KV cache from previous generations.
40GB does not make sense. However, 8192 tokens x 8 will cause a decent utilization, which might be what you are seeing. The restriction would be to split batches with a max_num_tokens parameter. As this happens before tokenization, it would be a max_chars_per_batch parameter.
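For reference, a sketch of what such a pre-tokenization split could look like; max_chars_per_batch is the proposed parameter from the note above and does not exist in Infinity today:

```python
# Sketch of the proposed behaviour: split an incoming batch so that the total
# character count per sub-batch stays under a budget. Characters are a cheap
# proxy for tokens because this would run before tokenization.
from typing import Iterable, Iterator

def split_by_char_budget(
    texts: Iterable[str], max_chars_per_batch: int = 16_000
) -> Iterator[list[str]]:
    batch: list[str] = []
    chars = 0
    for text in texts:
        # A single text longer than the budget still gets its own batch.
        if batch and chars + len(text) > max_chars_per_batch:
            yield batch
            batch, chars = [], 0
        batch.append(text)
        chars += len(text)
    if batch:
        yield batch

# Example: eight 8k-character docs end up in four small sub-batches
# instead of one huge one.
sub_batches = list(split_by_char_budget(["x" * 8_000] * 8, max_chars_per_batch=16_000))
assert all(sum(map(len, b)) <= 16_000 for b in sub_batches)
```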
Same problem using bge-m3 for embedding and rerank: OOM. How to deal with this? :(
I reduced the batch and it stopped happening, but I don't think this is a good solution. I'm still looking for a better solution.
M3 should not have this issue at all. Can you send the logs here?
@michaelfeil Thanks for the reply. Could you share the options for max_chars_per_batch in the CLI? I found no such option in https://github.com/michaelfeil/infinity/blob/main/libs/infinity_emb/infinity_emb/cli.py#L153
I have the same problem with jina-embeddings-v3. I never send batches of more than 8-10 strings, and never exceed 8k tokens per request. However, when the model is hit with lots of requests, the memory grows to over 30GB within a few minutes, and eventually the process is OOM-killed. I already tried reducing the batch size from 32 to 16, but that didn't change anything. I also tried switching the engine to "optimum", as I saw an onnx folder in the model's repo, but it didn't work. The model runs on a shared H100 NVL card.
After some further tests, it looks like setting batch-size = 4 prevents the memory from growing indefinitely. This, however, comes with a performance degradation of roughly 50% in requests processed per second.
This aligns with my usage experience. Currently, I've set the batch size between 6 and 8 to avoid memory issues, but I'm still unsure of the underlying cause of this problem.
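Some hedged, back-of-the-envelope arithmetic on why batch size times sequence length dominates here, assuming an XLM-RoBERTa-large-style backbone (24 layers, 16 heads, 8192-token context), which is roughly what jina-embeddings-v3 and bge-m3 use. Real kernels such as FlashAttention/SDPA avoid materializing the full score matrix, so this is an upper bound for naive attention, not what Infinity actually allocates:

```python
# Rough activation-memory estimate if attention scores were materialized naively.
# Shape of the score tensor per layer: [batch, heads, seq_len, seq_len] in fp32.
def naive_attention_scores_gib(batch: int, heads: int, seq_len: int, bytes_per_el: int = 4) -> float:
    return batch * heads * seq_len ** 2 * bytes_per_el / 2 ** 30

# batch 8 at the full 8192-token context vs. batch 4:
print(naive_attention_scores_gib(8, 16, 8192))   # ~32 GiB per layer
print(naive_attention_scores_gib(4, 16, 8192))   # ~16 GiB per layer
# The quadratic seq_len term is why a handful of very long inputs hurts far
# more than many short ones, and why shrinking the batch size "fixes" the OOM.
```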
I am running into OOM with the same model https://huggingface.co/Xenova/bge-m3. I have to reduce the batch-size to 8 to start up the engine, but performance is bad. Log as below:
2025-03-23T22:54:01.871864386+08:00 INFO 2025-03-23 14:54:01,866 infinity_emb INFO: Getting select_model.py:97
2025-03-23T22:54:01.871908268+08:00 timings for batch_size=8 and avg tokens per
2025-03-23T22:54:01.871913261+08:00 sentence=4
2025-03-23T22:54:01.871915431+08:00 0.48 ms tokenization
2025-03-23T22:54:01.871917425+08:00 180.15 ms inference
2025-03-23T22:54:01.871919501+08:00 1.25 ms post-processing
2025-03-23T22:54:01.871922056+08:00 181.88 ms total
2025-03-23T22:54:01.871923807+08:00 embeddings/sec: 43.98
2025-03-23T22:54:33.664568882+08:00 INFO 2025-03-23 14:54:33,575 infinity_emb INFO: Getting select_model.py:103
2025-03-23T22:54:33.664597042+08:00 timings for batch_size=8 and avg tokens per
2025-03-23T22:54:33.664600131+08:00 sentence=515
2025-03-23T22:54:33.664602713+08:00 3.08 ms tokenization
2025-03-23T22:54:33.664605161+08:00 12401.69 ms inference
2025-03-23T22:54:33.664607358+08:00 0.19 ms post-processing
2025-03-23T22:54:33.664609560+08:00 12404.96 ms total
2025-03-23T22:54:33.664611654+08:00 embeddings/sec: 0.64
If I don't set batch-size=8, it crashes every time after the first test with sentence=4. I have allocated 16GB to the container. Something must be wrong near select_model.py:103.
BTW, I tried to find a way to specify which ONNX file the engine should load, but failed. It seems Infinity doesn't support this option. Will Infinity load all of the ONNX files into memory? If so, maybe that's the cause of the OOM?
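Not an answer on Infinity's loading behaviour, but a quick way to see which ONNX variants exist in the Xenova/bge-m3 repo and how large they are, using the standard huggingface_hub API (this only inspects the repo; it does not show what Infinity loads):

```python
# List the ONNX files and their sizes in the Xenova/bge-m3 repo.
from huggingface_hub import HfApi

info = HfApi().model_info("Xenova/bge-m3", files_metadata=True)
for f in info.siblings:
    if f.rfilename.endswith(".onnx"):
        size_mib = (f.size or 0) / 2**20
        print(f"{f.rfilename}: {size_mib:.0f} MiB")
```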