KUNPENG GUO
Does anyone have updates on this PR? It would be great if it were merged into the main codebase.
I got this error, which might be related:
```
FlashAttention backward for head dim > 64 requires A100 or H100 GPUs as the implementation needs a large amount of shared memory.
```
mentioned in #442
Hey @DarkLight1337, can we add an upper-limit argument to configure how much VRAM the server process is allowed to grab, to avoid surprises in deployments? Or...
Hey @DarkLight1337,
> You can set `--gpu-memory-utilization` to cap the GPU memory usage

1) That won't work for the encoder-based embedder.
2) It is in fact considering only the...
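For context, `--gpu-memory-utilization` expresses the cap as a fraction of total VRAM rather than an absolute limit. A minimal sketch of the implied byte budget, assuming a hypothetical 80 GB card (the card size and helper name are illustrative, not from vLLM):

```python
def vram_budget_bytes(total_vram_bytes: int, gpu_memory_utilization: float) -> int:
    """Return the byte budget implied by a fractional VRAM utilization cap."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return int(total_vram_bytes * gpu_memory_utilization)

# Hypothetical 80 GB card capped at 90% utilization:
total = 80 * 1024**3          # 85899345920 bytes
print(vram_budget_bytes(total, 0.9))  # -> 77309411328
```

An absolute cap in bytes, as asked for above, would need a different argument; this only shows how the fractional flag maps to memory.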
Update: currently, if one deploys [bge-model](https://huggingface.co/BAAI/bge-large-en-v1.5), the memory grows over time... it breaks the server with OOM from time to time, so we have to restart it.