Write a custom flash-attention function for the DeBERTa model.
Model description
I used michaelf34/infinity:0.0.55 to deploy the mixed_bread_large reranker.
The container is up and I can query the model with Python requests, but it is slow: 100 requests take 8 seconds, compared to 0.8 seconds for 100 requests with TEI serving BGE, even though BGE-large and mixed_bread_large are the same size (335M parameters).
What is the best way to optimize the deployment and inference?
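For reference, a minimal latency benchmark sketch against the running container (stdlib only, so numbers are comparable to the TEI test). The endpoint URL and payload shape are assumptions; adjust them to your infinity deployment:

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical endpoint and payload: adjust to your infinity deployment.
URL = "http://localhost:7997/rerank"
PAYLOAD = json.dumps({
    "model": "mixed_bread_large",
    "query": "what is flash attention?",
    "documents": ["Flash attention is an IO-aware attention kernel."],
}).encode()


def one_request() -> float:
    """Send a single rerank request and return its latency in seconds."""
    req = urllib.request.Request(
        URL, data=PAYLOAD, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start


def summarize(latencies: list[float]) -> dict:
    """Aggregate per-request latencies into count, total, and mean."""
    return {
        "n": len(latencies),
        "total_s": sum(latencies),
        "mean_s": sum(latencies) / len(latencies),
    }


def benchmark(n_requests: int = 100, workers: int = 16) -> dict:
    # Firing requests concurrently lets the server's dynamic batching
    # kick in, which usually matters more than per-request tuning.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(n_requests)))
    return summarize(latencies)
```

Running `benchmark()` with increasing `workers` shows whether the bottleneck is the model itself or the lack of request-level batching.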
Open source status
- [X] The model implementation is available on transformers
- [X] The model weights are available on huggingface-hub
- [x] I verified that the model is currently not running in the latest version
```
pip install infinity_emb[all] --upgrade
```
Provide useful links for the implementation
No response
BGE-large uses BERT (infinity DOES overwrite the modeling code with a flash-attention replacement). MixedBread-large uses DeBERTa (infinity does NOT overwrite the modeling code with a flash-attention replacement).
DeBERTa-V2 uses significantly more FLOPs (disentangled attention), and its implementation is also less optimized.
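A rough back-of-envelope sketch of why disentangled attention costs more: vanilla attention computes one score matmul (content-to-content), while DeBERTa's disentangled attention adds content-to-position and position-to-content terms, i.e. roughly three score matmuls. This ignores softmax, projections, and the relative-position gather ops, so the 3x factor applies to the score computation only:

```python
def attn_score_flops(seq_len: int, head_dim: int, n_heads: int, n_terms: int) -> int:
    """Approximate FLOPs for the attention score matmuls only.

    Each score term is a (seq_len x head_dim) @ (head_dim x seq_len)
    matmul per head, i.e. 2 * seq_len^2 * head_dim FLOPs.
    n_terms = 1 for vanilla attention (content-to-content only);
    n_terms = 3 for DeBERTa's disentangled attention
    (content-to-content + content-to-position + position-to-content).
    """
    return n_terms * n_heads * 2 * seq_len * seq_len * head_dim


# Hypothetical large-model layer config: 16 heads, head_dim 64, seq_len 512.
bert_like = attn_score_flops(512, 64, 16, n_terms=1)
deberta = attn_score_flops(512, 64, 16, n_terms=3)
print(deberta / bert_like)  # → 3.0
```

The gather-heavy relative-position lookups are also a poor fit for stock flash-attention kernels, which is why a custom kernel (or at least a fused SDPA path for the content-to-content term) would be needed here, unlike the straightforward BERT replacement.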