Write a custom flash-attention function for the DeBERTa model.
Model description
I used michaelf34/infinity:0.0.55 to deploy the mixed_bread_large reranker.
The container is up and I can query the model with Python requests, but it is slow: 100 requests take 8 seconds, compared to 0.8 seconds for 100 requests with TEI serving BGE, even though BGE-large and mixed_bread_large are the same size (335M parameters).
What is the best way to optimize the deployment and inference?
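For reference, a minimal latency benchmark sketch against the running container (stdlib only, so numbers are comparable to the TEI test). The endpoint URL and payload shape are assumptions; adjust them to your infinity deployment:

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical endpoint and payload: adjust to your infinity deployment.
URL = "http://localhost:7997/rerank"
PAYLOAD = json.dumps({
    "model": "mixed_bread_large",
    "query": "what is flash attention?",
    "documents": ["Flash attention is an IO-aware attention kernel."],
}).encode()


def one_request() -> float:
    """Send a single rerank request and return its latency in seconds."""
    req = urllib.request.Request(
        URL, data=PAYLOAD, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start


def summarize(latencies: list[float]) -> dict:
    """Aggregate per-request latencies into count, total, and mean."""
    return {
        "n": len(latencies),
        "total_s": sum(latencies),
        "mean_s": sum(latencies) / len(latencies),
    }


def benchmark(n_requests: int = 100, workers: int = 16) -> dict:
    # Firing requests concurrently lets the server's dynamic batching
    # kick in, which usually matters more than per-request tuning.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(n_requests)))
    return summarize(latencies)
```

Running `benchmark()` with increasing `workers` shows whether the bottleneck is the model itself or the lack of request-level batching.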
Open source status
- [X] The model implementation is available on transformers
- [X] The model weights are available on huggingface-hub
- [x] I verified that the model is currently not running in the latest version
```
pip install infinity_emb[all] --upgrade
```
Provide useful links for the implementation
No response
BGE-large uses BERT (infinity DOES overwrite the modeling code with a flash-attention replacement). MixedBread-large uses DeBERTa (infinity does NOT overwrite the modeling code with a flash-attention replacement).
DeBERTa-V2 uses significantly more FLOPs (disentangled attention), and its implementation is also less optimized.
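A rough back-of-envelope sketch of why disentangled attention costs more: vanilla attention computes one score matmul (content-to-content), while DeBERTa's disentangled attention adds content-to-position and position-to-content terms, i.e. roughly three score matmuls. This ignores softmax, projections, and the relative-position gather ops, so the 3x factor applies to the score computation only:

```python
def attn_score_flops(seq_len: int, head_dim: int, n_heads: int, n_terms: int) -> int:
    """Approximate FLOPs for the attention score matmuls only.

    Each score term is a (seq_len x head_dim) @ (head_dim x seq_len)
    matmul per head, i.e. 2 * seq_len^2 * head_dim FLOPs.
    n_terms = 1 for vanilla attention (content-to-content only);
    n_terms = 3 for DeBERTa's disentangled attention
    (content-to-content + content-to-position + position-to-content).
    """
    return n_terms * n_heads * 2 * seq_len * seq_len * head_dim


# Hypothetical large-model layer config: 16 heads, head_dim 64, seq_len 512.
bert_like = attn_score_flops(512, 64, 16, n_terms=1)
deberta = attn_score_flops(512, 64, 16, n_terms=3)
print(deberta / bert_like)  # → 3.0
```

The gather-heavy relative-position lookups are also a poor fit for stock flash-attention kernels, which is why a custom kernel (or at least a fused SDPA path for the content-to-content term) would be needed here, unlike the straightforward BERT replacement.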