
Onnx converted model has slower inference

Open · yogitavm opened this issue · 3 comments

I fine-tuned the GLiNER small v2.1 model and created an ONNX version of it using the convert_to_onnx.ipynb example code. When I compared the inference time of the two models, the ONNX version took 50% more time.

This is how I'm loading the model: `model = GLiNER.from_pretrained(model_path, load_onnx_model=True, load_tokenizer=True)`
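
For reference, a minimal timing sketch along these lines might look like the following (the model path, sample text, and label set are placeholders; the loading calls follow the snippet above):

```python
# Minimal sketch comparing PyTorch vs ONNX inference time on one input.
# The model path, sample text, and label set below are placeholders.
import time
from gliner import GLiNER

model_path = "path/to/finetuned-gliner-small-v2.1"
text = "Steve Jobs founded Apple in Cupertino, California."
labels = ["person", "organization", "location"]

def avg_ms(model, n_runs=20):
    model.predict_entities(text, labels)  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(n_runs):
        model.predict_entities(text, labels)
    return (time.perf_counter() - start) / n_runs * 1000

torch_model = GLiNER.from_pretrained(model_path)
onnx_model = GLiNER.from_pretrained(model_path, load_onnx_model=True, load_tokenizer=True)

print(f"PyTorch: {avg_ms(torch_model):.1f} ms/call")
print(f"ONNX:    {avg_ms(onnx_model):.1f} ms/call")
```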

yogitavm · Sep 17 '24 09:09

From my experiments, ONNX models run faster for sequences shorter than 124 words. With longer input sequences, attention becomes the limiting factor and ONNX is not necessarily more efficient than PyTorch. The main purpose of ONNX is to make it easier to convert models between frameworks and to run them in other environments. If you need efficient inference on CPU, I would recommend trying GLiNER.cpp; it is consistently faster than PyTorch and enables up to 2x acceleration.
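
To find where that crossover sits on a given machine, one could sweep the input length with something like the sketch below (the model path and word counts are placeholders, and the repeated-sentence inputs are synthetic):

```python
# Sketch to probe how inference time scales with input length for the
# PyTorch and ONNX variants. Model path and word counts are placeholders.
import time
from gliner import GLiNER

labels = ["person", "organization", "location"]
sentence = "Steve Jobs founded Apple in Cupertino, California. "

torch_model = GLiNER.from_pretrained("path/to/model")
onnx_model = GLiNER.from_pretrained("path/to/model", load_onnx_model=True, load_tokenizer=True)

for n_words in (32, 128, 512):
    # Build a synthetic input of roughly n_words words by repetition.
    text = " ".join((sentence * n_words).split()[:n_words])
    for name, model in (("PyTorch", torch_model), ("ONNX", onnx_model)):
        model.predict_entities(text, labels)  # warm-up
        start = time.perf_counter()
        for _ in range(10):
            model.predict_entities(text, labels)
        avg = (time.perf_counter() - start) / 10 * 1000
        print(f"{n_words:4d} words  {name:7s} {avg:6.1f} ms")
```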

Ingvarstep · Sep 21 '24 19:09

Thanks @Ingvarstep. I was going through GLiNER.cpp and could not find license details. Is it Apache 2.0 or MIT licensed?

yogitavm · Oct 01 '24 12:10

It's Apache 2.0

Ingvarstep · Oct 01 '24 16:10