convert_to_onnx.py script performs poorly with urchade/gliner_multi-v2.1 base model
Hi, I've tried to quantize the base model urchade/gliner_multi-v2.1 using the convert_to_onnx.py script.
Unfortunately, the model prediction quality degraded significantly after quantization.
This does not happen for urchade/gliner_small-v2.1.
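For context, the quantization step I'm referring to is the standard dynamic-quantization pass applied after the ONNX export. The snippet below is only a minimal sketch of that step, assuming onnxruntime's `quantize_dynamic` is what runs under the hood; the paths are placeholders and the actual convert_to_onnx.py may differ.

```python
# Minimal sketch of the post-export quantization step (assumed, not verbatim
# from convert_to_onnx.py). Paths are placeholders for the local export folder.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="gliner_multi-v2.1/onnx/model.onnx",             # exported FP32 graph
    model_output="gliner_multi-v2.1/onnx/model_quantized.onnx",  # INT8-weight output
    weight_type=QuantType.QUInt8,                                # dynamic weight-only quantization
)
```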
🧪 Environment:
- Python: 3.10.18
- GLiNER: 0.2.21
- Torch: 2.7.0
✅ Models tested:

| Model Variant | Base Performance | Quantized Performance |
|---|---|---|
| urchade/gliner_small-v2.1 | ✅ good | ✅ decent |
| urchade/gliner_multi-v2.1 | ✅ good | ⚠️ poor |
✅ gliner_small-v2.1 ONNX test
# Load ONNX model
from gliner import GLiNER

model = GLiNER.from_pretrained(model_name, load_onnx_model=True, load_tokenizer=True)

text = """Daniel Evans has a passport number K01234567 and a permanent address
at 500 Pine Street, Seattle. His contact number is +1-206-555-0199."""
labels = ["name", "passport_number", "street_address", "phone_number"]

entities = model.predict_entities(text, labels, threshold=0.4)
for entity in entities:
    print(entity["text"], "=>", entity["label"], entity["score"])
Output:
Daniel Evans => name 0.9867521524429321
K01234567 => passport_number 0.8854160904884338
500 Pine Street => street_address 0.9752442240715027
+1-206-555-0199 => phone_number 0.7612784504890442
⚠️ gliner_small-v2.1 quantized test
# Load quantized ONNX model
model_quant = GLiNER.from_pretrained(
    model_name,
    load_onnx_model=True,
    load_tokenizer=True,
    onnx_model_file="model_quantized.onnx",
)

text = """Daniel Evans has a passport number K01234567 and a permanent address
at 500 Pine Street, Seattle. His contact number is +1-206-555-0199."""
labels = ["name", "passport_number", "street_address", "phone_number"]

entities = model_quant.predict_entities(text, labels, threshold=0.1)
for entity in entities:
    print(entity["text"], "=>", entity["label"], entity["score"])
Output:
Daniel Evans => name 0.9830107688903809
K01234567 => passport_number 0.9768995642662048
500 Pine Street, Seattle => street_address 0.9961038065190253
+1-206-555-0199 => phone_number 0.453078476495643
❌ gliner_multi-v2.1 quantized test
# Load quantized ONNX model
model_quant = GLiNER.from_pretrained(
    model_name,
    load_onnx_model=True,
    load_tokenizer=True,
    onnx_model_file="model_quantized.onnx",
)

text = """Daniel Evans has a passport number K01234567 and a permanent address
at 500 Pine Street, Seattle. His contact number is +1-206-555-0199."""
labels = ["name", "passport_number", "street_address", "phone_number"]

entities = model_quant.predict_entities(text, labels, threshold=0.1)
for entity in entities:
    print(entity["text"], "=>", entity["label"], entity["score"])
Output:
Daniel Evans => name 0.21395745873451233
K01234567 => passport_number 0.15609435737133026
Would appreciate any guidance on how to maintain quality post-quantization for the multi variant.
Hi @mikeg27, post-quantization performance is largely determined by the model’s training regime. Models not trained with quantization-aware training (QAT) often suffer from activation/weight distribution shifts once precision is reduced. Since none of these models were trained with QAT, accuracy after quantization varies with each model’s robustness and capacity. urchade/gliner_small-v2.1 is a smaller model with simpler internal representations, so it is less sensitive to quantization-induced perturbations and therefore degrades less when precision changes.
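If you want to experiment without retraining, one direction worth trying (no guarantee it fixes the multi variant) is to tune onnxruntime's dynamic-quantization options rather than relying on the defaults. The snippet below is only a sketch around `quantize_dynamic`; the paths are placeholders, and exact parameter availability depends on your onnxruntime version.

```python
# Hedged sketch: dynamic-quantization options that sometimes reduce accuracy loss.
# Paths are placeholders; check your onnxruntime version for supported arguments.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="gliner_multi-v2.1/onnx/model.onnx",
    model_output="gliner_multi-v2.1/onnx/model_quantized.onnx",
    weight_type=QuantType.QUInt8,
    per_channel=True,    # per-channel weight scales are usually gentler on larger models
    reduce_range=True,   # reduced weight range; can help when saturation is the issue
    # nodes_to_exclude=[...],  # optionally keep the most sensitive nodes in FP32
)
```

If that is not enough, comparing FP32 and quantized outputs node by node can help identify which operators to exclude from quantization.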