Request for code to quantize and convert ModernBERT Model to ONNX
I am fine-tuning the ModernBERT model for a classification task and now need to quantize and convert it to ONNX. I tried using the Hugging Face Optimum library, but it does not currently support ModernBERT.
I noticed that quantized models are available in ModernBERT's Hugging Face repository. Could you please share the code or steps used to quantize and convert these models to ONNX?
Hi @DeepakSinghRawat 👋 Here's the dev branch of Optimum we used to convert the models to ONNX: https://github.com/huggingface/optimum/pull/2131
You can install it with
pip install --upgrade git+https://github.com/huggingface/optimum.git@add-modernbert-onnx
We're doing the final reviews for it now and it should be usable in the next version of Optimum.
Quantization can be done with this conversion script: https://github.com/huggingface/transformers.js/blob/main/scripts/quantize.py
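For reference, here is a rough sketch of what the export-then-quantize flow can look like with Optimum's Python API, using ORTQuantizer for dynamic INT8 quantization rather than the transformers.js script (the model path and output directories below are placeholders, not from this thread):

from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the fine-tuned checkpoint to ONNX (placeholder path)
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "path/to/finetuned-modernbert", export=True
)
ort_model.save_pretrained("modernbert-onnx")

# Dynamic INT8 quantization of the exported graph
quantizer = ORTQuantizer.from_pretrained("modernbert-onnx")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="modernbert-onnx-quantized", quantization_config=qconfig)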
@xenova thank you for the response. I did try https://github.com/huggingface/optimum/pull/2131 to convert to ONNX earlier, but when I load the model using ORTModelForSequenceClassification.from_pretrained(model_dir, export=True) I get the following error:
RuntimeError: Detected that you are using FX to torch.jit.trace a dynamo-optimized function. This is not supported at the moment.
Not sure how to fix this error.
That's what the DisableCompileContextManager class fixes.
import torch

class DisableCompileContextManager:
    def __init__(self):
        self._original_compile = torch.compile

    def __enter__(self):
        # Turn torch.compile into a no-op
        torch.compile = lambda *args, **kwargs: lambda x: x

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Restore the original torch.compile on exit
        torch.compile = self._original_compile
usage:
with DisableCompileContextManager():
model = ... # load model here
You should be able to export the model from the cli without needing this, though.
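For example, something along these lines should work (the model path is a placeholder, and the text-classification task name is an assumption based on the classification use case described above):

optimum-cli export onnx --model path/to/finetuned-modernbert --task text-classification modernbert-onnx/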
Thank you for all the help. I tried using the DisableCompileContextManager, but I am now getting the following error:
triton.compiler.errors.CompilationError: at 32:22:
# Meta-parameters
BLOCK_K: tl.constexpr,
IS_SEQLEN_OFFSETS_TENSOR: tl.constexpr,
IS_VARLEN: tl.constexpr,
INTERLEAVED: tl.constexpr,
CONJUGATE: tl.constexpr,
BLOCK_M: tl.constexpr,
):
pid_m = tl.program_id(axis=0)
pid_batch = tl.program_id(axis=1)
pid_head = tl.program_id(axis=2)
rotary_dim_half = rotary_dim // 2
^
IncompatibleTypeErrorImpl('invalid operands of type pointer<int64> and triton.language.int32')
I also tried using the CLI directly, but that throws a different error:
optimum/subpackages.py", line 49, in load_namespace_modules
if not dist_name.startswith(f"{namespace}-"):
AttributeError: 'NoneType' object has no attribute 'startswith'
Just FYI, I fine-tuned the following distilled ModernBERT version and am trying to convert the fine-tuned version to ONNX and quantize it: https://huggingface.co/andersonbcdefg/distilmodernbert
FWIW, I had the same issue as @DeepakSinghRawat (on rotary_dim), but @xenova managed to do the exports for ModernBERT-embed-large using a Colab notebook.
Maybe he could share this notebook as a workaround for a bit, until the root cause of the issue is found.
@NohTow Here's a solution provided by @wakaka6: https://github.com/huggingface/transformers/issues/35545#issuecomment-2589533973
I haven't tried it yet because I'm currently working on something else, but you might want to give it a shot and see if it works for you.
I have solved this problem.
.from_pretrained(model_path, attn_implementation="eager", reference_compile=False)
Thanks to the solutions provided in this discussion and https://github.com/huggingface/transformers/issues/35545#issuecomment-2589533973
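For context, a fuller version of that one-liner might look like the sketch below (the path is a placeholder; whether these kwargs also need to be forwarded through the Optimum export path is something I haven't verified):

from transformers import AutoModelForSequenceClassification

# Load with eager attention and without torch.compile, per the workaround above
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/finetuned-modernbert",   # placeholder path
    attn_implementation="eager",      # avoid the FlashAttention/Triton rotary kernel
    reference_compile=False,          # don't torch.compile the reference forward pass
)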
As I said in the other issue, unfortunately disabling FA2 and using eager attention doesn't fix the problem. Without FA2, ModernBERT loses most of its architectural advantages (lower VRAM usage and fast processing of long contexts).
Any updates on supporting FA2 with ONNX?
Hi @xenova @anunknowperson, I am running into the exact same issues when trying to export an ONNX-compiled version of ModernBERT. One of the main advantages of ModernBERT is inference speed, and the lack of support for Flash Attention really hinders that. Is there a plan to look into this?