Request for code to quantize and convert ModernBERT Model to ONNX
I am fine-tuning the ModernBERT model for a classification task and now need to quantize and convert it to ONNX. I tried using the Hugging Face Optimum library, but it does not currently support ModernBERT.
I noticed that quantized models are available in ModernBERT's Hugging Face repository. Could you please share the code or steps used to quantize and convert these models to ONNX?
Hi @DeepakSinghRawat 👋 Here's the dev branch of Optimum we used to convert the models to ONNX: https://github.com/huggingface/optimum/pull/2131
You can install it with
pip install --upgrade git+https://github.com/huggingface/optimum.git@add-modernbert-onnx
We're doing the final reviews for it now and it should be usable in the next version of Optimum.
Quantization can be done with this conversion script: https://github.com/huggingface/transformers.js/blob/main/scripts/quantize.py
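For reference, here is a rough sketch of what the export-then-quantize flow can look like with Optimum's Python API, using ORTQuantizer for dynamic INT8 quantization rather than the transformers.js script (the model path and output directories below are placeholders, not from this thread):

from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the fine-tuned checkpoint to ONNX (placeholder path)
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "path/to/finetuned-modernbert", export=True
)
ort_model.save_pretrained("modernbert-onnx")

# Dynamic INT8 quantization of the exported graph
quantizer = ORTQuantizer.from_pretrained("modernbert-onnx")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="modernbert-onnx-quantized", quantization_config=qconfig)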
@xenova thank you for the response. I did try https://github.com/huggingface/optimum/pull/2131 to convert to ONNX earlier, but when I load the model using ORTModelForSequenceClassification.from_pretrained(model_dir, export=True) I get the following error:
RuntimeError: Detected that you are using FX to torch.jit.trace a dynamo-optimized function. This is not supported at the moment.
Not sure how to fix this error.
That's what the DisableCompileContextManager class fixes.
import torch

class DisableCompileContextManager:
    def __init__(self):
        self._original_compile = torch.compile

    def __enter__(self):
        # Turn torch.compile into a no-op
        torch.compile = lambda *args, **kwargs: lambda x: x

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Restore the original torch.compile on exit
        torch.compile = self._original_compile
usage:
with DisableCompileContextManager():
model = ... # load model here
You should be able to export the model from the cli without needing this, though.
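For example, something along these lines should work (the model path is a placeholder, and the text-classification task name is an assumption based on the classification use case described above):

optimum-cli export onnx --model path/to/finetuned-modernbert --task text-classification modernbert-onnx/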
Thank you for all the help. I tried using the DisableCompileContextManager, but I am now getting the following error:
triton.compiler.errors.CompilationError: at 32:22:
# Meta-parameters
BLOCK_K: tl.constexpr,
IS_SEQLEN_OFFSETS_TENSOR: tl.constexpr,
IS_VARLEN: tl.constexpr,
INTERLEAVED: tl.constexpr,
CONJUGATE: tl.constexpr,
BLOCK_M: tl.constexpr,
):
pid_m = tl.program_id(axis=0)
pid_batch = tl.program_id(axis=1)
pid_head = tl.program_id(axis=2)
rotary_dim_half = rotary_dim // 2
^
IncompatibleTypeErrorImpl('invalid operands of type pointer<int64> and triton.language.int32')
I also tried using the CLI directly, but that throws a different error:
optimum/subpackages.py", line 49, in load_namespace_modules
if not dist_name.startswith(f"{namespace}-"):
AttributeError: 'NoneType' object has no attribute 'startswith'
Just FYI, I fine-tuned the following distilled ModernBERT version and am trying to convert the fine-tuned version to ONNX and quantize it: https://huggingface.co/andersonbcdefg/distilmodernbert
FWIW, I had the same issue as @DeepakSinghRawat (on rotary_dim), but @xenova managed to do the exports for ModernBERT-embed-large using a Colab notebook.
Maybe he could share this notebook as a workaround for a bit, until the root cause of the issue is found.
@NohTow Here's a solution provided by @wakaka6: https://github.com/huggingface/transformers/issues/35545#issuecomment-2589533973
I haven't tried it yet because I'm currently working on something else, but you might want to give it a shot and see if it works for you.
I have solved this problem.
.from_pretrained(model_path, attn_implementation="eager", reference_compile=False)
Thanks to the solutions provided in this discussion and https://github.com/huggingface/transformers/issues/35545#issuecomment-2589533973
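For context, a fuller version of that one-liner might look like the sketch below (the path is a placeholder; whether these kwargs also need to be forwarded through the Optimum export path is something I haven't verified):

from transformers import AutoModelForSequenceClassification

# Load with eager attention and without torch.compile, per the workaround above
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/finetuned-modernbert",   # placeholder path
    attn_implementation="eager",      # avoid the FlashAttention/Triton rotary kernel
    reference_compile=False,          # don't torch.compile the reference forward pass
)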
As I said in the other issue, unfortunately disabling FA2 and using eager attention doesn't fix the problem. Without FA2, ModernBERT loses most of its architectural advantages (lower VRAM usage and fast processing of long contexts).
Any updates on supporting FA2 with ONNX?
Hi @xenova @anunknowperson, I am running into the exact same issues when trying to export an ONNX-compiled version of ModernBERT. One of the main advantages of ModernBERT is inference speed, and the lack of support for Flash Attention really hinders that. Is there a plan to look into this?