
Add JinaBert model

joelpaulkoch opened this issue 1 year ago

I want to share my work on the JinaBert model. I'm not sure whether you want to include it at all: it's not officially part of transformers, you must pass trust_remote_code=True when running it with transformers, and there is still an open issue.

This PR would enable bumblebee users to run the jina embeddings v2 models.

The implementation of JinaBert is here.

A further complication of this being a custom implementation is that there is another variant, which I have started working on: jinaai/jina-embeddings-v2-base-code.

Both jinaai/jina-embeddings-v2-base-en and jinaai/jina-embeddings-v2-base-code specify JinaBertForMaskedLM as the architecture but point to different implementations:

 "_name_or_path": "jinaai/jina-bert-implementation",
  "architectures": [
    "JinaBertForMaskedLM"
  ],
  "auto_map": {
    "AutoConfig": "jinaai/jina-bert-implementation--configuration_bert.JinaBertConfig",
    "AutoModelForMaskedLM": "jinaai/jina-bert-implementation--modeling_bert.JinaBertForMaskedLM",
    "AutoModel": "jinaai/jina-bert-implementation--modeling_bert.JinaBertModel",
    "AutoModelForSequenceClassification": "jinaai/jina-bert-implementation--modeling_bert.JinaBertForSequenceClassification"
  },

vs.

  "_name_or_path": "jinaai/jina-bert-v2-qk-post-norm",
  "architectures": [
    "JinaBertForMaskedLM"
  ],
  "auto_map": {
    "AutoConfig": "jinaai/jina-bert-v2-qk-post-norm--configuration_bert.JinaBertConfig",
    "AutoModel": "jinaai/jina-bert-v2-qk-post-norm--modeling_bert.JinaBertModel",
    "AutoModelForMaskedLM": "jinaai/jina-bert-v2-qk-post-norm--modeling_bert.JinaBertForMaskedLM",
    "AutoModelForSequenceClassification": "jinaai/jina-bert-v2-qk-post-norm--modeling_bert.JinaBertForSequenceClassification"
  },

Is there a mechanism in bumblebee to distinguish these?
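To illustrate the collision, here is a small Python sketch (not part of the PR) using abridged fragments of the two config.json files above. The implementation_repo helper is hypothetical, purely to show that the auto_map entries, rather than the architectures field, identify which code actually backs the model:

```python
# Abridged from the two config.json files quoted above; only the fields
# relevant to the collision are kept.
config_en = {
    "architectures": ["JinaBertForMaskedLM"],
    "auto_map": {
        "AutoModel": "jinaai/jina-bert-implementation--modeling_bert.JinaBertModel",
    },
}

config_code = {
    "architectures": ["JinaBertForMaskedLM"],
    "auto_map": {
        "AutoModel": "jinaai/jina-bert-v2-qk-post-norm--modeling_bert.JinaBertModel",
    },
}

# The declared architecture is identical in both checkpoints...
assert config_en["architectures"] == config_code["architectures"]

# ...so the implementation would have to be recovered from auto_map instead.
def implementation_repo(config):
    """Hypothetical helper: return the Hub repo hosting the model code
    (the part of the auto_map reference before the '--' separator)."""
    ref = config["auto_map"]["AutoModel"]
    return ref.split("--", 1)[0]

print(implementation_repo(config_en))    # jinaai/jina-bert-implementation
print(implementation_repo(config_code))  # jinaai/jina-bert-v2-qk-post-norm
```

So any dispatch keyed only on the architecture name would treat the two checkpoints as the same model.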

There are still some issues in this PR; I will add comments and can work on them over the coming days and weeks.

joelpaulkoch avatar Nov 08 '24 10:11 joelpaulkoch

Hey @joelpaulkoch, thanks for the PR and a great article!

To be honest, I am hesitant to support implementations from the Hub because (a) they are theoretically less stable, since they may still be subject to tweaks; and (b) model proliferation is more likely; the jina-embeddings-v2-base-en vs. jina-embeddings-v2-base-code split is a good example.

We generally wait until models make it to hf/transformers, though from https://github.com/huggingface/transformers/issues/27035 it's not clear if that's ever going to happen.

At the moment, I would defer the decision and see how the status quo evolves. People can still use the model by installing bumblebee as {:bumblebee, github: "joelpaulkoch/jina-embeddings-v2"}.

jonatanklosko avatar Nov 12 '24 08:11 jonatanklosko

I just closed my unsuccessful draft PR to bring this model into transformers, so I'm going to close this PR too.

As you've said, it's still available here:
https://github.com/joelpaulkoch/bumblebee/tree/jina-embeddings-v2-base-code
https://github.com/joelpaulkoch/bumblebee/tree/jina-embeddings-v2

joelpaulkoch avatar Oct 22 '25 15:10 joelpaulkoch

@joelpaulkoch thank you for the effort!

jonatanklosko avatar Oct 22 '25 16:10 jonatanklosko