Implement Jina CLIP v2 and NewBie dual CLIP
I've implemented Jina CLIP v2, which is used by the NewBie image model. Before this PR, ComfyUI already supported NewBie's DiT (see https://github.com/comfyanonymous/ComfyUI/pull/11172) and the gemma-3-4b-it text encoder (added for Lumina2). After this PR, NewBie runs with full functionality in native ComfyUI.
The implementation of Jina CLIP v2 lives in a single Python file, in the same style as the existing BERT and Llama/Gemma implementations. The weights and the tokenizer are packaged together in a single safetensors file. I've verified that it produces the same clip_text_pooled as the official Jina CLIP v2, up to floating-point error.
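For context, a pooled text embedding like clip_text_pooled is typically obtained by mean pooling the encoder's per-token hidden states under the attention mask (Jina's text models use mean pooling). Here is a NumPy sketch of that operation; it is an illustration of the technique, not code from this PR:

```python
import numpy as np

def mean_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Masked mean pooling: average the hidden states of real tokens only.

    Illustrative sketch; the exact pooling in the Jina CLIP v2 port may differ.
    """
    mask = attention_mask[..., None].astype(hidden.dtype)  # (B, T, 1)
    summed = (hidden * mask).sum(axis=1)                   # (B, D)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)         # (B, 1)
    return summed / counts

# Toy check: values at padded positions must not leak into the pooled vector.
hidden = np.ones((1, 3, 4))
hidden[0, 2] = 100.0                  # this position is masked out below
mask = np.array([[1, 1, 0]])
pooled = mean_pool(hidden, mask)
```

Comparing such a pooled vector against the official model's output (within floating-point tolerance) is how the test above can be done.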
Here is an image generated using this PR, with a simple workflow in it:
- `NewBie-Image-Exp0.1.safetensors` is downloaded from https://huggingface.co/NewBie-AI/NewBie-image-Exp0.1/blob/main/transformer/diffusion_pytorch_model.safetensors
- `gemma_3_4b_it_bf16.safetensors` is downloaded from https://huggingface.co/woctordho/comfyui-gemma-3-4b-it/blob/main/gemma_3_4b_it_bf16.safetensors
- `jina_clip_v2_bf16.safetensors` is downloaded from https://huggingface.co/woctordho/comfyui-jina-clip-v2/blob/main/jina_clip_v2_bf16.safetensors
- The prompt is copied from https://civitai.com/images/112919154
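For reference, assuming ComfyUI's usual model directories (the exact subfolder names here are my assumption), the three files would be placed like this:

```
ComfyUI/models/diffusion_models/NewBie-Image-Exp0.1.safetensors
ComfyUI/models/text_encoders/gemma_3_4b_it_bf16.safetensors
ComfyUI/models/text_encoders/jina_clip_v2_bf16.safetensors
```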
If I understand correctly, both Gemma and Jina need this system prompt at the start of the text in the CLIPTextEncode node:
You are an assistant designed to generate high-quality anime images with the highest degree of image-text alignment based on xml format textual prompts. <Prompt Start>
otherwise the generated image will be garbage.
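In other words, the text fed to the encoders is simply the system prompt concatenated in front of the user prompt. A small sketch with a hypothetical helper (in the actual workflow, the full string is just typed into the node's text field):

```python
# System prompt as given above; "<Prompt Start>" separates it from the user prompt.
SYSTEM_PROMPT = (
    "You are an assistant designed to generate high-quality anime images "
    "with the highest degree of image-text alignment based on xml format "
    "textual prompts. <Prompt Start>"
)

def build_prompt(user_prompt: str) -> str:
    """Hypothetical helper: prepend the NewBie system prompt to a user prompt."""
    return f"{SYSTEM_PROMPT} {user_prompt}"

full_text = build_prompt("1girl, cherry blossoms, detailed background")
```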
After this PR is merged, I can make another PR to support the checkpoint loader (all-in-one model files).