Implement Jina CLIP v2 and NewBie dual CLIP
I've implemented Jina CLIP v2, which is used by the NewBie image model. Before this PR, ComfyUI already supported NewBie's DiT (see https://github.com/comfyanonymous/ComfyUI/pull/11172) and the gemma-3-4b-it text encoder (added for Lumina2). After this PR, NewBie runs with full functionality in native ComfyUI.
The implementation of Jina CLIP v2 lives in a single Python file, in the same style as the existing BERT and Llama/Gemma implementations. The weights and the tokenizer are packaged together in a single safetensors file. I've verified that it produces the same clip_text_pooled as the official Jina CLIP v2, up to floating-point error.
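For context, a pooled text embedding like clip_text_pooled is typically obtained by mean pooling the encoder's per-token hidden states under the attention mask (Jina's text models use mean pooling). Here is a NumPy sketch of that operation; it is an illustration of the technique, not code from this PR:

```python
import numpy as np

def mean_pool(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Masked mean pooling: average the hidden states of real tokens only.

    Illustrative sketch; the exact pooling in the Jina CLIP v2 port may differ.
    """
    mask = attention_mask[..., None].astype(hidden.dtype)  # (B, T, 1)
    summed = (hidden * mask).sum(axis=1)                   # (B, D)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)         # (B, 1)
    return summed / counts

# Toy check: values at padded positions must not leak into the pooled vector.
hidden = np.ones((1, 3, 4))
hidden[0, 2] = 100.0                  # this position is masked out below
mask = np.array([[1, 1, 0]])
pooled = mean_pool(hidden, mask)
```

Comparing such a pooled vector against the official model's output (within floating-point tolerance) is how the test above can be done.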
Here is an image generated using this PR, with a simple workflow in it:
- `NewBie-Image-Exp0.1.safetensors` is downloaded from https://huggingface.co/NewBie-AI/NewBie-image-Exp0.1/blob/main/transformer/diffusion_pytorch_model.safetensors
- `gemma_3_4b_it_bf16.safetensors` is downloaded from https://huggingface.co/woctordho/comfyui-gemma-3-4b-it/blob/main/gemma_3_4b_it_bf16.safetensors
- `jina_clip_v2_bf16.safetensors` is downloaded from https://huggingface.co/woctordho/comfyui-jina-clip-v2/blob/main/jina_clip_v2_bf16.safetensors
- The prompt is copied from https://civitai.com/images/112919154
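For reference, assuming ComfyUI's usual model directories (the exact subfolder names here are my assumption), the three files would be placed like this:

```
ComfyUI/models/diffusion_models/NewBie-Image-Exp0.1.safetensors
ComfyUI/models/text_encoders/gemma_3_4b_it_bf16.safetensors
ComfyUI/models/text_encoders/jina_clip_v2_bf16.safetensors
```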
If I understand correctly, both Gemma and Jina need this system prompt at the start of the text in the CLIPTextEncode node:
You are an assistant designed to generate high-quality anime images with the highest degree of image-text alignment based on xml format textual prompts. <Prompt Start>
otherwise the generated image will be garbage.
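In other words, the text fed to the encoders is simply the system prompt concatenated in front of the user prompt. A small sketch with a hypothetical helper (in the actual workflow, the full string is just typed into the node's text field):

```python
# System prompt as given above; "<Prompt Start>" separates it from the user prompt.
SYSTEM_PROMPT = (
    "You are an assistant designed to generate high-quality anime images "
    "with the highest degree of image-text alignment based on xml format "
    "textual prompts. <Prompt Start>"
)

def build_prompt(user_prompt: str) -> str:
    """Hypothetical helper: prepend the NewBie system prompt to a user prompt."""
    return f"{SYSTEM_PROMPT} {user_prompt}"

full_text = build_prompt("1girl, cherry blossoms, detailed background")
```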
After this PR is merged, I can make another PR to support the checkpoint loader (all-in-one model files).