gguf-py: Support for Qwen2.5 VL - DRAFT (#11483)
I've been trying to get the surgery script working reliably for a few days, but without luck.
The vision model produced by the latest version of the script (as of this PR) crashes llama-qwen2vl-cli:
```
clip_model_load: model name: Qwen2.5-VL-7B-Instruct
clip_model_load: description: Image encoder for Qwen2.5VL
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 520
clip_model_load: n_kv: 22
clip_model_load: ftype: f32
clip_model_load: loaded meta data with 22 key-value pairs and 520 tensors from C:/Users/vdonc/gguf/mmproj2-Qwen2.5-VL-7B-Instruct-F32.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: general.description str = Image encoder for Qwen2.5VL
clip_model_load: - kv 2: general.file_type u32 = 0
clip_model_load: - kv 3: clip.has_text_encoder bool = false
clip_model_load: - kv 4: clip.has_vision_encoder bool = true
clip_model_load: - kv 5: clip.has_qwen2vl_merger bool = true
clip_model_load: - kv 6: clip.projector_type str = qwen2vl_merger
clip_model_load: - kv 7: clip.vision.patch_size u32 = 14
clip_model_load: - kv 8: clip.vision.image_size u32 = 560
clip_model_load: - kv 9: clip.vision.projection_dim u32 = 1536
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1280
clip_model_load: - kv 11: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 12: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 13: clip.vision.block_count u32 = 32
clip_model_load: - kv 14: clip.vision.feed_forward_length u32 = 0
clip_model_load: - kv 15: general.name str = Qwen2.5-VL-7B-Instruct
clip_model_load: - kv 16: clip.vision.mm_patch_merge_type str = qwen2vl_merger
clip_model_load: - kv 17: clip.vision.image_crop_resolution u32 = 560
clip_model_load: - kv 18: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 19: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv 20: clip.use_silu bool = true
clip_model_load: - kv 21: clip.use_gelu bool = false
clip_model_load: - type f32: 520 tensors
clip_model_load: CLIP using Vulkan backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 0
clip_model_load: minicpmv_projector: 0
clip_model_load: glm_projector: 0
clip_model_load: model size: 2580.83 MB
clip_model_load: metadata size: 0.18 MB
clip_model_load: params backend buffer size = 2580.83 MB (520 tensors)
key clip.vision.image_grid_pinpoints not found in file
sh: llama-qwen2vl-cli.exe: The system detected an overrun of a stack-based buffer in this application. This overrun could potentially allow a malicious user to gain control of this application. Error 0xc0000409
```
In fact, no matter what I do, I haven't been able to produce a working VLM GGUF since last week... Here is a broken "mmproj" extracted from Qwen2.5-VL 7B by qwen2_5_vl_surgery.py: https://huggingface.co/IAILabs/Qwen2.5-VL-7B-Instruct-GGUF/blob/main/placeholder.gguf
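In case it helps, here is a minimal (untested) sketch that dumps the metadata and tensor list of the generated mmproj using gguf-py's `GGUFReader`, e.g. to confirm whether `clip.vision.feed_forward_length` really ends up as 0 and whether `clip.vision.image_grid_pinpoints` is absent. The file name and the value-extraction details are assumptions on my part, not verified against this PR:

```python
# Minimal sketch (untested): inspect an mmproj GGUF with gguf-py's GGUFReader.
# The path below is just an example.
from gguf import GGUFReader, GGUFValueType

reader = GGUFReader("mmproj-Qwen2.5-VL-7B-Instruct-F32.gguf")

# Dump all key-value metadata; strings are decoded, other values are printed raw.
for name, field in reader.fields.items():
    if field.types and field.types[0] == GGUFValueType.STRING:
        value = bytes(field.parts[field.data[0]]).decode("utf-8")
    else:
        # field.data holds indices into field.parts for the value elements.
        value = [field.parts[idx].tolist() for idx in field.data]
    print(f"{name}: {value}")

# List the tensors so the count/names can be compared against what clip.cpp expects.
print(f"{len(reader.tensors)} tensors")
for tensor in reader.tensors:
    print(tensor.name, tensor.shape, tensor.tensor_type)
```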
The LM parts, including the quantized versions, all seem to work fine.
If anyone can assist with debugging the model and llama-qwen2vl-cli, that would be awesome! I'm also open to any other suggestions, as we're probably missing something obvious here...
This will be reopened as a follow-up to the vision API refactor: https://github.com/ggml-org/llama.cpp/pull/11292