gguf-py: Support for Qwen2.5 VL - DRAFT (#11483)
I've been trying to get the surgery script working reliably for a few days, but without luck.
The vision model produced by the latest version of the script (as of this PR) crashes llama-qwen2vl-cli:
```
clip_model_load: model name: Qwen2.5-VL-7B-Instruct
clip_model_load: description: Image encoder for Qwen2.5VL
clip_model_load: GGUF version: 3
clip_model_load: alignment: 32
clip_model_load: n_tensors: 520
clip_model_load: n_kv: 22
clip_model_load: ftype: f32
clip_model_load: loaded meta data with 22 key-value pairs and 520 tensors from C:/Users/vdonc/gguf/mmproj2-Qwen2.5-VL-7B-Instruct-F32.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv 0: general.architecture str = clip
clip_model_load: - kv 1: general.description str = Image encoder for Qwen2.5VL
clip_model_load: - kv 2: general.file_type u32 = 0
clip_model_load: - kv 3: clip.has_text_encoder bool = false
clip_model_load: - kv 4: clip.has_vision_encoder bool = true
clip_model_load: - kv 5: clip.has_qwen2vl_merger bool = true
clip_model_load: - kv 6: clip.projector_type str = qwen2vl_merger
clip_model_load: - kv 7: clip.vision.patch_size u32 = 14
clip_model_load: - kv 8: clip.vision.image_size u32 = 560
clip_model_load: - kv 9: clip.vision.projection_dim u32 = 1536
clip_model_load: - kv 10: clip.vision.embedding_length u32 = 1280
clip_model_load: - kv 11: clip.vision.attention.head_count u32 = 16
clip_model_load: - kv 12: clip.vision.attention.layer_norm_epsilon f32 = 0.000001
clip_model_load: - kv 13: clip.vision.block_count u32 = 32
clip_model_load: - kv 14: clip.vision.feed_forward_length u32 = 0
clip_model_load: - kv 15: general.name str = Qwen2.5-VL-7B-Instruct
clip_model_load: - kv 16: clip.vision.mm_patch_merge_type str = qwen2vl_merger
clip_model_load: - kv 17: clip.vision.image_crop_resolution u32 = 560
clip_model_load: - kv 18: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv 19: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv 20: clip.use_silu bool = true
clip_model_load: - kv 21: clip.use_gelu bool = false
clip_model_load: - type f32: 520 tensors
clip_model_load: CLIP using Vulkan backend
clip_model_load: text_encoder: 0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector: 0
clip_model_load: minicpmv_projector: 0
clip_model_load: glm_projector: 0
clip_model_load: model size: 2580.83 MB
clip_model_load: metadata size: 0.18 MB
clip_model_load: params backend buffer size = 2580.83 MB (520 tensors)
key clip.vision.image_grid_pinpoints not found in file
sh: llama-qwen2vl-cli.exe: The system detected an overrun of a stack-based buffer in this application. This overrun could potentially allow a malicious user to gain control of this application. Error 0xc0000409
```
In fact, no matter what I do, I haven't been able to produce a working VLM GGUF since last week... Here is a broken "mmproj" extracted from Qwen2.5-VL 7B by qwen2_5_vl_surgery.py: https://huggingface.co/IAILabs/Qwen2.5-VL-7B-Instruct-GGUF/blob/main/placeholder.gguf
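In case it helps, here is a minimal (untested) sketch that dumps the metadata and tensor list of the generated mmproj using gguf-py's `GGUFReader`, e.g. to confirm whether `clip.vision.feed_forward_length` really ends up as 0 and whether `clip.vision.image_grid_pinpoints` is absent. The file name and the value-extraction details are assumptions on my part, not verified against this PR:

```python
# Minimal sketch (untested): inspect an mmproj GGUF with gguf-py's GGUFReader.
# The path below is just an example.
from gguf import GGUFReader, GGUFValueType

reader = GGUFReader("mmproj-Qwen2.5-VL-7B-Instruct-F32.gguf")

# Dump all key-value metadata; strings are decoded, other values are printed raw.
for name, field in reader.fields.items():
    if field.types and field.types[0] == GGUFValueType.STRING:
        value = bytes(field.parts[field.data[0]]).decode("utf-8")
    else:
        # field.data holds indices into field.parts for the value elements.
        value = [field.parts[idx].tolist() for idx in field.data]
    print(f"{name}: {value}")

# List the tensors so the count/names can be compared against what clip.cpp expects.
print(f"{len(reader.tensors)} tensors")
for tensor in reader.tensors:
    print(tensor.name, tensor.shape, tensor.tensor_type)
```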
The LM parts, including the quantized versions, all seem to work fine.
If anyone can assist with debugging the model and llama-qwen2vl-cli, that would be awesome! I'm also open to any other suggestions, as we're probably missing something obvious here...
This will be reopened as a follow-up to the vision API refactor: https://github.com/ggml-org/llama.cpp/pull/11292