
Add support for GGUF Phi-3

Open a8nova opened this issue 1 year ago • 11 comments

What does this PR do?

Add support for GGUF Phi-3. Use https://github.com/huggingface/transformers/pull/31175 as a guide for adding GGUF support for Phi-3.

Fixes https://github.com/huggingface/transformers/issues/31826

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

a8nova avatar Jul 08 '24 16:07 a8nova

@khalidmohammedet

a8nova avatar Jul 08 '24 16:07 a8nova

cc @SunMarc

amyeroberts avatar Jul 08 '24 17:07 amyeroberts

Hi @SunMarc @amyeroberts - while testing my changes I ran into a failure. I checked Qwen2's test and it fails the same way. I am attaching the error below; I am running this on a Colab T4 GPU. Any ideas why Qwen2's GGUF loading is also failing for me? Command to reproduce the failure: python -m pytest -v ./tests/quantization/ggml/test_ggml.py::GgufIntegrationTests::test_qwen2_q4_0. Thanks!

============================================= FAILURES =============================================
_______________________________ GgufIntegrationTests.test_qwen2_q4_0 _______________________________

self = <ggml.test_ggml.GgufIntegrationTests testMethod=test_qwen2_q4_0>

    def test_qwen2_q4_0(self):
        tokenizer = AutoTokenizer.from_pretrained(self.qwen2_model_id, gguf_file=self.q4_0_qwen2_model_id)
>       model = AutoModelForCausalLM.from_pretrained(
            self.qwen2_model_id, gguf_file=self.q4_0_qwen2_model_id, device_map="auto", torch_dtype=torch.float16
        )

tests/quantization/ggml/test_ggml.py:166: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
src/transformers/models/auto/auto_factory.py:564: in from_pretrained
    return model_class.from_pretrained(
src/transformers/modeling_utils.py:3583: in from_pretrained
    state_dict = load_gguf_checkpoint(gguf_path, return_tensors=True)["tensors"]
src/transformers/modeling_gguf_pytorch_utils.py:146: in load_gguf_checkpoint
    weights = load_dequant_gguf_tensor(shape=shape, ggml_type=tensor.tensor_type, data=tensor.data)
src/transformers/integrations/ggml.py:526: in load_dequant_gguf_tensor
    values = dequantize_q6_k(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

data = memmap([[192,   8,  15, ..., 116, 211, 128],
        [144, 213, 102, ...,  56,  14, 129],
        [ 35,  71, 212, ...,...225,  22,   1],
        [112,  36, 159, ..., 225,  22,   1],
        [112,  36, 143, ..., 225,  22,   1]], dtype=uint8)

    def dequantize_q6_k(data):
        # C implementation
        # https://github.com/ggerganov/ggml/blob/fca1caafea7de9fbd7efc733b9818f9cf2da3050/src/ggml-quants.c#L2275
        # C struct definition
        # https://github.com/ggerganov/ggml/blob/fca1caafea7de9fbd7efc733b9818f9cf2da3050/src/ggml-quants.h#L152
        block_size = GGML_BLOCK_SIZES["Q6_K"]
        num_blocks = len(data) // block_size
    
>       data_f16 = np.frombuffer(data, dtype=np.float16).reshape(num_blocks, block_size // 2)
E       ValueError: cannot reshape array of size 63813120 into shape (723,105)

src/transformers/integrations/ggml.py:311: ValueError
--------------------------------------- Captured stderr call ---------------------------------------
Converting and de-quantizing GGUF tensors...:   0%|          | 0/291 [00:00<?, ?it/s]
========================================= warnings summary =========================================
../../usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py:1373
  /usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py:1373: PytestConfigWarning: Unknown config option: doctest_glob
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================== short test summary info ======================================

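(Side note on the traceback: the reshape in dequantize_q6_k assumes the raw buffer length is an exact multiple of the 210-byte Q6_K block size, which is what the ValueError above is complaining about. A minimal sketch of that assumption, not the actual transformers code:)

```python
import numpy as np

# Bytes per Q6_K block, per the ggml C struct: ql[128] + qh[64] + scales[16] + d (fp16, 2).
Q6_K_BLOCK_SIZE = 210

def split_into_q6k_blocks(data: np.ndarray) -> np.ndarray:
    """Reshape a flat uint8 buffer into one row per Q6_K block.

    Fails loudly when the buffer length is not a multiple of the block
    size, e.g. when the gguf library hands back data in a different
    layout than the dequantizer expects.
    """
    flat = np.ascontiguousarray(data, dtype=np.uint8).reshape(-1)
    if flat.size % Q6_K_BLOCK_SIZE != 0:
        raise ValueError(
            f"{flat.size} bytes is not a multiple of the "
            f"{Q6_K_BLOCK_SIZE}-byte Q6_K block size"
        )
    return flat.reshape(-1, Q6_K_BLOCK_SIZE)

blocks = split_into_q6k_blocks(np.zeros(4 * Q6_K_BLOCK_SIZE, dtype=np.uint8))
print(blocks.shape)  # (4, 210)
```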
a8nova avatar Jul 15 '24 16:07 a8nova

Hi @a8nova! The latest version of ggml, released a few days ago, is not compatible with the integration. We still need to fix this! https://github.com/huggingface/transformers/issues/31725#issuecomment-2225386163. I will try to find some time to fix this, but feel free to have a look! In the meantime, you can also install the previous version of ggml and continue the PR from there!

SunMarc avatar Jul 16 '24 11:07 SunMarc

Got it, I will try with the previous ggml version. Thank you @SunMarc!

a8nova avatar Jul 17 '24 11:07 a8nova

Hi @SunMarc - I am getting gibberish output for my test. I think it has to do with Phi-3 having a slightly different attention class, where q, k, and v are fused into one variable.

I am seeing the output below while loading the GGUF weights, and since the model output differs between runs, it looks like some weights are not being initialized correctly. Right now I am mapping attn_q, attn_v and attn_k to self_attn.qkv_proj, but that feels incorrect; or does the underlying code need to support this attention mechanism?

Converting and de-quantizing GGUF tensors...: 100%|██████████| 195/195 [00:31<00:00,  6.25it/s]
Some weights of Phi3ForCausalLM were not initialized from the model checkpoint at microsoft/Phi-3-mini-4k-instruct-gguf and are newly initialized: ['model.layers.0.self_attn.qkv_proj.weight', 'model.layers.1.self_attn.qkv_proj.weight', 'model.layers.10.self_attn.qkv_proj.weight', 'model.layers.11.self_attn.qkv_proj.weight', 'model.layers.12.self_attn.qkv_proj.weight', 'model.layers.13.self_attn.qkv_proj.weight', 'model.layers.14.self_attn.qkv_proj.weight', 'model.layers.15.self_attn.qkv_proj.weight', 'model.layers.16.self_attn.qkv_proj.weight', 'model.layers.17.self_attn.qkv_proj.weight', 'model.layers.18.self_attn.qkv_proj.weight', 'model.layers.19.self_attn.qkv_proj.weight', 'model.layers.2.self_attn.qkv_proj.weight', 'model.layers.20.self_attn.qkv_proj.weight', 'model.layers.21.self_attn.qkv_proj.weight', 'model.layers.22.self_attn.qkv_proj.weight', 'model.layers.23.self_attn.qkv_proj.weight', 'model.layers.24.self_attn.qkv_proj.weight', 'model.layers.25.self_attn.qkv_proj.weight', 'model.layers.26.self_attn.qkv_proj.weight', 'model.layers.27.self_attn.qkv_proj.weight', 'model.layers.28.self_attn.qkv_proj.weight', 'model.layers.29.self_attn.qkv_proj.weight', 'model.layers.3.self_attn.qkv_proj.weight', 'model.layers.30.self_attn.qkv_proj.weight', 'model.layers.31.self_attn.qkv_proj.weight', 'model.layers.4.self_attn.qkv_proj.weight', 'model.layers.5.self_attn.qkv_proj.weight', 'model.layers.6.self_attn.qkv_proj.weight', 'model.layers.7.self_attn.qkv_proj.weight', 'model.layers.8.self_attn.qkv_proj.weight', 'model.layers.9.self_attn.qkv_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
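(For context, the fused layout amounts to stacking the three GGUF tensors row-wise in [q | k | v] order, on the assumption that Phi3Attention slices qkv_proj along the output dimension in that order. A sketch of that fusion; the shapes and the GQA split below are illustrative, not taken from the PR:)

```python
import numpy as np

def fuse_qkv(attn_q: np.ndarray, attn_k: np.ndarray, attn_v: np.ndarray) -> np.ndarray:
    """Stack separate q/k/v projection weights into one qkv_proj matrix.

    Linear weights are (out_features, in_features), so concatenating along
    axis 0 yields a fused projection whose output is [q | k | v], the order
    qkv_proj is assumed to be sliced in.
    """
    return np.concatenate([attn_q, attn_k, attn_v], axis=0)

hidden = 8
q = 1.0 * np.ones((hidden, hidden))       # query projection
k = 2.0 * np.ones((hidden // 2, hidden))  # fewer kv rows, e.g. grouped-query attention
v = 3.0 * np.ones((hidden // 2, hidden))
qkv = fuse_qkv(q, k, v)
print(qkv.shape)  # (16, 8)
```

If the de-quantized GGUF tensors are concatenated in the wrong order (or not concatenated at all, so the renamed key never matches), the qkv_proj weights stay randomly initialized, which would match the nondeterministic gibberish output.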

For the tokenizer, I am using Llama's tokenizer, since Phi-3 uses the Llama tokenizer according to the research paper. I have added the special tokens for Phi-3. When I dump the tokenizer from GGUF, it looks similar to the tokenizer loaded by AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct"), so I don't think there is any issue there.

a8nova avatar Jul 22 '24 12:07 a8nova

Hi! I’m interested in working on this issue. Is anyone currently working on it? If not, I’d love to take it on as my first one.

kibru9399 avatar Aug 19 '24 16:08 kibru9399

Hi @kibru9399 - I am actively working on this; it is actually almost finalized.

a8nova avatar Aug 19 '24 17:08 a8nova

No worries, thanks for the update!

kibru9399 avatar Aug 19 '24 17:08 kibru9399

Hello @SunMarc @amyeroberts - This PR is ready for review!

A few remaining TODO's on my end:

  1. The generated output from the GGUF model microsoft/Phi-3-mini-4k-instruct-gguf differs from the non-GGUF model microsoft/Phi-3-mini-4k-instruct. This is also true for Qwen2. It can be due to many different reasons, but I want to make sure it is not due to a bug in the conversion code in this PR.
  2. Test this PR with Phi-3.5 weights.
  3. There is a warning Merges were not in checkpoint, building merges on the fly. for microsoft/Phi-3-mini-4k-instruct-gguf. Do you know if I need to look into this warning?

Thanks!

a8nova avatar Aug 25 '24 19:08 a8nova

LMK when you have finished your TODOs, so that I can ask for a final review from a core maintainer!

SunMarc avatar Aug 26 '24 13:08 SunMarc

Hi @SunMarc - For trying Phi-3.5 weights: there seems to be something wrong with the Phi-3.5 template, because the model gives weird output. For example, with input "Can you provide ways to eat combinations of bananas and dragonfruits?" the model outputs:

Output:
 <s>Canyouprovidewaystoeatcombinationsofbananasanddragonfruits?

Assistant:
Certainly! Here's a simple recipe for a banana and dragon fruit smoothie:

Ingredients:
- 1 ripe banana, peeled and sliced
- 1/2 cup of diced dragon fruit (also known as pitaya)
- 1

And with input "Hello", model gives (HTML) output:

Output:
 <s>Hello, World!</h1>

<p>This is a paragraph.</p>

<ul>
  <li>Item 1</li>
  <li>Item 2</li>dict

<script>
  console.log("Hello, World!");
</script>

<img src="image.jpg

My understanding was that Phi-3.5 is architecturally identical to Phi-3, but there seem to be some differences, and I don't see them documented anywhere. I won't worry about this issue in the current scope.
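(For reference while debugging: the run-together <s>Canyouprovide... output suggests the chat template is not being applied at all. My understanding of the single-turn Phi-3 instruct format, based on the chat template shipped with microsoft/Phi-3-mini-4k-instruct; treat it as an assumption and diff it against the 3.5 checkpoint's template:)

```python
def phi3_prompt(user_message: str) -> str:
    """Single-turn Phi-3 instruct prompt.

    Format assumed from the microsoft/Phi-3-mini-4k-instruct chat template;
    the 3.5 checkpoint may differ, which would explain the untemplated-looking
    output above.
    """
    return f"<|user|>\n{user_message}<|end|>\n<|assistant|>\n"

print(phi3_prompt("Can you provide ways to eat combinations of bananas and dragonfruits?"))
```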

For the missing-merges issue: I don't see merges in the GGUF file, though they exist in the original checkpoints. I see that the merges are built if they don't exist in the GGUF checkpoint, which slows things down. Are the missing merges a bug in the safetensors -> GGUF conversion? How do you suggest we move forward? (I also tried converting a few Phi-3 variants via gguf-my-repo, but they are still missing the merges in the converted GGUF files.)

If the merges missing from the checkpoint are not a blocker, then I think this PR is ready for a final review; if they are, we can get to the bottom of it first. Thanks!
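(For context on the "building merges on the fly" warning: when merges are absent, transformers reconstructs BPE merge rules from the ranked vocabulary. A simplified sketch of that idea, not the exact ggml.py implementation:)

```python
def build_merges_from_vocab(vocab: dict[str, int]) -> list:
    """Rebuild BPE merge rules from a rank-ordered vocab.

    For each multi-character token, pick the best split (left, right) whose
    halves both exist in the vocab, then order the merges by the merged
    token's rank. Simplified sketch only; the real loader scores splits
    differently.
    """
    ranked_merges = []
    for token, rank in vocab.items():
        if len(token) < 2:
            continue
        candidates = [
            (vocab[token[:i]], vocab[token[i:]], token[:i], token[i:])
            for i in range(1, len(token))
            if token[:i] in vocab and token[i:] in vocab
        ]
        if candidates:
            _, _, left, right = min(candidates)
            ranked_merges.append((rank, left, right))
    ranked_merges.sort()
    return [(left, right) for _, left, right in ranked_merges]

vocab = {"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4}
merges = build_merges_from_vocab(vocab)
print(merges)  # [('a', 'b'), ('ab', 'c')]
```

This reconstruction runs once per load, which is why a checkpoint without merges only costs extra loading time rather than changing tokenization.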

a8nova avatar Aug 28 '24 16:08 a8nova

> For merges missing issue, I don't see a merges in the ggml file though it exists in the original checkpoints. I see the merges are built if they don't exist in the ggml checkpoints which slows things down. Is the merges missing a bug in the safetensors -> ggml conversion? how do you suggest we move forward? (I also tried converting via gguf-my-repo a few phi3 variants but they are still missing the merges in the converted ggml files)
>
> if the merges missing from checkpoint is not a blocker then I think this PR is ready for a final review, if it is a blocker we can get to the bottom of it first. Thanks!

Could you quickly check whether we have the issue on a Llama GGUF model? If not, I think we should probably fix this with a PR in the llama.cpp repo. We would have to fix this conversion script: https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py.

SunMarc avatar Sep 02 '24 17:09 SunMarc

Hi @SunMarc! When I try via an already converted gguf llama like TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF which is also used inside test_ggml.py, I don't see the missing merges. But a llama gguf I just converted using gguf-my-repo: a8nova/TinyLlama-1.1B-Chat-v1.0-Q4_0-GGUF is missing merges. So it looks like a recent change (in llama.cpp repo?) broke the conversion for all models...

a8nova avatar Sep 02 '24 21:09 a8nova

> Hi @SunMarc! When I try via an already converted gguf llama like TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF which is also used inside test_ggml.py, I don't see the missing merges. But a llama gguf I just converted using gguf-my-repo: a8nova/TinyLlama-1.1B-Chat-v1.0-Q4_0-GGUF is missing merges. So it looks like a recent change (in llama.cpp repo?) broke the conversion for all models...

Thanks for the investigation! We should definitely try to fix this, as it can be quite troublesome if we have to recreate the merges for all recent Llama models.

SunMarc avatar Sep 03 '24 12:09 SunMarc

I am able to checkout llama.cpp and repro the missing merges bug locally and I also have a fix.

The bug shows up for the llama family models only. I have narrowed it down to the _set_vocab_llama_hf() routine inside convert_hf_to_gguf.py#L806. I am able to fix it by passing load_merges=True to that line like:

special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True, n_vocab=len(tokens))

It looks like we only go inside _set_vocab_llama_hf() if self._set_vocab_sentencepiece(), which is wrapped in a try/except inside the LlamaModel class, fails. It does fail in my case, since there is no tokenizer.model file for the Llama model or Phi-3, but there is a tokenizer.json.

If this fix makes sense to you, I'll create a PR in the llama.cpp repo!

Update: I think the bug can also happen for any model class that calls _set_vocab_sentencepiece(). There is also the case where a tokenizer.model file is present, in which case _create_vocab_sentencepiece() never throws an exception; when we are back in _set_vocab_sentencepiece(), load_merges is not passed as True there either, so that would be another place we would have to fix.

a8nova avatar Sep 03 '24 18:09 a8nova

I would first open an issue on llama.cpp to see if anyone there has more insight, and explain the potential fix! Thanks for investigating! I really appreciate it.

SunMarc avatar Sep 04 '24 11:09 SunMarc

Done! https://github.com/ggerganov/llama.cpp/issues/9309

  • I noticed that ggml.py in this repo builds the merges if they're missing. Is the primary drawback of missing merges just the performance slowdown, or could it lead to other issues as well?

  • Any ideas why phi3.5 is not working?

a8nova avatar Sep 04 '24 13:09 a8nova

> I noticed that ggml.py in this repo builds the merges if they're missing. Is the primary drawback of missing merges just the performance slowdown, or could it lead to other issues as well?

Well, it will just lead to increased loading time.

> Any ideas why phi3.5 is not working?

Not really. Can you check that Phi-3.5 works correctly in transformers? If it does, you can try converting it to GGUF and then loading it in transformers.

SunMarc avatar Sep 04 '24 15:09 SunMarc

Hi @SunMarc! I would like to finalize this PR if possible, are there any blockers or can we get final reviews and get this merged in? Thanks!

Regarding Phi-3.5, the non-GGUF checkpoints show the same behavior; something is off about the template/prompt.

a8nova avatar Sep 06 '24 12:09 a8nova

No blockers @a8nova. I really appreciate your work !

As for the merges issue, this is something we need to tackle separately from this PR. Can you create a separate issue for Phi-3.5? It seems that something is not working in transformers, and we need to fix that first. Then the GGUF issue will be solved naturally.

SunMarc avatar Sep 09 '24 13:09 SunMarc

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Thank you, @SunMarc! Before I file an issue for the Phi-3.5 bug, let me first confirm; I am suspicious of what I saw, and maybe I did something wrong. Also, I am curious: what is the purpose of requiring torch to run on the GPU in test_ggml.py? I have been running the tests locally on my CPU: https://github.com/huggingface/transformers/blob/7f112caac2ff365c3d6e0020fefe8c1300311e07/tests/quantization/ggml/test_ggml.py#L33-L36

a8nova avatar Sep 10 '24 09:09 a8nova

> Also, I am curious, what is the purpose of requiring torch to run on the GPU in test_ggml.py? I have been running the tests locally on my CPU

I guess it is simply to test the model faster. Maybe you can try on a GPU to see if that fixes anything?

SunMarc avatar Sep 10 '24 11:09 SunMarc