Add support for GGUF Phi-3
What does this PR do?
Adds support for GGUF Phi-3, using https://github.com/huggingface/transformers/pull/31175 as a guide for adding GGUF support for Phi-3.
Fixes https://github.com/huggingface/transformers/issues/31826
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@khalidmohammedet
cc @SunMarc
Hi @SunMarc @amyeroberts - while testing my changes I ran into a failure. I checked Qwen2's test and it has the same failure. I am attaching the error below. I am running this on a Colab T4 GPU; any ideas why Qwen2's GGUF loading is also failing for me? Command to reproduce the failure below: `python -m pytest -v ./tests/quantization/ggml/test_ggml.py::GgufIntegrationTests::test_qwen2_q4_0`. Thanks!
============================================= FAILURES =============================================
_______________________________ GgufIntegrationTests.test_qwen2_q4_0 _______________________________
self = <ggml.test_ggml.GgufIntegrationTests testMethod=test_qwen2_q4_0>
def test_qwen2_q4_0(self):
tokenizer = AutoTokenizer.from_pretrained(self.qwen2_model_id, gguf_file=self.q4_0_qwen2_model_id)
> model = AutoModelForCausalLM.from_pretrained(
self.qwen2_model_id, gguf_file=self.q4_0_qwen2_model_id, device_map="auto", torch_dtype=torch.float16
)
tests/quantization/ggml/test_ggml.py:166:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/transformers/models/auto/auto_factory.py:564: in from_pretrained
return model_class.from_pretrained(
src/transformers/modeling_utils.py:3583: in from_pretrained
state_dict = load_gguf_checkpoint(gguf_path, return_tensors=True)["tensors"]
src/transformers/modeling_gguf_pytorch_utils.py:146: in load_gguf_checkpoint
weights = load_dequant_gguf_tensor(shape=shape, ggml_type=tensor.tensor_type, data=tensor.data)
src/transformers/integrations/ggml.py:526: in load_dequant_gguf_tensor
values = dequantize_q6_k(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data = memmap([[192, 8, 15, ..., 116, 211, 128],
[144, 213, 102, ..., 56, 14, 129],
[ 35, 71, 212, ...,...225, 22, 1],
[112, 36, 159, ..., 225, 22, 1],
[112, 36, 143, ..., 225, 22, 1]], dtype=uint8)
def dequantize_q6_k(data):
# C implementation
# https://github.com/ggerganov/ggml/blob/fca1caafea7de9fbd7efc733b9818f9cf2da3050/src/ggml-quants.c#L2275
# C struct definition
# https://github.com/ggerganov/ggml/blob/fca1caafea7de9fbd7efc733b9818f9cf2da3050/src/ggml-quants.h#L152
block_size = GGML_BLOCK_SIZES["Q6_K"]
num_blocks = len(data) // block_size
> data_f16 = np.frombuffer(data, dtype=np.float16).reshape(num_blocks, block_size // 2)
E ValueError: cannot reshape array of size 63813120 into shape (723,105)
src/transformers/integrations/ggml.py:311: ValueError
--------------------------------------- Captured stderr call ---------------------------------------
Converting and de-quantizing GGUF tensors...: 0%| | 0/291 [00:00<?, ?it/s]
========================================= warnings summary =========================================
../../usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py:1373
/usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py:1373: PytestConfigWarning: Unknown config option: doctest_glob
self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================== short test summary info ======================================
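For anyone hitting the same `ValueError`: the linked issue suggests the newer gguf release hands back `tensor.data` as a 2D per-block array instead of a flat byte buffer, so `len(data)` counts blocks rather than bytes and the `num_blocks` computation goes wrong. A minimal, self-contained sketch of that arithmetic (illustrative only; the 210-byte Q6_K block size is taken from ggml-quants.h):

```python
import numpy as np

GGML_BLOCK_SIZE_Q6_K = 210  # bytes per Q6_K block, per ggml-quants.h

# Older gguf releases returned a flat 1D byte buffer:
flat = np.zeros(4 * GGML_BLOCK_SIZE_Q6_K, dtype=np.uint8)
num_blocks_flat = len(flat) // GGML_BLOCK_SIZE_Q6_K  # 4, as expected

# If a newer release instead returns a 2D array of per-block bytes,
# len() counts rows (blocks), not bytes, and the same formula breaks:
blocked = flat.reshape(4, GGML_BLOCK_SIZE_Q6_K)
num_blocks_2d = len(blocked) // GGML_BLOCK_SIZE_Q6_K  # 0 -- wrong

print(num_blocks_flat, num_blocks_2d)
```

With a wrong `num_blocks`, the subsequent `np.frombuffer(...).reshape(num_blocks, block_size // 2)` fails exactly as in the traceback above.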
Hi @a8nova ! The latest version of ggml that was released a few days ago is not compatible with the integration. We still need to fix this ! https://github.com/huggingface/transformers/issues/31725#issuecomment-2225386163. I will try to find some time to fix this but feel free to have a look ! In the meantime, you can also install the previous version of ggml and continue the PR from there !
got it, I will try with previous ggml version, thank you @SunMarc!
Hi @SunMarc - I am getting gibberish output for my test. I think it has to do with Phi-3 having a slightly different attention class, where q, k, and v are fused into one variable.
I am seeing the output below while loading the GGUF weights, and since the model output differs across runs, it looks like some weights are not being correctly initialized. Right now I am mapping attn_q, attn_v and attn_k to self_attn.qkv_proj, but that feels incorrect; or does the underlying code need to support this attention mechanism?
Converting and de-quantizing GGUF tensors...: 100%|██████████| 195/195 [00:31<00:00, 6.25it/s]
Some weights of Phi3ForCausalLM were not initialized from the model checkpoint at microsoft/Phi-3-mini-4k-instruct-gguf and are newly initialized: ['model.layers.0.self_attn.qkv_proj.weight', 'model.layers.1.self_attn.qkv_proj.weight', 'model.layers.10.self_attn.qkv_proj.weight', 'model.layers.11.self_attn.qkv_proj.weight', 'model.layers.12.self_attn.qkv_proj.weight', 'model.layers.13.self_attn.qkv_proj.weight', 'model.layers.14.self_attn.qkv_proj.weight', 'model.layers.15.self_attn.qkv_proj.weight', 'model.layers.16.self_attn.qkv_proj.weight', 'model.layers.17.self_attn.qkv_proj.weight', 'model.layers.18.self_attn.qkv_proj.weight', 'model.layers.19.self_attn.qkv_proj.weight', 'model.layers.2.self_attn.qkv_proj.weight', 'model.layers.20.self_attn.qkv_proj.weight', 'model.layers.21.self_attn.qkv_proj.weight', 'model.layers.22.self_attn.qkv_proj.weight', 'model.layers.23.self_attn.qkv_proj.weight', 'model.layers.24.self_attn.qkv_proj.weight', 'model.layers.25.self_attn.qkv_proj.weight', 'model.layers.26.self_attn.qkv_proj.weight', 'model.layers.27.self_attn.qkv_proj.weight', 'model.layers.28.self_attn.qkv_proj.weight', 'model.layers.29.self_attn.qkv_proj.weight', 'model.layers.3.self_attn.qkv_proj.weight', 'model.layers.30.self_attn.qkv_proj.weight', 'model.layers.31.self_attn.qkv_proj.weight', 'model.layers.4.self_attn.qkv_proj.weight', 'model.layers.5.self_attn.qkv_proj.weight', 'model.layers.6.self_attn.qkv_proj.weight', 'model.layers.7.self_attn.qkv_proj.weight', 'model.layers.8.self_attn.qkv_proj.weight', 'model.layers.9.self_attn.qkv_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
For the tokenizer, I am using Llama's tokenizer, since Phi-3 uses the Llama tokenizer according to the research paper. I have added the special tokens for Phi-3. When I dump the tokenizer from GGUF, it looks similar to the tokenizer loaded by AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct"), so I don't think there is any issue there.
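For context on the fused attention: Phi-3's `self_attn.qkv_proj` is a single linear layer whose output rows are Q, K, and V stacked, so the separate `attn_q`/`attn_k`/`attn_v` tensors in the GGUF file would need to be concatenated along the output (row) axis rather than each mapped one-to-one onto `qkv_proj`. A rough sketch with toy shapes (the real mapping in the PR may differ, e.g. with grouped-query dimensions and head ordering):

```python
import numpy as np

hidden = 4  # illustrative; the real Phi-3-mini hidden size is 3072
# Stand-ins for the GGUF tensors attn_q.weight, attn_k.weight, attn_v.weight
q = np.full((hidden, hidden), 1.0, dtype=np.float32)
k = np.full((hidden, hidden), 2.0, dtype=np.float32)
v = np.full((hidden, hidden), 3.0, dtype=np.float32)

# Stack Q, K, V along the output (row) axis to form the fused weight:
qkv = np.concatenate([q, k, v], axis=0)
print(qkv.shape)  # (12, 4)
```

If only one of the three tensors were assigned to `qkv_proj`, the other two-thirds of the fused weight would stay randomly initialized, which would explain the "newly initialized" warning and the run-to-run variation in output.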
Hi! I’m interested in working on this issue. Is anyone currently working on it? If not, I’d love to take it on as my first one.
Hi @kibru9399 - I am actively working on this, it is actually almost finalized.
No worries, thanks for the update!
Hello @SunMarc @amyeroberts - This PR is ready for review!
A few remaining TODOs on my end:
- The generated output from the GGUF model `microsoft/Phi-3-mini-4k-instruct-gguf` is different from the non-GGUF model `microsoft/Phi-3-mini-4k-instruct`. This is also true for Qwen2. This can be due to many different reasons, but I want to make sure it is not due to a bug in the conversion code in this PR.
- Test this PR with phi3.5 weights.
- There is a warning `Merges were not in checkpoint, building merges on the fly.` for `microsoft/Phi-3-mini-4k-instruct-gguf`; do you know if I need to look into this warning?
Thanks!
LMK when you finished your TODO's, so that I can ask for a final review from a core maintainer !
Hi @SunMarc - For trying the phi3.5 weights: there seems to be something wrong with the phi3.5 template, because the model gives weird output. For example, with input: "Can you provide ways to eat combinations of bananas and dragonfruits?"
Model outputs:
Output:
<s>Canyouprovidewaystoeatcombinationsofbananasanddragonfruits?
Assistant:
Certainly! Here's a simple recipe for a banana and dragon fruit smoothie:
Ingredients:
- 1 ripe banana, peeled and sliced
- 1/2 cup of diced dragon fruit (also known as pitaya)
- 1
And with input "Hello", model gives (HTML) output:
Output:
<s>Hello, World!</h1>
<p>This is a paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>dict
<script>
console.log("Hello, World!");
</script>
<img src="image.jpg
My understanding was that phi3.5 is identical to phi3 in its architecture, but there seem to be some differences, and I don't see them documented anywhere. I won't worry about this issue in the current scope.
For the missing merges issue: I don't see merges in the GGUF file even though they exist in the original checkpoint. I see the merges are built if they don't exist in the GGUF checkpoint, which slows things down. Are the missing merges a bug in the safetensors -> GGUF conversion? How do you suggest we move forward? (I also tried converting a few phi3 variants via gguf-my-repo, but they are still missing the merges in the converted GGUF files.)
If the missing merges are not a blocker, then I think this PR is ready for a final review; if they are a blocker, we can get to the bottom of it first. Thanks!
Could you check quickly if we have the issue on a llama gguf model ? If not, I think that we should probably fix this issue with a PR in llama.cpp repo. We would have to fix this conversion script https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py.
Hi @SunMarc! When I try an already-converted GGUF Llama like TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF, which is also used inside test_ggml.py, I don't see the missing merges. But a Llama GGUF I just converted using gguf-my-repo, a8nova/TinyLlama-1.1B-Chat-v1.0-Q4_0-GGUF, is missing merges. So it looks like a recent change (in the llama.cpp repo?) broke the conversion for all models...
Thanks for the investigation ! We should definitely try to fix this as this can be quite troublesome if we have to recreate the merges for all recent llama models.
I am able to checkout llama.cpp and repro the missing merges bug locally and I also have a fix.
The bug shows up for the Llama-family models only. I have narrowed it down to the _set_vocab_llama_hf() routine at convert_hf_to_gguf.py#L806. I am able to fix it by passing load_merges=True on that line, like:
special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True, n_vocab=len(tokens))
It looks like we only go inside _set_vocab_llama_hf() if self._set_vocab_sentencepiece() (which is wrapped in a try-except inside the LlamaModel class) fails, which it does in my case since there is no tokenizer.model file for the Llama model or Phi-3, but there is a tokenizer.json.
If this fix makes sense to you, I will create a PR in the llama.cpp repo!
Update: I think the bug can also happen for any model class that calls _set_vocab_sentencepiece(). There is also the case where a tokenizer.model file is present, in which case _create_vocab_sentencepiece() never throws an exception; back in _set_vocab_sentencepiece(), load_merges is also not passed as True, so that is another place we would have to fix.
I would first open an issue on llama.cpp to see if anyone there has more insights around that, and explain the potential fix ! Thanks for investigating ! I really appreciate that.
Done! https://github.com/ggerganov/llama.cpp/issues/9309
- I noticed that ggml.py in this repo builds the merges if they're missing. Is the primary drawback of missing merges just the performance slowdown, or could it lead to other issues as well?
- Any ideas why phi3.5 is not working?
I noticed that ggml.py in this repo builds the merges if they're missing. Is the primary drawback of missing merges just the performance slowdown, or could it lead to other issues as well?
Well, it will just lead to increased loading time.
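For context on what "building merges on the fly" involves: a simplified, illustrative sketch of the idea (not the actual ggml.py code) is to treat every two-way split of a vocab token whose halves are themselves tokens as a candidate merge, ranked by the merged token's score. The cost grows with vocab size times token length, which is where the extra loading time comes from.

```python
def rebuild_merges(vocab_scores):
    """Reconstruct BPE merge pairs from a {token: score} vocab.

    Every split of a token into two halves that are both in the
    vocab is a candidate merge; candidates are ordered by the
    merged token's score. Simplified for illustration.
    """
    candidates = []
    for token, score in vocab_scores.items():
        for i in range(1, len(token)):
            left, right = token[:i], token[i:]
            if left in vocab_scores and right in vocab_scores:
                candidates.append((score, left, right))
    # Higher-scoring merged tokens come first in the merge list.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(left, right) for _, left, right in candidates]

# Toy vocab: "low" merges from "lo" + "w", and "lo" from "l" + "o".
vocab = {"l": 1, "o": 1, "lo": 5, "w": 1, "low": 9}
print(rebuild_merges(vocab))  # [('lo', 'w'), ('l', 'o')]
```

Since the rebuilt merges only depend on the vocab, the result should be functionally equivalent to the merges stored in the original checkpoint, just slower to obtain at load time.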
Any ideas why phi3.5 is not working?
Not really. Can you check that the phi3.5 on transformers works correctly? If that works correctly, you can try to gguf it then load it in transformers.
Hi @SunMarc! I would like to finalize this PR if possible, are there any blockers or can we get final reviews and get this merged in? Thanks!
Regarding phi3.5, the non-GGUF checkpoints have the same behavior; something is off about the template/prompt.
No blockers @a8nova. I really appreciate your work !
As for the merges issue, this is something we need to tackle separately from this PR. Can you create a separate issue for phi3.5 ? It seems that something is not working on transformers and we need to fix this first. Then, the gguf issue will be solved naturally.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thank you, @SunMarc! Before I file an issue for the phi3.5 bug, let me confirm it first; I am suspicious of what I saw and may have done something wrong. Also, I am curious: what is the purpose of requiring torch to run on the GPU in test_ggml.py? I have been running the tests locally on my CPU: https://github.com/huggingface/transformers/blob/7f112caac2ff365c3d6e0020fefe8c1300311e07/tests/quantization/ggml/test_ggml.py#L33-L36
Also, I am curious, what is the purpose of requiring torch to run on the GPU in test_ggml.py? I have been running the tests locally on my CPU
I guess it is simply to test the model faster. Maybe you can try on GPU to see if this fixes anything ?