Converting alpaca-native-GPTQ models into ggml models
Expected Behavior
Hello,
I wanted to convert the alpaca-native 7b GPTQ file (pt file) into a ggml file with the convert-gptq-to-ggml.py script https://github.com/ggerganov/llama.cpp/blob/master/convert-gptq-to-ggml.py
Current Behavior
The problem is that I get this error:
D:\Large Language Models\CONVERTISSEURS\gptq to ggml>python convert-gptq-to-ggml.py alpaca-native-4bit.pt tokenizer.model out.bin
32000
32001
Traceback (most recent call last):
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\convert-gptq-to-ggml.py", line 35, in <module>
    assert tokenizer.vocab_size() == n_vocab
AssertionError
32000 is tokenizer.vocab_size() (the number of tokens in tokenizer.model) and 32001 is n_vocab (the number of tokens in the model).
The model fine-tuned with Alpaca has one extra token, namely "[PAD]": 32000.
It looks like if we want to convert the alpaca-native GPTQ models, we need to create a new tokenizer.model that includes this "[PAD]" token.
The problem is that I have no idea how to do that... if someone can help me with this I'd appreciate it!
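To double-check, the mismatch is easy to reproduce directly (a quick snippet I put together, using the file names from above):

from sentencepiece import SentencePieceProcessor
import torch

# tokenizer.model has 32000 pieces, but the checkpoint's embedding matrix has
# 32001 rows (the extra [PAD] row), which is what trips the assert in the script.
sp = SentencePieceProcessor(model_file="tokenizer.model")
model = torch.load("alpaca-native-4bit.pt", map_location="cpu")
print(sp.vocab_size(), model["model.embed_tokens.weight"].shape[0])  # 32000 32001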
I wrote a tool to add additional tokens to tokenizer.model: https://github.com/Ronsor/llama-tools
The token list:
C [PAD]
would work with the script I wrote.
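For anyone who wants to do the same thing by hand, here's a rough sketch of the idea (not the tool itself; it assumes the sentencepiece pip package exposes its protobuf definitions):

from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    m.ParseFromString(f.read())

piece = m.pieces.add()       # the new piece gets the next free id (32000 here)
piece.piece = "[PAD]"
piece.score = 0.0
piece.type = sp_pb2.ModelProto.SentencePiece.CONTROL  # control type, presumably the "C" in the token list above

with open("tokenizer.model", "wb") as f:
    f.write(m.SerializeToString())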
@Ronsor I used your script and it looks like it did actually add the token to the tokenizer.model.
But now I have a new error... looks like the issue is more complex than I thought 😅
D:\Large Language Models\CONVERTISSEURS\gptq to ggml>python convert-gptq-to-ggml.py alpaca-native-4bit.pt tokenizer.model out.bin
32001
32001
Processing non-Q4 variable: model.embed_tokens.weight with shape: torch.Size([32001, 4096]) and type: torch.float32
Processing non-Q4 variable: model.norm.weight with shape: torch.Size([4096]) and type: torch.float32
Converting to float32
Processing non-Q4 variable: lm_head.weight with shape: torch.Size([32001, 4096]) and type: torch.float32
Traceback (most recent call last):
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\convert-gptq-to-ggml.py", line 153, in <module>
    convert_q4(f"model.layers.{i}.self_attn.q_proj", f"layers.{i}.attention.wq.weight", permute=True)
  File "D:\Large Language Models\CONVERTISSEURS\gptq to ggml\convert-gptq-to-ggml.py", line 94, in convert_q4
    zeros = model[f"{src_name}.zeros"].numpy()
KeyError: 'model.layers.0.self_attn.q_proj.zeros'
Looks like the zeros issue corresponds to a recent commit to GPTQ-for-LLaMa (with a very non-descriptive commit message) which changed the format. The change is not actually specific to Alpaca, but the alpaca-native-GPTQ weights published online were apparently produced with a later version of GPTQ-for-LLaMa. The zeros and scales are now separate for every group of 32 weights, but the zeros are now themselves scaled and quantized… I don't really understand how that makes sense. I'll figure it out when I have a chance.
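Not part of a fix, but a quick way to tell which export format a given .pt was produced with is to list one layer's keys (the older format stores a ".zeros" tensor per layer, the newer one ".qzeros"):

import torch

model = torch.load("alpaca-native-4bit.pt", map_location="cpu")
print(sorted(k for k in model if k.startswith("model.layers.0.self_attn.q_proj")))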
In the convert_q4(src_name, dst_name, permute=False) function I changed:
zeros = model[f"{src_name}.zeros"].numpy()
...
qweight = model[f"{src_name}.weight"].numpy().T # transpose
to
zeros = model[f"{src_name}.qzeros"].numpy()
...
qweight = model[f"{src_name}.qweight"].numpy().T # transpose
That results in these dimensions:
print(grouped.shape) -> (4096, 128, 4)
print(scales_rep.shape) -> (32, 524288, 1)
print(addends_rep.shape) -> (32, 65536, 1)
Which gives an error because we cannot concatenate those objects anymore.
Here's a comparison with the regular llama-7b-gptq model (which works fine with the converter):
print(grouped.shape) -> (4096, 128, 4)
print(scales_rep.shape) -> (4096, 128, 1)
print(addends_rep.shape) -> (4096, 128, 1)
At this point I'm stuck, as I'm uncertain about which elements (groupings, scales, addends) to modify in order to achieve the desired concatenation.
@comex I'm not sure it was a good idea to convert your addends and scales into int32, those tensors have really small numbers and we're losing all the information like that:

They're not 'really' int32s. Each int32 is actually 8 4-bit weights packed together. And they're not converted directly from float to integer; they have to be interpreted together with the addends and scales.
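To make that concrete, here's roughly how eight 4-bit values come out of each int32 (a sketch assuming the low nibble comes first, which is how GPTQ-for-LLaMa packs them as far as I can tell):

import numpy as np

def unpack_int4(packed):
    # (..., n) int32 -> (..., 8*n) values in [0, 15], unpacked along the last axis
    packed = np.ascontiguousarray(packed).view(np.uint32)
    shifts = np.arange(0, 32, 4, dtype=np.uint32)
    nibbles = (packed[..., None] >> shifts) & 0xF
    return nibbles.reshape(*packed.shape[:-1], -1)

The unpacked values only become weights once you apply the per-group scale and addend, roughly w ≈ scale * (q - zero); the exact sign/offset convention differs between GPTQ-for-LLaMa versions, so treat that as a sketch rather than the converter's actual formula.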
maybe you are lucky with this one? https://huggingface.co/ozcur/alpaca-native-4bit/tree/main maybe this was generated just before the zeros patch was merged.
Just tried, it fails with KeyError: 'model.layers.0.self_attn.q_proj.zeros'
I spent some time today working on this but didn't finish.
oobabooga merged a PR that makes the alpaca-7b-4bit-GPTQ-native work now: https://github.com/oobabooga/text-generation-webui/commit/49c10c5570b595e9d4fdcb496c456a9982ede070
It's funny that it works, because it uses the exact same tokenizer model (the one with 32000 tokens) even though this model has one more.
cool! do you see any significant improvements from GPTQ?
@daboe01 I have the RTN-quantized model in llama.cpp and the GPTQ-quantized one in the webui, but it would be hard to compare the two as they are a bit different in the way they work.
The best comparison would be RTN vs GPTQ in llama.cpp with a perplexity test, I'll wait for @comex to do his magic! 👀
PR is up; please try it and let me know if there are issues.
The PR consists of a new script which is meant to replace the existing ones; run it with a command like:
python convert.py alpaca-native-4bit.pt --vocab-dir VOCAB_DIR
where VOCAB_DIR is a directory containing both tokenizer.model and added_tokens.json (the latter of which is specific to Alpaca).
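For reference, the added_tokens.json for alpaca-native presumably just maps the extra token to its id in the usual Hugging Face format, something like this (my assumption, matching the [PAD] token discussed above):

import json

with open("added_tokens.json", "w") as f:
    json.dump({"[PAD]": 32000}, f)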
I just tried it and it works like a charm!! GPTQ-quantized models will become the standard, and thanks to you CPU users can enjoy them as well.
Thanks again for your really important work 😄 👍
@BadisG Did you notice an increase in model size after converting to ggml? The 7B one I converted went from 3.77 GB to 5.39 GB and inference is significantly slower, but it works.
@Belluxx Yeah, the file got bigger, maybe it could be more optimized, I don't know, only @comex has the explanation for that 😅
@BadisG Thanks for the info, at least now I know that it's not just me
Hmm, it's probably because of the addends (aka zeros). The newer GPTQ-for-LLaMa format quantizes the addends, but llama.cpp doesn't support that, so the script dequantizes them. I didn't realize it would make that big of a difference in size; sounds like it would be useful to add native support for quantized addends to llama.cpp.
But I don't know what you mean by "inference is significantly slower". Compared to what? If the comparison is to a GPU implementation then yes, llama.cpp will be slower.
@comex Thank you for the explanation. About the slower inference, I forgot to mention that it was due to swap, because I only have 8 GB of RAM. However, it's a bit weird since I didn't have anything open in the background.
Yeah, it's a bit slower when using the GPTQ version:
Regular RTN quantization:
main: seed = 1
system_info: n_threads = 14 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.700000, top_k = 40, top_p = 0.100000, repeat_last_n = 2048, repeat_penalty = 1.250000
generate: n_ctx = 2024, n_batch = 8, n_predict = 2024, n_keep = 0
Here's 5 reasons that proves video-games are good for your brain:
1. Video games can help improve cognitive skills such as memory, problem solving and reaction time. Studies have found that regular gamers show improved performance in these areas compared to non-players.
2. Research has also shown that playing action or adventure games increases the density of neurons in the hippocampus, which is associated with learning and emotional processing. This suggests that gaming could be beneficial for overall mental health.
3. Playing puzzle and strategy games helps sharpen abstract thinking abilities by requiring players to think ahead and plan strategies. These types of games may even increase creativity levels.
4. In addition, research shows that engaging in mentally challenging activities like gaming can reduce inflammation in the brain, protect against age-related declines in cognition, and slow down the progression of neurodegenerative diseases.
5. Finally, studies suggest that virtual reality (VR) technology offers a unique opportunity to explore how different experiences affect people’s brains. VR provides an immersive experience that allows users to interact with digital environments while being monitored through physiological measures. Through this type of experiment, scientists hope to gain insight into how our minds work and what effects certain stimuli might have on us both psychologically and physiologically. [end of text]
llama_print_timings: load time = 6639.43 ms
llama_print_timings: sample time = 955.15 ms / 283 runs ( 3.38 ms per run)
llama_print_timings: prompt eval time = 1715.08 ms / 19 tokens ( 90.27 ms per token)
llama_print_timings: eval time = 60649.22 ms / 282 runs ( 215.07 ms per run)
llama_print_timings: total time = 70376.90 ms
GPTQ quantization:
main: seed = 1
system_info: n_threads = 14 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.700000, top_k = 40, top_p = 0.100000, repeat_last_n = 2048, repeat_penalty = 1.250000
generate: n_ctx = 2024, n_batch = 500, n_predict = 2024, n_keep = 0
Here's 5 reasons that proves video-games are good for your brain:
1. Improves problem solving skills - Playing puzzle and strategy games can help improve problem solving skills by requiring players to think logically, strategize and make decisions in order to progress through the game. This type of thinking is useful when applied to real life situations where logical thought processes need to be employed.
2. Enhances spatial awareness – Many action adventure or first person shooter (FPS) games require quick reflexes as well as an understanding of how to maneuver around obstacles on a virtual map. These types of games enhance one’s spatial awareness which helps with navigation in everyday life.
3. Boosts memory retention– Memory retention refers to the ability to remember information over time. Video games have been found to increase short term recall and long term storage of information in the brain. Studies show improved cognitive function after playing certain video games.
4. Strengthens hand eye coordination – Playing fast paced action games such as FPS or fighting games requires excellent hand eye coordination. The act of quickly aiming and shooting at targets has been shown to strengthen this skill set in gamers. Increased accuracy leads to better reaction times in other areas of gaming and even sports.
5. Encourages creative thinking – Creative thinking involves using abstract thoughts to solve problems. Games like brainteasers, logic puzzles and riddles encourage out of the box solutions to complex issues. This encourages innovation and lateral thinking which can lead to new ideas and inventions. [end of text]
llama_print_timings: load time = 2094.55 ms
llama_print_timings: sample time = 1084.30 ms / 331 runs ( 3.28 ms per run)
llama_print_timings: prompt eval time = 2227.22 ms / 19 tokens ( 117.22 ms per token)
llama_print_timings: eval time = 87885.16 ms / 330 runs ( 266.32 ms per run)
llama_print_timings: total time = 93656.60 ms
It's something like ~20% slower, which is probably expected because the RTN version is 4.1 GB and the GPTQ version is 5.2 GB (a 27% difference).
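A back-of-envelope estimate lines up with those sizes (my own rough numbers, assuming the RTN file is ggml Q4_0 and the GPTQ-converted file comes out as ggml Q4_1, which carries an extra fp32 addend per 32-weight block to hold the dequantized zeros):

n_params = 6.7e9  # ~7B LLaMA

q4_0_bits = (16 + 4) * 8 / 32      # 20 bytes per 32 weights -> 5.0 bits/weight
q4_1_bits = (16 + 4 + 4) * 8 / 32  # 24 bytes per 32 weights -> 6.0 bits/weight

print(f"Q4_0 ~ {n_params * q4_0_bits / 8 / 1e9:.1f} GB")  # ~4.2 GB vs the 4.1 GB RTN file
print(f"Q4_1 ~ {n_params * q4_1_bits / 8 / 1e9:.1f} GB")  # ~5.0 GB vs the 5.2 GB GPTQ file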