Model files are big?
https://huggingface.co/stabilityai/stablelm-base-alpha-3b/tree/main
Looks like 3B is 14.7GB, and if I understand correctly, it's supposed to be f16. Even with f32, it should be about 11.2G. With f16, 5.6G. Am I missing something?
For reference LLaMA 7B (f16) is 12.6G.
Update: I guess it's actually f32. But it still seems a little bigger than it should be?
The actual model sizes (parameter counts) are:
3B: 3,638,525,952
7B: 7,869,358,080
The fp32 weights are provided to allow users to reduce precision to suit their needs. We will consider providing the weights in f16, since this is a common complaint :)
Thank you for pointing it out!
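In the meantime, users can cast the checkpoint down themselves. A minimal sketch with transformers (the output path is just a placeholder, and you need enough RAM to hold the fp32 weights while converting):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "stabilityai/stablelm-base-alpha-3b"

# Load the released fp32 weights, then cast them to fp16.
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float32)
model = model.half()
tokenizer = AutoTokenizer.from_pretrained(repo)

# Save a local fp16 copy (roughly half the disk footprint).
model.save_pretrained("stablelm-base-alpha-3b-fp16")      # placeholder path
tokenizer.save_pretrained("stablelm-base-alpha-3b-fp16")  # placeholder path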
Ok, the size seems about right then.
# sizes taken from disk; the Hugging Face UI shows sizes divided by 1000**3
>>> (10_161_140_290 + 4_656_666_941) / 1024 / 1024 / 1024  # the two .bin shards on disk
13.800158380530775
>>> (3_638_525_952 * 4) / 1024 / 1024 / 1024  # expected f32 size: params * 4 bytes
13.5545654296875
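Halving that gives the rough f16 footprint, for comparison:
>>> (3_638_525_952 * 2) / 1024 / 1024 / 1024  # expected f16 size: params * 2 bytes
6.77728271484375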
f16 weights would be nice, to download less stuff
@jon-tow on this topic, do you expect these models to quantize well down to 4 bits (or lower) via GPTQ and/or other quantization strategies?
I don't see why not, since GPTQ seems to be a general technique that works well across different transformer models. But I'm asking because part of the reason behind Stable Diffusion's success is how well it runs on consumer hardware. So I'm wondering whether these models will follow a similar goal of running well on consumer hardware, and therefore consider quantization from the very beginning?
Hi, @andysalerno! I do expect these models to quantize quite well. They're pretty wide, which should help reduce memory-bandwidth boundedness compared to similarly sized models once quantized.
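For anyone who wants to try it, here's a rough sketch of 4-bit GPTQ quantization with the AutoGPTQ package. This is not an official recipe: the calibration text and output directory are placeholders, real runs need a much larger calibration set, and the exact API may differ between AutoGPTQ versions.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

repo = "stabilityai/stablelm-base-alpha-7b"
tokenizer = AutoTokenizer.from_pretrained(repo)

# Tiny calibration set, just to show the shape of the API.
examples = [tokenizer("StableLM is a language model trained by Stability AI.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(repo, quantize_config)

model.quantize(examples)                             # run GPTQ calibration
model.save_quantized("stablelm-base-alpha-7b-4bit")  # placeholder output dir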
There's a 4.9GB ggml 4bit GPTQ quantization for StableLM-7B up on HuggingFace which works in llama.cpp for fast CPU inference.
(For comparison, LLaMA-7B in the same format is 4.1GB. But, StableLM-7B is actually closer to 8B parameters than 7B.)
For the sake of convenience (half the download size/RAM/VRAM), I've uploaded 16-bit versions of the tuned models to the HF Hub:
https://huggingface.co/vvsotnikov/stablelm-tuned-alpha-7b-16bit
https://huggingface.co/vvsotnikov/stablelm-tuned-alpha-3b-16bit
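They load directly in half precision. A minimal sketch (device_map="auto" needs accelerate installed; the prompt here ignores the tuned models' chat format and is just a smoke test):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "vvsotnikov/stablelm-tuned-alpha-3b-16bit"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.float16,  # keep the weights in fp16 to halve RAM/VRAM
    device_map="auto",          # needs `accelerate` for automatic device placement
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))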
Yeah, we need a Colab for this stuff that doesn't crash from running out of RAM lol
> There's a 4.9GB ggml 4bit GPTQ quantization for StableLM-7B up on HuggingFace which works in llama.cpp for fast CPU inference.
> (For comparison, LLaMA-7B in the same format is 4.1GB. But, StableLM-7B is actually closer to 8B parameters than 7B.)
Hm, how do you actually run this? I tried https://github.com/ggerganov/llama.cpp (at commits 4afcc378698e057fcde64e23eb664e5af8dd6956 and 5addcb120cf2682c7ede0b1c520592700d74c87c) and got:
./main -m ../ggml-q4_0-stablelm-tuned-alpha-7b/ggml-model-stablelm-tuned-alpha-7b-q4_0.bin -p "this is a test"
main: seed = 1682468827
llama.cpp: loading model from ../ggml-q4_0-stablelm-tuned-alpha-7b/ggml-model-stablelm-tuned-alpha-7b-q4_0.bin
error loading model: missing tok_embeddings.weight
llama_init_from_file: failed to load model
main: error: failed to load model '../ggml-q4_0-stablelm-tuned-alpha-7b/ggml-model-stablelm-tuned-alpha-7b-q4_0.bin'
Hi @jon-tow @python273, why are there multiple .bin files inside stabilityai/stablelm-base-alpha-7b? When we load the model, which .bin file is loaded?