
Slow performance

Open · alfredcs opened this issue 2 years ago · 1 comment

It took more than 2 hours to generate 6 words with the LLaMA 65B model in f16 or q8_0. The server has 1x A10G (24 GB VRAM) and 8 vCPUs / 350 GB RAM. Wondering how to speed up the inference.

```
% ./main -m /model/hf/llama-65b-hf/ggml-model-q8_0.bin --temp 0.1 --top-p 0.90 --top-k 3 -n 128 -p 'who is elon musk?'
main: build = 681 (a09f919)
main: seed  = 1686941892
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A10G
llama.cpp: loading model from /model/hf/llama-65b-hf/ggml-model-q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0.18 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 69679.91 MB (+ 5120.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/83 layers to GPU
llama_model_load_internal: total VRAM used: 512 MB
....................................................................................................
llama_init_from_file: kv self size  = 1280.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 3, tfs_z = 1.000000, top_p = 0.900000, typical_p = 1.000000, temp = 0.100000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0

 who is elon musk? Elon Musk is a South African-
```

alfredcs · Jun 16 '23 20:06

I see nothing is offloaded to the GPU, because you did not specify it (`offloaded 0/83 layers to GPU`), and you are also running on the default thread count because you did not set that either (system_info shows `n_threads = 4 / 8`). For this machine I suggest using 8 threads, offloading 46 layers to the GPU, and using the 65B model in the q4_K_M quantization instead of q8_0 (q8_0 is only slightly better, almost not noticeable). Then you should get almost 2 tokens/s.
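The suggestion above can be sketched as a concrete invocation. This is a hedged example, not a tested command for this exact box: the q4_K_M model path is an assumption (you would need to quantize or download that file first), and 46 layers is the commenter's estimate of what fits in 24 GB of VRAM, so adjust `--n-gpu-layers` down if you hit out-of-memory errors.

```shell
# Illustrative command based on the advice above:
#   -t 8               use all 8 vCPUs for the layers left on the CPU
#   --n-gpu-layers 46  offload 46 of the 80 repeating layers to the A10G
# The q4_K_m filename below is hypothetical; point -m at your own q4_K_M file.
./main -m /model/hf/llama-65b-hf/ggml-model-q4_K_m.bin \
  -t 8 \
  --n-gpu-layers 46 \
  --temp 0.1 --top-p 0.90 --top-k 3 \
  -n 128 -p 'who is elon musk?'
```

With partial offload, the `system_info` line should then report `n_threads = 8 / 8` and the load log should show `offloaded 46/83 layers to GPU` instead of `0/83`.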

mirek190 · Jun 16 '23 22:06