saood06

25 comments of saood06

> Thanks for the PR! What are the advantages of this over `try_buffer_unordered` and `try_buffered`?

Sorry, I did not know they existed; they do seem to handle the use case...
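
For context, `try_buffer_unordered` and `try_buffered` are combinators from the Rust `futures` crate that run a bounded number of fallible futures from a stream concurrently. A minimal sketch of `try_buffer_unordered`, assuming the `futures` and `tokio` crates; the doubling workload is only a placeholder:

```rust
use futures::stream::{self, TryStreamExt};

#[tokio::main]
async fn main() -> Result<(), std::io::Error> {
    // A fallible stream whose Ok items are themselves fallible futures.
    let jobs = stream::iter((1..=8u32).map(|i| {
        Ok::<_, std::io::Error>(async move { Ok::<u32, std::io::Error>(i * 2) })
    }));

    // Poll up to 4 of the inner futures at a time; results are yielded in
    // completion order, and the first Err short-circuits the stream.
    let results: Vec<u32> = jobs.try_buffer_unordered(4).try_collect().await?;
    println!("{results:?}");
    Ok(())
}
```

`try_buffered` is the same idea but yields results in the original stream order.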

@fairydreaming Is there any reason this would cause issues with RPC? Encountered:

```
ggml_cuda_compute_forward: cannot compute kqv-31: src0->ne[3] = 1, src1->ne[3] = 2 - fallback to CPU
evaluate_and_capture_cuda_graph: op not...
```

You should also test performance; here is an example of my performance results from testing some DeepSeek V3 mixes that are very close in size.

![performance_comparison_tg](https://github.com/user-attachments/assets/2de07604-c800-4198-9581-4dce6b99ac9b)

Showing where I have...
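
For reference, token-generation comparisons like the one plotted above are typically gathered with llama.cpp's `llama-bench`; a minimal sketch with hypothetical model paths:

```bash
# Compare token-generation (tg) speed of two similarly sized mixes.
# -p 0 skips the prompt-processing test; -n 128 generates 128 tokens per run.
./llama-bench -m ./deepseek-v3-mixA.gguf -p 0 -n 128
./llama-bench -m ./deepseek-v3-mixB.gguf -p 0 -n 128
```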

> Are those both Q4_K_M? I didn't swap out any weights for that one to, say, I-quants, so I wouldn't have expected any major performance differences.
>
> Also all...

> Yeah, you're right, I should double-check some performance metrics, but I think overall it should be similar to before since I'm not introducing any crazy changes like swapping...

> Edit: tested this, found no meaningful difference in performance.

Not my experience so far; this seems to help me.

> It could be that the performance gains are only visible when using longer context.

Based on my hardware, and my habit of looking at numastat while running models on my...
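
For anyone who wants to watch NUMA placement the same way, `numastat` can be pointed at a running process; a small sketch (the process name is an example):

```bash
# Per-NUMA-node memory breakdown for a running llama-server instance,
# refreshed every 5 seconds; heavily skewed node totals hint at
# cross-node memory traffic.
watch -n 5 'numastat -p "$(pidof llama-server)"'
```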

> I made a custom `Q5_K_XL` 463GB quant where everything is `Q8_0` apart from the non-shared experts' tensors (`Q5_K` for up/gate projections and `Q6_K` for down projections).
> a ~250GB `Q2_K`...
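
Custom mixes like this can be built with llama.cpp's `llama-quantize`, which in recent builds accepts per-tensor type overrides. A sketch under that assumption; the `--tensor-type` patterns and file names here are illustrative, so check `llama-quantize --help` on your build:

```bash
# Base type Q8_0, with only the routed (non-shared) experts pushed down:
# up/gate projections to Q5_K and down projections to Q6_K.
./llama-quantize \
  --tensor-type ffn_up_exps=q5_k \
  --tensor-type ffn_gate_exps=q5_k \
  --tensor-type ffn_down_exps=q6_k \
  ./deepseek-v3-f16.gguf ./deepseek-v3-q5_k_xl.gguf q8_0
```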

> Oh, there's another behavior which differs: when a tensor sub-matrix is not exercised by the dataset (e.g. a MoE model with unused experts), the old format skips the entire...

> Quick, non-scientific initial test with DeepSeek R1 at Q6 on llama-server with `-ot exps=CPU`:
>
> -ngl 0 = 4.65 t/s
> -ngl 10 = 5.15 t/s
> -ngl 20 = 5.64 t/s
> -ngl ...
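
For reference, the command shape behind numbers like those (the model path is a placeholder):

```bash
# Keep all routed experts' tensors (names matching "exps") on CPU while
# -ngl offloads the remaining weights of N layers to the GPU; sweep -ngl
# to find the best fit for available VRAM.
./llama-server -m ./DeepSeek-R1-Q6_K.gguf -ot "exps=CPU" -ngl 20
```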