artetaout

Results: 7 issues by artetaout

The training log is:

```
...ng_rate': 5.263157894736842e-06, 'epoch': 0.02}
{'loss': 1.2607, 'learning_rate': 6.31578947368421e-06, 'epoch': 0.03}
{'loss': 1.2007, 'learning_rate': 7.368421052631579e-06, 'epoch': 0.03}
{'loss': 1.1451, 'learning_rate': 8.421052631578948e-06, 'epoch': 0.04}
{'loss': 1.1491, 'learning_rate': 9.473684210526315e-06, ...
```
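The logged learning rates rise by a constant 1e-4/19 per step, which is consistent with a linear warmup schedule. The values `base_lr = 2e-5` and `warmup_steps = 19` below are inferred from that arithmetic, not stated in the issue; this is a hedged reading of the log, not the actual training config.

```python
import math

# Assumed values, reverse-engineered from the logged learning rates;
# the real training config may differ.
base_lr, warmup_steps = 2e-5, 19

def warmup_lr(step):
    """Linear warmup: lr grows from 0 to base_lr over warmup_steps."""
    return base_lr * min(step, warmup_steps) / warmup_steps

# Learning rates copied from the log; they match steps 5..9 of the
# assumed schedule.
logged = [5.263157894736842e-06, 6.31578947368421e-06,
          7.368421052631579e-06, 8.421052631578948e-06,
          9.473684210526315e-06]
for step, lr in enumerate(logged, start=5):
    assert math.isclose(warmup_lr(step), lr, rel_tol=1e-9)
```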

## Accuracy

* deepgemm_fp8

```
lm_eval --model vllm --model_args pretrained=/mnt/Meta-Llama-3-8B/,quantization=deepgemm_fp8 --tasks gsm8k --num_fewshot 5 --batch_size auto
...
2025-03-03:13:12:38,519 INFO [evaluation_tracker.py:269] Output path not provided, skipping saving results aggregated
vllm (pretrained=/mnt/Meta-Llama-3-8B/,quantization=deepgemm_fp8), ...
```

I ran the examples in te_comm_gemm_overlap.py and removed the backward code, keeping only the forward pass. Compared to TP allreduce, the TP/SP allgather + reduce-scatter path is slower. My GPU setup is 8× H20...
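For context on why the two paths should produce identical results, here is a plain NumPy sketch (not Transformer Engine code) of the collective identity behind the comparison: an allreduce over N ranks is mathematically equivalent to a reduce-scatter followed by an allgather, which is what the TP/SP path performs. The difference between the two is communication pattern and overlap opportunity, not output.

```python
import numpy as np

N = 4                                                # simulated rank count
rank_inputs = [np.random.rand(8) for _ in range(N)]  # one partial sum per rank
shard = 8 // N                                       # elements per shard

# allreduce: every rank ends up with the full elementwise sum
allreduce_out = sum(rank_inputs)

# reduce_scatter: each rank i reduces only its shard of the sum
shards = [sum(x[i * shard:(i + 1) * shard] for x in rank_inputs)
          for i in range(N)]

# allgather: ranks exchange shards to reassemble the full sum
allgather_out = np.concatenate(shards)

print(np.allclose(allreduce_out, allgather_out))  # True
```

Since the results match, a slowdown like the one described points at the communication schedule (message sizes, kernel launches, overlap) rather than the math.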

I saw your code referring to PD (prefill/decode) disaggregation. Please tell me how to use it.

Used this command:

```
python3 examples/quant_model.py \
  --model_path /sda/DeepSeek-R1-Distill-Llama-70B \
  --tokenizer_path /sda/DeepSeek-R1-Distill-Llama-70B \
  --dtype float16 \
  --smooth false \
  --rotation true \
  --dataset wikitext2 \
  --nsamples 128 \
  --w_quantizer FixedQuantize \
  --w_group_size ...
```

### Describe your usage question

Mooncake v0.3.7.post2. Refer to `https://kvcache-ai.github.io/Mooncake/getting_started/quick-start.html#transfer-engine-quick-start`. I changed the `client_buffer` to a tensor and the `server_buffer` to a GPU tensor, and finally got a coredump... So do we currently support npu->gpu? ...