TheTinyTeddy
> FA3 FP8 code is already public in this repo. Accuracy is an open problem; I don't think the community has a consensus on the best way to quantize....
> You can try out the scalings you suggest (input fp16 but cast to fp8 for matmul) and measure accuracy. This can be done independently of FA3. I don't...
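A minimal sketch of the suggested experiment, independent of FA3: cast fp16 operands to fp8 (e4m3) with a per-tensor scale, dequantize, and compare the matmul result against the fp16 reference. The function names and the per-tensor absmax scaling scheme here are my own assumptions for illustration, not code from the repo.

```python
import torch

def fp8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    """Quantize to fp8 e4m3 with a per-tensor absmax scale, then dequantize (simulation)."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max            # 448.0 for e4m3
    scale = x.float().abs().amax().clamp(min=1e-12) / fp8_max  # hypothetical per-tensor scale
    return ((x.float() / scale).to(torch.float8_e4m3fn).float() * scale).to(x.dtype)

if __name__ == "__main__":
    torch.manual_seed(0)
    q = torch.randn(128, 64, dtype=torch.float16)
    k = torch.randn(64, 128, dtype=torch.float16)
    ref = q.float() @ k.float()                                # fp16 inputs, reference matmul
    out = fp8_roundtrip(q).float() @ fp8_roundtrip(k).float()  # fp8-quantized path
    rel_err = (out - ref).abs().mean() / ref.abs().mean()
    print(f"mean relative error vs fp16 reference: {rel_err.item():.4f}")
```

This only measures the quantization error of the cast itself (the matmul is still done in higher precision), which is enough to compare scaling choices before touching the FA3 kernels.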
> The quantization is done along the `k` dimension.

Thank you for the clarification! Since the weight is quantized as `W = Q * block_scale + block_min`, when dequantized...
Discussion continues here: https://github.com/ggml-org/llama.cpp/discussions/13507
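For reference, a minimal sketch (my own, not llama.cpp's actual kernels) of asymmetric block quantization along the `k` dimension matching the formula above, `W ≈ Q * block_scale + block_min`; the block size and bit width are illustrative choices.

```python
import torch

def quantize_blocks(w: torch.Tensor, block: int = 32, bits: int = 4):
    """Quantize each `block`-sized chunk along the last (k) dimension."""
    n, k = w.shape
    assert k % block == 0
    wb = w.reshape(n, k // block, block)
    w_min = wb.amin(dim=-1, keepdim=True)                      # block_min
    w_max = wb.amax(dim=-1, keepdim=True)
    scale = ((w_max - w_min) / (2**bits - 1)).clamp(min=1e-12)  # block_scale
    q = ((wb - w_min) / scale).round().clamp(0, 2**bits - 1)    # integer codes Q
    return q, scale, w_min

def dequantize_blocks(q, scale, w_min, shape):
    """Reconstruct W = Q * block_scale + block_min."""
    return (q * scale + w_min).reshape(shape)

if __name__ == "__main__":
    w = torch.randn(8, 64)
    q, s, m = quantize_blocks(w)
    w_hat = dequantize_blocks(q, s, m, w.shape)
    print("max abs reconstruction error:", (w - w_hat).abs().max().item())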
Thank you for the swift reply! I was wondering if there are any plans to release a bf16 version?
Many thanks for the reply! The author has published a new paper, TetraJet-v2 (https://arxiv.org/abs/2510.27527), which uses NVFP4 for LLM training and shows that 1D outperforms 2D weight quantization. FYI:...
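To illustrate what the 1D vs 2D distinction refers to (a rough sketch, not TetraJet-v2's actual method): 1D scaling assigns one scale per 1x16 block along a single axis, while 2D scaling shares one scale per 16x16 tile. The absmax fake-quantization and the 4-bit-style level count below are assumptions for illustration, not NVFP4 itself.

```python
import torch

def fake_quant(w: torch.Tensor, scale: torch.Tensor, levels: int = 7) -> torch.Tensor:
    """Symmetric fake-quantization to +/-`levels` integer steps given a block scale."""
    return (w / scale).round().clamp(-levels, levels) * scale

def quant_1d(w: torch.Tensor, block: int = 16) -> torch.Tensor:
    # One scale per 1 x `block` slice along the k axis.
    n, k = w.shape
    wb = w.reshape(n, k // block, block)
    scale = wb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 7
    return fake_quant(wb, scale).reshape(n, k)

def quant_2d(w: torch.Tensor, block: int = 16) -> torch.Tensor:
    # One scale per `block` x `block` tile.
    n, k = w.shape
    wb = w.reshape(n // block, block, k // block, block)
    scale = wb.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12) / 7
    return fake_quant(wb, scale).reshape(n, k)

if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(256, 256)
    for name, fn in [("1D", quant_1d), ("2D", quant_2d)]:
        mse = (fn(w) - w).pow(2).mean().item()
        print(f"{name} block scaling reconstruction MSE: {mse:.6f}")
```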