TheTinyTeddy
> FA3 FP8 code is already public in this repo. Accuracy is an open problem; I don't think the community has a consensus on the best way to quantize....
> You can try out the scalings you suggest (input fp16 but cast to fp8 for matmul) and measure accuracy. This can be done independently of FA3. I don't...
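A minimal sketch of the suggested experiment, independent of FA3: cast fp16 operands to fp8 (e4m3) with a per-tensor scale, dequantize, and compare the matmul result against the fp16 reference. The function names and the per-tensor absmax scaling scheme here are my own assumptions for illustration, not code from the repo.

```python
import torch

def fp8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    """Quantize to fp8 e4m3 with a per-tensor absmax scale, then dequantize (simulation)."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max            # 448.0 for e4m3
    scale = x.float().abs().amax().clamp(min=1e-12) / fp8_max  # hypothetical per-tensor scale
    return ((x.float() / scale).to(torch.float8_e4m3fn).float() * scale).to(x.dtype)

if __name__ == "__main__":
    torch.manual_seed(0)
    q = torch.randn(128, 64, dtype=torch.float16)
    k = torch.randn(64, 128, dtype=torch.float16)
    ref = q.float() @ k.float()                                # fp16 inputs, reference matmul
    out = fp8_roundtrip(q).float() @ fp8_roundtrip(k).float()  # fp8-quantized path
    rel_err = (out - ref).abs().mean() / ref.abs().mean()
    print(f"mean relative error vs fp16 reference: {rel_err.item():.4f}")
```

This only measures the quantization error of the cast itself (the matmul is still done in higher precision), which is enough to compare scaling choices before touching the FA3 kernels.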
> The quantization is done along the `k` dimension.

Thank you for the clarification! Since the weight is quantized as `W = Q * block_scale + block_min`, when dequantized...
Discussion continues here: https://github.com/ggml-org/llama.cpp/discussions/13507
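For reference, a minimal sketch (my own, not llama.cpp's actual kernels) of asymmetric block quantization along the `k` dimension matching the formula above, `W ≈ Q * block_scale + block_min`; the block size and bit width are illustrative choices.

```python
import torch

def quantize_blocks(w: torch.Tensor, block: int = 32, bits: int = 4):
    """Quantize each `block`-sized chunk along the last (k) dimension."""
    n, k = w.shape
    assert k % block == 0
    wb = w.reshape(n, k // block, block)
    w_min = wb.amin(dim=-1, keepdim=True)                      # block_min
    w_max = wb.amax(dim=-1, keepdim=True)
    scale = ((w_max - w_min) / (2**bits - 1)).clamp(min=1e-12)  # block_scale
    q = ((wb - w_min) / scale).round().clamp(0, 2**bits - 1)    # integer codes Q
    return q, scale, w_min

def dequantize_blocks(q, scale, w_min, shape):
    """Reconstruct W = Q * block_scale + block_min."""
    return (q * scale + w_min).reshape(shape)

if __name__ == "__main__":
    w = torch.randn(8, 64)
    q, s, m = quantize_blocks(w)
    w_hat = dequantize_blocks(q, s, m, w.shape)
    print("max abs reconstruction error:", (w - w_hat).abs().max().item())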
Thank you for the swift reply! I was wondering if there are any plans to release a bf16 version?
Many thanks for the reply! The author has published a new paper, TetraJet-v2 (https://arxiv.org/abs/2510.27527), which uses NVFP4 for LLM training and shows that 1D outperforms 2D weight quantization. FYI:...
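To illustrate what the 1D vs 2D distinction refers to (a rough sketch, not TetraJet-v2's actual method): 1D scaling assigns one scale per 1x16 block along a single axis, while 2D scaling shares one scale per 16x16 tile. The absmax fake-quantization and the 4-bit-style level count below are assumptions for illustration, not NVFP4 itself.

```python
import torch

def fake_quant(w: torch.Tensor, scale: torch.Tensor, levels: int = 7) -> torch.Tensor:
    """Symmetric fake-quantization to +/-`levels` integer steps given a block scale."""
    return (w / scale).round().clamp(-levels, levels) * scale

def quant_1d(w: torch.Tensor, block: int = 16) -> torch.Tensor:
    # One scale per 1 x `block` slice along the k axis.
    n, k = w.shape
    wb = w.reshape(n, k // block, block)
    scale = wb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 7
    return fake_quant(wb, scale).reshape(n, k)

def quant_2d(w: torch.Tensor, block: int = 16) -> torch.Tensor:
    # One scale per `block` x `block` tile.
    n, k = w.shape
    wb = w.reshape(n // block, block, k // block, block)
    scale = wb.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12) / 7
    return fake_quant(wb, scale).reshape(n, k)

if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(256, 256)
    for name, fn in [("1D", quant_1d), ("2D", quant_2d)]:
        mse = (fn(w) - w).pow(2).mean().item()
        print(f"{name} block scaling reconstruction MSE: {mse:.6f}")
```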