
Question regarding the 2D weight quantization and quality degradation in 1D case

Open TheTinyTeddy opened this issue 3 months ago • 2 comments

Many thanks for the great work!

In the paper https://arxiv.org/pdf/2502.20853 they successfully use 1D weight quantization with requantization, and in their repo (https://github.com/thu-ml/TetraJet-MXFP4Training/issues/2#issuecomment-3454394125) the author mentions that, in their experience, 1D weight quantization produces better results than 2D weight quantization.

So I was wondering how you implemented the 1D weight quantization (in the transpose dimension in the backward pass) in your paper such that it produces worse results than the 2D case. Did you also use requantization (i.e., quantize, then dequantize, then transpose, then quantize again) for the 1D weight quantization? A rough sketch of what I mean by requantization is below.
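For concreteness, here is a minimal sketch of the two ways the transposed FP4 copy could be produced. The `quantize_1d` / `dequantize_1d` helpers and the integer rounding are simplified stand-ins for real FP4 quantization, not TransformerEngine or TetraJet APIs:

```python
import torch

FP4_MAX = 6.0  # largest magnitude representable in FP4 (E2M1)

def quantize_1d(x, block=16):
    """Toy 1D block quantization: one scale per `block` contiguous elements
    of the last dimension (integer rounding stands in for real FP4 rounding)."""
    rows, cols = x.shape
    xb = x.reshape(rows, cols // block, block)
    scale = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP4_MAX
    q = torch.round(xb / scale).clamp(-FP4_MAX, FP4_MAX)
    return q, scale

def dequantize_1d(q, scale):
    return (q * scale).reshape(q.shape[0], -1)

w = torch.randn(128, 128)  # stands in for the BF16 master weights

# (a) Quantize both copies directly from the master weights.
q_w,  s_w  = quantize_1d(w)                    # used in the forward GEMM
q_wt, s_wt = quantize_1d(w.t().contiguous())   # used in the backward GEMM

# (b) Requantization: quantize, dequantize, transpose, quantize again,
#     so the transposed copy is derived from the already-quantized weights.
q1, s1        = quantize_1d(w)
w_deq         = dequantize_1d(q1, s1)
q_wt_rq, s_rq = quantize_1d(w_deq.t().contiguous())
```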

TheTinyTeddy avatar Oct 29 '25 02:10 TheTinyTeddy

Hi @TheTinyTeddy, thanks for your question.

No, we do not use requantization with 1D blocks. Both the regular and transposed copies are produced from the same BF16 version of the weights.

Please keep in mind that the recipe proposed in that paper is quite different from what we are doing. First of all, they use the MXFP4 datatype, whereas we use NVFP4, and those two types are very different. MXFP4 uses blocks of size 32 and, probably most importantly here, an E8M0 scaling factor (a power of 2). NVFP4 uses blocks of size 16 (so the 2D scheme uses 16x16 = 256 elements per scaling factor, rather than the 1024 that would be the case for MXFP4) and a scaling factor of type E4M3, which enables more precise quantization but also introduces a larger difference between tensors when requantizing to get the transpose.

The experiment shown in "Pretraining Large Language Models with NVFP4" also looks at a larger model (12B) and a longer token horizon (10T tokens), which exposes more differences between recipes: a setting that may not matter for a small model may impact a larger model's final accuracy.
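To illustrate the difference in scaling granularity and scale type mentioned above, here is a small sketch. The absmax-based scales and the power-of-two rounding are simplified stand-ins, not library APIs:

```python
import torch

FP4_MAX = 6.0  # largest magnitude representable in FP4 (E2M1)

def scales_1d(x, block=16):
    """NVFP4-style 1D: one scale per 1 x 16 slice of the last dimension."""
    rows, cols = x.shape
    amax = x.reshape(rows, cols // block, block).abs().amax(dim=-1)
    return amax / FP4_MAX

def scales_2d(x, block=16):
    """NVFP4-style 2D: one scale per 16 x 16 tile, i.e. 256 elements per scale."""
    rows, cols = x.shape
    tiles = x.reshape(rows // block, block, cols // block, block)
    amax = tiles.abs().amax(dim=(1, 3))
    return amax / FP4_MAX

def to_e8m0(scale):
    """MXFP4-style scale: rounded up to a power of two (E8M0),
    much coarser than the E4M3 scales used by NVFP4."""
    return torch.exp2(torch.ceil(torch.log2(scale.clamp(min=1e-38))))

w = torch.randn(128, 128)
print(scales_1d(w).numel())   # 1024 scales (1x16 blocks)
print(scales_2d(w).numel())   # 64 scales   (16x16 blocks)
```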

ptrendx avatar Nov 04 '25 21:11 ptrendx

Many thanks for the reply!

The author has published a new paper, TetraJet-v2 (https://arxiv.org/abs/2510.27527), which uses NVFP4 for LLM training and shows that 1D outperforms 2D weight quantization.

FYI: "Thanks for your response! We would like to share with you our new paper TetraJet-v2: https://arxiv.org/abs/2510.27527

In this paper we adopt fully NVFP4 setting for LLMs, and our ablation study in this paper validates the 1D vs 2D problem and the requantization. We wish it would address your concerns."

TheTinyTeddy avatar Nov 06 '25 02:11 TheTinyTeddy