
lm_head is not converted to QuantLinear with MXFP4/8

Open xin3he opened this issue 2 months ago • 14 comments

lm_head quantization still has some issues:

  • needs a deepcopy if tied_word_embedding = True
  • export is not applied to lm_head

Shall we warn users that lm_head is not supported? @WeiweiZhang1 @wenhuach21

xin3he avatar Nov 17 '25 05:11 xin3he

BTW, AFAIK, QuantLinear for MXFP4/8 has no forward function, which may confuse users about how to use it. Do we plan to support it?

xin3he avatar Nov 17 '25 05:11 xin3he

If tied_word_embedding = True, lm_head quant is disabled by default. What's the issue? What do you mean by "QuantLinear for MXFP4/8 has no forward function"?

wenhuach21 avatar Nov 17 '25 05:11 wenhuach21

If a user prefers to quantize lm_head, what's the solution?

[screenshot]

xin3he avatar Nov 17 '25 05:11 xin3he
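
(For reference, a rough sketch of what requesting lm_head quantization might look like with the Python API. The `quant_lm_head` flag and the per-layer bits option are the two approaches mentioned later in this thread; exact argument names may differ across auto-round versions, and the scheme arguments are omitted.)

```python
# Illustrative sketch only: quant_lm_head and layer_config follow the options
# named later in this thread and may differ by auto-round version.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3-8B"  # example model used later in this thread
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    quant_lm_head=True,                       # option 1: enable lm_head quantization
    # layer_config={"lm_head": {"bits": 8}},  # option 2: set bits for lm_head explicitly
    # ... plus whatever MXFP4/MXFP8 scheme arguments you normally pass
)
autoround.quantize_and_save("./qwen3-8b-mxfp4")
```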

How do you run the model? Why does QuantLinear have no forward?

wenhuach21 avatar Nov 17 '25 05:11 wenhuach21

It's not implemented in AutoRound: https://github.com/intel/auto-round/blob/8d8a1cd5daaf6e8c71d079eccaec3092fa9af4f1/auto_round/export/export_to_autoround/qlinear_fp.py#L61

xin3he avatar Nov 17 '25 05:11 xin3he
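
(As a generic aside, not specific to any auto-round class: a layer type that does not override forward falls back to torch.nn.Module.forward, which raises when called. A minimal check looks like this.)

```python
import torch

def defines_forward(cls) -> bool:
    # True if the class (or one of its non-Module bases) overrides forward;
    # False if it inherits the unimplemented torch.nn.Module.forward.
    return cls.forward is not torch.nn.Module.forward
```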

How do you use the model? Please attach the cmd. After packing and saving, the model should be reloaded, and this MXFP4QuantLinear layer should be called.

wenhuach21 avatar Nov 17 '25 05:11 wenhuach21

I was running the model directly after quantize_and_save() and only just became aware that AutoRound requires reloading before inference.

xin3he avatar Nov 17 '25 05:11 xin3he
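
(A minimal sketch of the save-then-reload flow described above: pack and export first, then reload the checkpoint before inference. `autoround` is the AutoRound object from the earlier sketch, and the path is a placeholder.)

```python
# Pack and export the quantized model, then reload the exported checkpoint
# before running inference; running the in-memory model right after
# quantize_and_save() is what hit the missing-forward issue above.
from transformers import AutoModelForCausalLM, AutoTokenizer

save_dir = "./qwen3-8b-mxfp4"               # placeholder path
autoround.quantize_and_save(save_dir)

model = AutoModelForCausalLM.from_pretrained(save_dir, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(save_dir)
inputs = tokenizer("Hello", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```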

I tried Qwen3-8B, which does not use tied_word_embedding, and the lm_head is still not quantized. I noticed that the quantization progress bar includes this op, but module replacement is not applied.

[screenshot]

xin3he avatar Nov 17 '25 05:11 xin3he

Do you enable quant_lm_head or set bits for lm_head?

wenhuach21 avatar Nov 17 '25 05:11 wenhuach21

[screenshot]

xin3he avatar Nov 17 '25 05:11 xin3he

Do we plan to support lm_head quantization when tied_word_embedding=True?

xin3he avatar Nov 17 '25 06:11 xin3he

[screenshot]

After reloading, I saw that lm_head is quantized; I'm not sure what is happening. @WeiweiZhang1 Do you have any comments? Do you think it's a bug, or is it by design?

xin3he avatar Nov 17 '25 06:11 xin3he
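
(A quick way to see what the reload produced, assuming the checkpoint was saved as in the sketches above: inspect the runtime class of lm_head and whether the weights are tied.)

```python
# Check what lm_head became after reloading the exported checkpoint.
from transformers import AutoModelForCausalLM

reloaded = AutoModelForCausalLM.from_pretrained("./qwen3-8b-mxfp4", torch_dtype="auto")
print(type(reloaded.get_output_embeddings()))  # e.g. a QuantLinear subclass if converted
print(reloaded.config.tie_word_embeddings)     # whether lm_head shares weights with the embedding
```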

At the very least, we need to warn users that Xin's usage is not supported.

wenhuach21 avatar Nov 17 '25 06:11 wenhuach21

[screenshot] For NVFP4, lm_head quantization hits an assert error during exporting.

xin3he avatar Nov 17 '25 07:11 xin3he