Results: 50 comments of Xin He

Our design disables per-tensor weight selection to avoid exponential growth of the search space. You can check the file `neural_compressor/adaptor/pytorch_cpu.yaml` to add it back if needed. The best recommended quantization...
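For reference, here is a minimal sketch of how weight granularity can also be steered from the Python side instead of editing the capability yaml, assuming the 2.x `PostTrainingQuantConfig` / `quantization.fit` entry points; the `"Conv2d"` key and nested dict layout are illustrative, `model` / `calib_dataloader` are placeholders for your own objects, and whether the adaptor actually accepts per-tensor weights for a given op still depends on the capability yaml mentioned above.

```python
from neural_compressor import PostTrainingQuantConfig, quantization

# Illustrative only: ask for per-tensor weight granularity on Conv2d
# instead of the default per-channel setting.
conf = PostTrainingQuantConfig(
    op_type_dict={
        "Conv2d": {
            "weight": {"granularity": ["per_tensor"]},
        }
    }
)
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```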

Hi @YIYANGCAI, I saw that nn.Conv2d and nn.Conv1d are supported in GPT. Does that mean MOE has these two op types? I previously thought that only transformer.conv1d was required.

# Motivation

SmoothQuant is a popular method to improve the accuracy of int8 quantization. Intel-extension-for-pytorch (IPEX) already supports SmoothQuant and provides good performance optimizations. Intel Neural Compressor (INC) provides...
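For context, a minimal sketch of the core SmoothQuant idea (not IPEX's or INC's actual implementation): a per-input-channel smoothing factor is folded into the weights, and the activations are divided by the same factor upstream, so the layer output is unchanged while the activation outliers become easier to quantize.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def smooth_linear(linear: nn.Linear, act_absmax: torch.Tensor, alpha: float = 0.5):
    """Fold SmoothQuant-style smoothing scales into a Linear layer.

    act_absmax: per-input-channel max |activation|, collected on calibration data.
    Returns the scales; the activation feeding this layer must be divided by them
    (usually folded into the preceding LayerNorm/Linear), so that
    (x / s) @ (W * s).T == x @ W.T.
    """
    w_absmax = linear.weight.abs().amax(dim=0)   # per-input-channel weight range, [in_features]
    scales = (act_absmax.clamp(min=1e-5) ** alpha) / (w_absmax.clamp(min=1e-5) ** (1 - alpha))
    linear.weight.mul_(scales)                   # scale each weight column: W' = W @ diag(s)
    return scales

# Toy usage with random calibration statistics for a 16 -> 8 Linear layer.
layer = nn.Linear(16, 8)
calib_act_absmax = torch.rand(16) * 10
s = smooth_linear(layer, calib_act_absmax, alpha=0.5)
x = torch.randn(4, 16)
y = (x / s) @ layer.weight.T + layer.bias        # numerically matches the un-smoothed layer
```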

After syncing in the meeting, we decided to take option 3 to ensure the flexibility of post-processing after automatic tuning.

Hi @sheegao, you can directly deploy a fake-quantized torch model with the torch [DistributedDataParallel API](https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html). I don't think this will be a problem.
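A minimal, runnable sketch of that idea, assuming a plain torch QAT fake-quant model (the toy model, the gloo backend, and launching via torchrun are my assumptions, not tied to any specific INC workflow):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.ao.quantization import get_default_qat_qconfig, prepare_qat
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Launch with: torchrun --nproc_per_node=2 this_script.py
    dist.init_process_group(backend="gloo")          # use "nccl" on GPUs
    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).train()
    model.qconfig = get_default_qat_qconfig("fbgemm")
    fake_quant_model = prepare_qat(model)            # insert fake-quant/observer modules
    ddp_model = DDP(fake_quant_model)                # DDP wraps it like any other nn.Module
    out = ddp_model(torch.randn(2, 3, 16, 16))
    out.sum().backward()                             # gradients are synchronized across ranks
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```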

I think this question should be raised with the packages that provide pipeline parallelism or tensor parallelism.

For INT8 model inference, `q_model(inputs) == q_model.model(inputs)`, I think. The int8 model is `q_model.model`. You can also use our save & load functions to get the pure int8 model.
```
q_model.save('saved_results')
fp32_model...
```
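For completeness, a hedged sketch of the whole save & load round trip, assuming the `neural_compressor.utils.pytorch.load` helper; `q_model`, `fp32_model`, and `inputs` are placeholders for your own objects:

```python
from neural_compressor.utils.pytorch import load

q_model.save('saved_results')                     # persist the tuned int8 configuration and weights
int8_model = load('saved_results', fp32_model)    # rebuild the pure int8 torch model from the fp32 one
print(int8_model(inputs))                         # should match q_model(inputs)
```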