Results: 50 comments of Xin He

Our design disables per-tensor weight selection to avoid exponential growth of the search space. You can check the file `neural_compressor/adaptor/pytorch_cpu.yaml` to add it back if needed. The best recommended quantization...
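For reference, here is a minimal sketch of how weight granularity can also be steered from the Python side instead of editing the capability yaml, assuming the 2.x `PostTrainingQuantConfig` / `quantization.fit` entry points; the `"Conv2d"` key and nested dict layout are illustrative, `model` / `calib_dataloader` are placeholders for your own objects, and whether the adaptor actually accepts per-tensor weights for a given op still depends on the capability yaml mentioned above.

```python
from neural_compressor import PostTrainingQuantConfig, quantization

# Illustrative only: ask for per-tensor weight granularity on Conv2d
# instead of the default per-channel setting.
conf = PostTrainingQuantConfig(
    op_type_dict={
        "Conv2d": {
            "weight": {"granularity": ["per_tensor"]},
        }
    }
)
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```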

Hi @YIYANGCAI, I saw that nn.Conv2d and nn.Conv1d are supported in GPT. Does that mean MOE has these two op types? I previously thought that only transformer.conv1d was required.

# Motivation

SmoothQuant is a popular method to improve the accuracy of int8 quantization. Intel-extension-for-pytorch (IPEX) already supports SmoothQuant and provides good performance optimizations. Intel Neural Compressor (INC) provides...
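For context, a minimal sketch of the core SmoothQuant idea (not IPEX's or INC's actual implementation): a per-input-channel smoothing factor is folded into the weights, and the activations are divided by the same factor upstream, so the layer output is unchanged while the activation outliers become easier to quantize.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def smooth_linear(linear: nn.Linear, act_absmax: torch.Tensor, alpha: float = 0.5):
    """Fold SmoothQuant-style smoothing scales into a Linear layer.

    act_absmax: per-input-channel max |activation|, collected on calibration data.
    Returns the scales; the activation feeding this layer must be divided by them
    (usually folded into the preceding LayerNorm/Linear), so that
    (x / s) @ (W * s).T == x @ W.T.
    """
    w_absmax = linear.weight.abs().amax(dim=0)   # per-input-channel weight range, [in_features]
    scales = (act_absmax.clamp(min=1e-5) ** alpha) / (w_absmax.clamp(min=1e-5) ** (1 - alpha))
    linear.weight.mul_(scales)                   # scale each weight column: W' = W @ diag(s)
    return scales

# Toy usage with random calibration statistics for a 16 -> 8 Linear layer.
layer = nn.Linear(16, 8)
calib_act_absmax = torch.rand(16) * 10
s = smooth_linear(layer, calib_act_absmax, alpha=0.5)
x = torch.randn(4, 16)
y = (x / s) @ layer.weight.T + layer.bias        # numerically matches the un-smoothed layer
```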

After syncing in the meeting, we decided to take option 3 to ensure the flexibility of post-processing after automatic tuning.

Hi @sheegao, you can directly deploy a fake-quantized torch model with the torch [DistributedDataParallel API](https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html). I don't think this will be a problem.
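A minimal, runnable sketch of that idea, assuming a plain torch QAT fake-quant model (the toy model, the gloo backend, and launching via torchrun are my assumptions, not tied to any specific INC workflow):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.ao.quantization import get_default_qat_qconfig, prepare_qat
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Launch with: torchrun --nproc_per_node=2 this_script.py
    dist.init_process_group(backend="gloo")          # use "nccl" on GPUs
    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).train()
    model.qconfig = get_default_qat_qconfig("fbgemm")
    fake_quant_model = prepare_qat(model)            # insert fake-quant/observer modules
    ddp_model = DDP(fake_quant_model)                # DDP wraps it like any other nn.Module
    out = ddp_model(torch.randn(2, 3, 16, 16))
    out.sum().backward()                             # gradients are synchronized across ranks
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```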

I think this question should be raised with the packages that provide pipeline parallelism or tensor parallelism.

For INT8 model inference, `q_model(inputs) == q_model.model(inputs)`, I think. The int8 model is `q_model.model`. You can also use our save & load functions to get the pure int8 model.
```
q_model.save('saved_results')
fp32_model...
```
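For completeness, a hedged sketch of the whole save & load round trip, assuming the `neural_compressor.utils.pytorch.load` helper; `q_model`, `fp32_model`, and `inputs` are placeholders for your own objects:

```python
from neural_compressor.utils.pytorch import load

q_model.save('saved_results')                     # persist the tuned int8 configuration and weights
int8_model = load('saved_results', fp32_model)    # rebuild the pure int8 torch model from the fp32 one
print(int8_model(inputs))                         # should match q_model(inputs)
```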