Quantized model inference on CPU is slower than the non-quantized model.
I trained a quantized transformer model in a CPU environment and ran inference in a CPU environment, adding `--quantization-config-path` to the `fairseq-train` command. However, CPU inference with the quantized model is 3 TIMES slower than with the non-quantized model, which was trained in a GPU environment and also runs inference on CPU. My task is translation. The quantization config is:

```yaml
n_centroids:
  Linear:
    key: in_features
    value: {"*": 256}
  Embedding:
    key: embedding_dim
    value: {"*": 256}
block_sizes:
  Linear:
    key: fuzzy_name
    value: {fc: 8, attn: 4, emb: 4}
  Embedding:
    key: fuzzy_name
    value: {emb: 8}
layers_to_quantize:
  - decoder\.layers\.\d+\.fc[12]
  - decoder\.embed_tokens\.embeddings\.[012]\.[01]
  - decoder\.layers\.\d+\.self_attn\.(k_proj|v_proj|q_proj|out_proj)
```
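For completeness, the training run looked roughly like the sketch below. The data path, config filename, architecture, and optimizer settings here are placeholders for illustration; only `--quantization-config-path` and `--cpu` reflect the setup described above.

```bash
# Sketch of the CPU training run with the PQ quantization config enabled.
# data-bin/my-translation and pq_config.yaml are hypothetical paths.
fairseq-train data-bin/my-translation \
    --task translation \
    --arch transformer \
    --quantization-config-path pq_config.yaml \
    --cpu \
    --optimizer adam --lr 0.0005 \
    --max-tokens 4096
```

The 3x gap can be reproduced by timing CPU decoding of the same test set with both checkpoints, along the lines of the following sketch (paths and decoding hyperparameters are again placeholders):

```bash
# Sketch of the timing comparison: decode the same data on CPU with each model.
time fairseq-generate data-bin/my-translation \
    --path checkpoints/quantized/checkpoint_best.pt \
    --cpu --beam 5 --batch-size 64

time fairseq-generate data-bin/my-translation \
    --path checkpoints/baseline/checkpoint_best.pt \
    --cpu --beam 5 --batch-size 64
```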
- fairseq version: 0.12.2
- PyTorch version: 1.12.0