Per-token dynamic quantization
This PR adds support for per-token dynamic quantization. Quantization scales and zero points are computed on the fly for each incoming tensor, with one scale and one zero point per token. The PR is motivated by the difficulty of quantizing activations for some LLMs, such as Llama2.
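To illustrate the idea (this is a minimal sketch, not the code in this PR): for an activation tensor whose last dimension is the hidden size, each token (row) gets its own asymmetric scale and zero point computed from that row's min/max. The function name `per_token_qparams` and the default int8 range are assumptions for the example.

```python
import torch

def per_token_qparams(x: torch.Tensor, quant_min: int = -128, quant_max: int = 127):
    """Illustrative sketch: one (scale, zero_point) pair per token,
    i.e. per row of the last dimension."""
    # Reduce over the hidden dimension, keeping dims so the results broadcast back over x.
    min_val = x.amin(dim=-1, keepdim=True)
    max_val = x.amax(dim=-1, keepdim=True)
    # Make sure the range contains zero so the zero point is representable.
    min_val = torch.clamp(min_val, max=0.0)
    max_val = torch.clamp(max_val, min=0.0)
    scale = (max_val - min_val) / float(quant_max - quant_min)
    scale = torch.clamp(scale, min=torch.finfo(x.dtype).eps)
    zero_point = quant_min - torch.round(min_val / scale)
    zero_point = torch.clamp(zero_point, quant_min, quant_max)
    return scale, zero_point
```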
I attempted to stay in line with the other quantization schemes supported by PyTorch. To that end, I created a new observer (PerTokenDynamicObserver) and a new fake-quantization module (DynamicFakeQuantize), which inherit from ObserverBase and FakeQuantizeBase, respectively.
A new observer is needed because no existing observer supports per-token computation of quantization scales (only per-tensor or per-channel). The new fake-quantize module is likewise needed in order to execute per-token fake quantization.
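As a rough sketch of how such a module could behave (the actual DynamicFakeQuantize in this PR may differ; the class name below and the reuse of the `per_token_qparams` helper from the sketch above are assumptions), the forward pass recomputes quantization parameters on every call rather than using stored statistics:

```python
import torch

class PerTokenDynamicFakeQuantSketch(torch.nn.Module):
    """Hypothetical sketch of a per-token dynamic fake-quantize forward pass."""

    def __init__(self, quant_min: int = -128, quant_max: int = 127):
        super().__init__()
        self.quant_min = quant_min
        self.quant_max = quant_max

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Dynamic": qparams are recomputed on every call, one pair per token.
        scale, zero_point = per_token_qparams(x, self.quant_min, self.quant_max)
        # Quantize-dequantize round trip. A real implementation would use a
        # straight-through estimator (or torch.fake_quantize_* ops) so that
        # gradients flow through the rounding step during QAT.
        q = torch.clamp(torch.round(x / scale) + zero_point, self.quant_min, self.quant_max)
        return (q - zero_point) * scale
```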