Raymond

Results 10 issues of Raymond

During inference, does the Power-of-Two Factor need to be calculated dynamically? Is it time-consuming?

I am wondering why your paper uses the latent full-precision weights, rather than the binarized weights, to calculate information entropy. Considering the latent weights seems to make no sense.

Hello, it seems your KD_loss function doesn't have a temperature hyperparameter. Is that because the default temperature=1 works?
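For context on the question above, here is a minimal sketch of a distillation loss with an explicit temperature, written in plain Python. It is not the repository's KD_loss; the names `softmax` and `kd_loss` and the `temperature ** 2` rescaling (a common convention) are assumptions for illustration. With temperature=1 it reduces to the plain KL divergence between teacher and student distributions.

```python
import math

def softmax(logits, temperature=1.0):
    # Divide logits by the temperature, then apply a numerically stable softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    # KL(teacher || student) on temperature-softened distributions.
    # Assumption: the loss is rescaled by temperature**2, a common convention
    # that keeps gradient magnitudes comparable across temperatures.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    return kl * temperature ** 2

if __name__ == "__main__":
    student = [1.0, 0.5, -0.5]
    teacher = [2.0, 0.0, -1.0]
    print(kd_loss(student, teacher, temperature=1.0))
    print(kd_loss(student, teacher, temperature=4.0))
```

A higher temperature flattens both distributions, so the student is also trained on the teacher's relative ranking of the non-target classes.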

Is there a trick that can solve the problem, or is it a mistake?

Hello, if I want a global attention map, how should I go about it? Your demo shows the attention map for one particular grid.

Interesting work. Since some salient parameters have not been binarized, I am curious about the practical speedup in comparison to floating-point models. Do you utilize some GPU kernel to accelerate...

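Great work. I am curious about one thing: the advantage of pure integer quantization is speed, but there seems to be no low-level kernel support, and the computation is still done in full precision (TVM), so the practical value of integer quantization is not realized. Judging from the data in the paper, the actual latency is not much better than FasterTransformer.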

Hi, I tried to reproduce the classification accuracy using this code. The results match your paper except for swin-base. I only get 68.50%, and there is a 10% gap with...

According to https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization, can I define my own model and calibration process and then simply use modelopt.torch.quantization.quantize()?

question
stale

Hi, in uniform quantization we can do xq = [x/s] + offset and \hat{x} = (xq - offset) * s. However, in NF4 quantization, we need to find the nearest...
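To make the contrast in the question concrete, here is a small sketch in plain Python. The uniform functions follow the formulas quoted above; the codebook functions show a generic nearest-entry lookup. The function names and the five-entry codebook are hypothetical and chosen only for illustration; this is not the actual NF4 table or any library's implementation.

```python
def quantize_uniform(x, scale, offset):
    # xq = [x / s] + offset, with [.] taken as round-to-nearest.
    # Note: Python's round() resolves exact .5 ties to the nearest even integer.
    return round(x / scale) + offset

def dequantize_uniform(xq, scale, offset):
    # x_hat = (xq - offset) * s
    return (xq - offset) * scale

def quantize_codebook(x, codebook):
    # Non-uniform case: return the index of the codebook entry closest to x.
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))

def dequantize_codebook(index, codebook):
    # Dequantization is a plain table lookup.
    return codebook[index]

if __name__ == "__main__":
    xq = quantize_uniform(0.5, scale=0.25, offset=8)
    print(xq, dequantize_uniform(xq, scale=0.25, offset=8))

    # Hypothetical non-uniform codebook, NOT the real NF4 values.
    codebook = [-1.0, -0.3, 0.0, 0.2, 1.0]
    idx = quantize_codebook(0.3, codebook)
    print(idx, dequantize_codebook(idx, codebook))
```

The uniform path needs only a division and a rounding step, while the codebook path needs a search over the entries (or an equivalent set of threshold comparisons) for every value.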