Raymond issues

Results 10 issues of Raymond

How to calculate the Power-of-Two Factor in Eq. 11?

During inference, does the Power-of-Two Factor need to be calculated dynamically? Is it time-consuming?

A small question about information entropy

I am wondering why your paper uses the latent full-precision weights rather than the binarized weights to calculate information entropy. Using the latent weights does not seem to make sense.

Does KD_Loss need temperature?

Hello, it seems your KD_loss function doesn't have a temperature hyperparameter. Is that because the default temperature of 1 works?
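For context, here is a minimal sketch of a distillation loss with a temperature, written in plain Python so it runs without any framework. It is not the repository's KD_loss: the names `softmax` and `kd_loss` and the `temperature ** 2` scaling (the convention from Hinton et al.'s distillation formulation) are assumptions made for illustration. With `temperature=1.0` it reduces to a plain KL divergence between the teacher and student distributions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature ** 2 factor is an assumed convention that keeps the
    gradient magnitude comparable across temperatures.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_t, p_s) if pt > 0)
    return kl * temperature ** 2

teacher = [3.0, 1.0, 0.2]
student = [2.0, 1.5, 0.1]
loss_t1 = kd_loss(student, teacher)                   # temperature = 1
loss_t4 = kd_loss(student, teacher, temperature=4.0)  # softer targets
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative ranking of the non-target classes.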

binaryQuantize(): torch.sign returns three values (-1, 0, 1)

Is there a trick that solves this problem, or is it a mistake?
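The behaviour in question can be illustrated without PyTorch. The sketch below mirrors the convention of torch.sign, where the sign of zero is zero, and contrasts it with a strictly two-valued variant. The helper names `sign` and `binarize` are hypothetical, and mapping zero to +1 is an arbitrary choice, not necessarily what the repository intends.

```python
def sign(x):
    """Three-valued sign, same convention as torch.sign: sign(0) is 0."""
    return (x > 0) - (x < 0)

def binarize(x):
    """Two-valued variant: zero is assigned to +1 (an arbitrary choice)."""
    return 1 if x >= 0 else -1

weights = [-0.7, 0.0, 0.3]
three_valued = [sign(w) for w in weights]     # contains a 0 for the zero weight
two_valued = [binarize(w) for w in weights]   # only -1 and +1
```

In PyTorch, one way to get the same two-valued result would be `torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))`. In practice an exact zero is rare for floating-point latent weights, which may be why the issue goes unnoticed.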

A question about grid attention

Hello, if I want a global attention map, how should I go about it? Your demo shows the attention map for a specific grid.

What is the practical speedup?

Interesting work. Since some salient parameters have not been binarized, I am curious about the practical speedup compared to floating-point models. Do you utilize some GPU kernel to accelerate...

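On low-level support for int quantization

Great work. I am curious: the advantage of pure int quantization is speed, but there seems to be no low-level kernel support, and computation is still done in full precision (TVM), so the practical value of int quantization is not realized. Judging from the data in the paper, the actual latency does not improve much over FasterTransformer.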

Reproduced performance is poor on swin-base

Hi, I tried to reproduce the classification accuracy using this code. The results match your paper except for swin-base. I only get 68.50%, and there is a 10% gap with...

How to quantize custom models, such as LVM?

According to https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization, can I define my model and calibration process and then simply use modelopt.torch.quantization.quantize()?

question

stale

How to implement normal float NF4?

Hi, in uniform quantization we can do xq = [x/s] + offset and \hat{x} = (xq - offset) * s. However, in NF4 quantization, we need to find the nearest...
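The contrast between the two schemes can be sketched in plain Python. The uniform pair follows the formulas quoted above. The codebook pair is a generic nearest-level lookup with absmax scaling, which is one common way to realize a non-uniform format. All function names are hypothetical, and `EXAMPLE_LEVELS` is an illustrative table, not the real NF4 table, which has 16 levels.

```python
def quantize_uniform(x, scale, offset):
    """xq = [x / s] + offset, where [.] rounds to the nearest integer."""
    return round(x / scale) + offset  # Python rounds exact halves to even

def dequantize_uniform(xq, scale, offset):
    """x_hat = (xq - offset) * s"""
    return (xq - offset) * scale

def quantize_codebook(values, levels):
    """Normalize by the absolute maximum, then store the index of the
    nearest codebook level for each value."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    indices = [
        min(range(len(levels)), key=lambda i: abs(v / absmax - levels[i]))
        for v in values
    ]
    return indices, absmax

def dequantize_codebook(indices, absmax, levels):
    """Look each index up in the codebook and undo the normalization."""
    return [levels[i] * absmax for i in indices]

# Illustrative non-uniform levels in [-1, 1]; NOT the actual NF4 values.
EXAMPLE_LEVELS = [-1.0, -0.5, -0.2, 0.0, 0.2, 0.5, 1.0]

block = [2.0, -1.0, 0.1, 0.5]
indices, absmax = quantize_codebook(block, EXAMPLE_LEVELS)
restored = dequantize_codebook(indices, absmax, EXAMPLE_LEVELS)
```

The linear scan is fine for a sketch. Because the levels are sorted, a real implementation would more likely use a binary search or comparisons against the midpoints between adjacent levels.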