Results: 10 comments of Gerald

Awesome project. What's the challenge in implementing this feature, given that the project already supports the Turing architecture and the Turing and Volta architectures seem similar in code implementation? > We'll discuss this internally first. From our current planning, development of this feature can't be scheduled for April.

> The figure is not drawn to scale, it's just an illustration. > > The way we do it, softmax only has 1 MUFU (exponential). There's no floating point division....
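To make the quoted point concrete, here is a minimal sketch (my own illustration, not the project's kernel) of a row-wise softmax that spends exactly one exponential per element (`__expf`, which maps to the MUFU/SFU unit) and normalizes with a reciprocal-multiply rather than a per-element floating-point division:

```cpp
// Hedged sketch, not the project's kernel: one block per row, blockDim.x == 32
// (a single warp) for simplicity.
#include <cuda_runtime.h>
#include <math.h>

__global__ void softmax_row(const float* __restrict__ x,
                            float* __restrict__ y, int n) {
    const float* row_in  = x + blockIdx.x * n;
    float*       row_out = y + blockIdx.x * n;

    // 1) Row max for numerical stability (warp shuffle reduction).
    float m = -INFINITY;
    for (int i = threadIdx.x; i < n; i += 32)
        m = fmaxf(m, row_in[i]);
    for (int off = 16; off > 0; off >>= 1)
        m = fmaxf(m, __shfl_xor_sync(0xffffffff, m, off));

    // 2) One exponential per element (the MUFU op); accumulate the sum as we go.
    float s = 0.f;
    for (int i = threadIdx.x; i < n; i += 32) {
        float e = __expf(row_in[i] - m);
        row_out[i] = e;                  // stash the unnormalized value
        s += e;
    }
    for (int off = 16; off > 0; off >>= 1)
        s += __shfl_xor_sync(0xffffffff, s, off);

    // 3) Normalize by multiplying with 1/s: one division per row, not per element.
    float r = 1.f / s;
    for (int i = threadIdx.x; i < n; i += 32)
        row_out[i] *= r;
}
```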

@IwakuraRein The core reason is likely the low utilization of the SMs. Is there any planned support for Stream-K or Split-K?

@IwakuraRein Thank you for your prompt response! I was referring to whether Stream-K or Split-K would be suitable for the problem size (M=16, N=2560, K=8192) mentioned above, given the low SM...
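For intuition on why this problem shape underutilizes the GPU: with M=16 there is a single row of output tiles, so a plain data-parallel launch produces far fewer CTAs than there are SMs. Split-K adds a grid dimension over K and merges partial sums afterwards. Below is a minimal sketch of the idea only (my own, with hypothetical names; CUTLASS's split-K and stream-K schedulers are considerably more sophisticated):

```cpp
// Hedged sketch of split-K: partition the K dimension across blocks so a
// skinny problem like M=16, N=2560, K=8192 launches enough CTAs to fill the
// SMs. C must be zero-initialized before launch; partial sums are combined
// with atomic adds (a fixed-order reduction over a workspace would be
// deterministic, but atomics keep the sketch short).
#include <cuda_runtime.h>

__global__ void gemm_splitk(const float* __restrict__ A,   // M x K, row-major
                            const float* __restrict__ B,   // K x N, row-major
                            float* __restrict__ C,         // M x N, row-major
                            int M, int N, int K, int split_k) {
    int row   = blockIdx.y * blockDim.y + threadIdx.y;     // index into M
    int col   = blockIdx.x * blockDim.x + threadIdx.x;     // index into N
    int slice = blockIdx.z;                                // which K-slice
    if (row >= M || col >= N) return;

    // Each z-block owns a contiguous chunk of the K dimension.
    int k_chunk = (K + split_k - 1) / split_k;
    int k_begin = slice * k_chunk;
    int k_end   = min(k_begin + k_chunk, K);

    float acc = 0.f;
    for (int k = k_begin; k < k_end; ++k)
        acc += A[row * K + k] * B[k * N + col];

    atomicAdd(&C[row * N + col], acc);
}

// Launch shape: grid.z = split_k multiplies the CTA count by split_k, e.g.
//   dim3 block(32, 8);
//   dim3 grid((N + 31) / 32, (M + 7) / 8, /*split_k=*/8);
```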

Thank you for your reply. I have fully understood the fp8 scale optimization approach. "Is this still a block for you?" Yes, I am in great need of...
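For reference, this is how I understand the fp8 scale approach; the sketch below is my own (the kernel names and the E4M3 max of 448 are assumptions, and it needs CUDA 11.8+ for `<cuda_fp8.h>`). Each tensor is quantized by a per-tensor scale so its range fits E4M3, the GEMM runs in fp8, and both scales are undone on the fp32 accumulator output:

```cpp
// Hedged sketch of per-tensor fp8 (E4M3) scaling, not the project's code.
// Quantize: q = fp8(x / s) with s = amax(|x|) / 448 (448 is E4M3's largest
// finite value), so the tensor's dynamic range maps onto fp8's.
#include <cuda_fp8.h>
#include <cuda_runtime.h>

__global__ void quantize_e4m3(const float* __restrict__ x,
                              __nv_fp8_e4m3* __restrict__ q,
                              float inv_scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        q[i] = __nv_fp8_e4m3(x[i] * inv_scale);  // rounds into fp8 range
}

// After D_q = GEMM(A_q, B_q) accumulated in float, undo both scales:
// D = (A/sa)(B/sb) * sa*sb = A*B.
__global__ void rescale(float* __restrict__ d, float sa_times_sb, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] *= sa_times_sb;
}

// Host side (sketch): s = amax(|x|) / 448.f; pass inv_scale = 1.f / s.
```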

> Why do you introduce `minimax_cache.py` instead of reusing `mamba_cache.py`? Because the internal data structure `self.mamba_cache` in `mamba_cache.py` is not suitable for the MiniMaxText01 model's linear attention cache, and...

> Could you please support the MiniMax VL model as well? I would greatly appreciate it. @zwc163 Thank you for your attention. We do not have such a plan in...

> Sorry, this may be a silly question, but is the model int8-quantized to achieve the 2-million-token context with H800 TP8 inference? @zifengdexiatian Two million tokens is not...

> `cutlass::half_t` is the fp16 data type implementation in cutlass. It is defined in https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/half.h#L167 Yes. However, [this](https://github.com/NVIDIA/cutlass/blob/main/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_int4_bf16_grouped_gemm.cu#L123) is an example of int4 x bf16 grouped GEMM.
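For anyone following the link: a mixed-dtype grouped GEMM pairs a sub-byte integer type with a 16-bit float type. A hedged sketch of the element-type declarations involved (illustrative only; see example 69 above for the authoritative configuration):

```cpp
// Hedged sketch of mixed-dtype element types; not copied from example 69.
#include <cutlass/numeric_types.h>   // cutlass::int4b_t, cutlass::bfloat16_t

using ElementA = cutlass::int4b_t;     // 4-bit signed int, packed 2 per byte
using ElementB = cutlass::bfloat16_t;  // bf16 counterpart to cutlass::half_t
using ElementC = cutlass::bfloat16_t;
using ElementAccumulator = float;      // accumulate in fp32
```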

@IwakuraRein @azhurkevich Thank you for your prompt response! After adjusting the TileShape to Shape<...>, the performance has significantly improved: the processing time has been reduced by approximately 40%. Here are...
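For context, in CUTLASS 3.x the `TileShape` is a CuTe shape `(BLK_M, BLK_N, BLK_K)`. The exact shape from my comment was lost above, so the values below only illustrate the direction of the tuning for an M=16 problem:

```cpp
// Hedged sketch: what a CUTLASS 3.x CTA tile shape looks like in CuTe terms.
// These values are illustrative, not the shape from the comment above.
#include <cute/tensor.hpp>

using namespace cute;

// With M=16, a large BLK_M mostly pads: a 256-row tile wastes 240 of its 256
// rows, so a smaller BLK_M keeps more of each CTA's work useful and yields
// more CTAs along the N dimension.
using TileShape = Shape<_64, _128, _128>;

CUTE_STATIC_ASSERT_V(size<0>(TileShape{}) == Int<64>{});  // BLK_M
```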