Results: 10 comments of Gerald

Awesome project. What's the challenge in implementing this feature, given that the project already supports the Turing architecture and the Turing and Volta architectures seem similar in code implementation? > We'll discuss this internally first. From our current planning, development of this feature can't be scheduled for April.

> The figure is not drawn to scale, it's just an illustration. > > The way we do it, softmax only has 1 MUFU (exponential). There's no floating point division....
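To make the quoted point concrete, here is a minimal sketch (my own illustration, not the project's kernel) of a row-wise softmax that spends exactly one exponential per element (`__expf`, which maps to the MUFU/SFU unit) and normalizes with a reciprocal-multiply rather than a per-element floating-point division:

```cpp
// Hedged sketch, not the project's kernel: one block per row, blockDim.x == 32
// (a single warp) for simplicity.
#include <cuda_runtime.h>
#include <math.h>

__global__ void softmax_row(const float* __restrict__ x,
                            float* __restrict__ y, int n) {
    const float* row_in  = x + blockIdx.x * n;
    float*       row_out = y + blockIdx.x * n;

    // 1) Row max for numerical stability (warp shuffle reduction).
    float m = -INFINITY;
    for (int i = threadIdx.x; i < n; i += 32)
        m = fmaxf(m, row_in[i]);
    for (int off = 16; off > 0; off >>= 1)
        m = fmaxf(m, __shfl_xor_sync(0xffffffff, m, off));

    // 2) One exponential per element (the MUFU op); accumulate the sum as we go.
    float s = 0.f;
    for (int i = threadIdx.x; i < n; i += 32) {
        float e = __expf(row_in[i] - m);
        row_out[i] = e;                  // stash the unnormalized value
        s += e;
    }
    for (int off = 16; off > 0; off >>= 1)
        s += __shfl_xor_sync(0xffffffff, s, off);

    // 3) Normalize by multiplying with 1/s: one division per row, not per element.
    float r = 1.f / s;
    for (int i = threadIdx.x; i < n; i += 32)
        row_out[i] *= r;
}
```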

@IwakuraRein The core reason is likely the low utilization of the SMs. Is there any planned support for Stream-K or Split-K?

@IwakuraRein Thank you for your prompt response! I was referring to whether Stream-K or Split-K would be suitable for the problem size (M=16, N=2560, K=8192) mentioned above, given the low SM...
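For intuition on why this problem shape underutilizes the GPU: with M=16 there is a single row of output tiles, so a plain data-parallel launch produces far fewer CTAs than there are SMs. Split-K adds a grid dimension over K and merges partial sums afterwards. Below is a minimal sketch of the idea only (my own, with hypothetical names; CUTLASS's split-K and stream-K schedulers are considerably more sophisticated):

```cpp
// Hedged sketch of split-K: partition the K dimension across blocks so a
// skinny problem like M=16, N=2560, K=8192 launches enough CTAs to fill the
// SMs. C must be zero-initialized before launch; partial sums are combined
// with atomic adds (a fixed-order reduction over a workspace would be
// deterministic, but atomics keep the sketch short).
#include <cuda_runtime.h>

__global__ void gemm_splitk(const float* __restrict__ A,   // M x K, row-major
                            const float* __restrict__ B,   // K x N, row-major
                            float* __restrict__ C,         // M x N, row-major
                            int M, int N, int K, int split_k) {
    int row   = blockIdx.y * blockDim.y + threadIdx.y;     // index into M
    int col   = blockIdx.x * blockDim.x + threadIdx.x;     // index into N
    int slice = blockIdx.z;                                // which K-slice
    if (row >= M || col >= N) return;

    // Each z-block owns a contiguous chunk of the K dimension.
    int k_chunk = (K + split_k - 1) / split_k;
    int k_begin = slice * k_chunk;
    int k_end   = min(k_begin + k_chunk, K);

    float acc = 0.f;
    for (int k = k_begin; k < k_end; ++k)
        acc += A[row * K + k] * B[k * N + col];

    atomicAdd(&C[row * N + col], acc);
}

// Launch shape: grid.z = split_k multiplies the CTA count by split_k, e.g.
//   dim3 block(32, 8);
//   dim3 grid((N + 31) / 32, (M + 7) / 8, /*split_k=*/8);
```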

Thank you for your reply. I have fully understood the fp8 scale optimization approach. "Is this still a block for you?" Yes, I am in great need of...
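For reference, this is how I understand the fp8 scale approach; the sketch below is my own (the kernel names and the E4M3 max of 448 are assumptions, and it needs CUDA 11.8+ for `<cuda_fp8.h>`). Each tensor is quantized by a per-tensor scale so its range fits E4M3, the GEMM runs in fp8, and both scales are undone on the fp32 accumulator output:

```cpp
// Hedged sketch of per-tensor fp8 (E4M3) scaling, not the project's code.
// Quantize: q = fp8(x / s) with s = amax(|x|) / 448 (448 is E4M3's largest
// finite value), so the tensor's dynamic range maps onto fp8's.
#include <cuda_fp8.h>
#include <cuda_runtime.h>

__global__ void quantize_e4m3(const float* __restrict__ x,
                              __nv_fp8_e4m3* __restrict__ q,
                              float inv_scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        q[i] = __nv_fp8_e4m3(x[i] * inv_scale);  // rounds into fp8 range
}

// After D_q = GEMM(A_q, B_q) accumulated in float, undo both scales:
// D = (A/sa)(B/sb) * sa*sb = A*B.
__global__ void rescale(float* __restrict__ d, float sa_times_sb, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] *= sa_times_sb;
}

// Host side (sketch): s = amax(|x|) / 448.f; pass inv_scale = 1.f / s.
```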

> Why do you introduce `minimax_cache.py` instead of reusing `mamba_cache.py`? Because the internal data structure `self.mamba_cache` in `mamba_cache.py` is not suitable for the MiniMaxText01 model's linear attention cache, and...

> Could you please support the MiniMax VL model as well? I would greatly appreciate it. @zwc163 Thank you for your attention. We do not have such a plan in...

> Sorry, this may be a silly question, but is the model int8-quantized to achieve the 2-million-token context with H800 TP8 inference? @zifengdexiatian Two million tokens is not...

> `cutlass::half_t` is the fp16 data type implementation in cutlass. It is defined in https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/half.h#L167 Yes. However, [this](https://github.com/NVIDIA/cutlass/blob/main/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_int4_bf16_grouped_gemm.cu#L123) is an example of int4 x bf16 grouped GEMM.
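For anyone following the link: a mixed-dtype grouped GEMM pairs a sub-byte integer type with a 16-bit float type. A hedged sketch of the element-type declarations involved (illustrative only; see example 69 above for the authoritative configuration):

```cpp
// Hedged sketch of mixed-dtype element types; not copied from example 69.
#include <cutlass/numeric_types.h>   // cutlass::int4b_t, cutlass::bfloat16_t

using ElementA = cutlass::int4b_t;     // 4-bit signed int, packed 2 per byte
using ElementB = cutlass::bfloat16_t;  // bf16 counterpart to cutlass::half_t
using ElementC = cutlass::bfloat16_t;
using ElementAccumulator = float;      // accumulate in fp32
```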

@IwakuraRein @azhurkevich Thank you for your prompt response! After adjusting the TileShape to Shape<...>, the performance has significantly improved: the processing time has been reduced by approximately 40%. Here are...
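For context, in CUTLASS 3.x the `TileShape` is a CuTe shape `(BLK_M, BLK_N, BLK_K)`. The exact shape from my comment was lost above, so the values below only illustrate the direction of the tuning for an M=16 problem:

```cpp
// Hedged sketch: what a CUTLASS 3.x CTA tile shape looks like in CuTe terms.
// These values are illustrative, not the shape from the comment above.
#include <cute/tensor.hpp>

using namespace cute;

// With M=16, a large BLK_M mostly pads: a 256-row tile wastes 240 of its 256
// rows, so a smaller BLK_M keeps more of each CTA's work useful and yields
// more CTAs along the N dimension.
using TileShape = Shape<_64, _128, _128>;

CUTE_STATIC_ASSERT_V(size<0>(TileShape{}) == Int<64>{});  // BLK_M
```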