Gerald
Awesome project. What's the challenge in implementing this feature? The project already supports the Turing architecture, and Turing and Volta seem to be similar in terms of code implementation. > Let's discuss this internally first. Based on our current planning, development of this feature can't be scheduled for April.
> The figure is not drawn to scale; it's just an illustration. > > The way we do it, softmax only has 1 MUFU (exponential). There's no floating-point division....
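To make the MUFU point concrete, here is a minimal sketch (my own illustration under assumed details, not the kernel discussed above) of a row softmax in which the exponential is the only per-element MUFU instruction and the per-element floating-point division is replaced by a single reciprocal per row applied as a multiply:

```cuda
#include <cfloat>

// Sketch only: single-threaded per-row softmax to show the instruction mix,
// not a performant kernel.
__device__ void softmax_row(const float* x, float* y, int n) {
    // Row maximum for numerical stability (compares only, no MUFU).
    float m = -FLT_MAX;
    for (int i = 0; i < n; ++i) m = fmaxf(m, x[i]);

    // One exponential per element: exp(x - m) == exp2((x - m) * log2(e)),
    // which lowers to MUFU.EX2 plus a multiply.
    float sum = 0.f;
    for (int i = 0; i < n; ++i) {
        float e = exp2f((x[i] - m) * 1.4426950408889634f);  // log2(e)
        y[i] = e;
        sum += e;
    }

    // A single reciprocal for the whole row; each element is normalized with a
    // multiply instead of a divide. In an attention kernel this rescale can be
    // deferred and fused into the epilogue.
    float inv = __frcp_rn(sum);
    for (int i = 0; i < n; ++i) y[i] *= inv;
}
```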
@IwakuraRein The core reason is likely the low utilization of the SMs. So is there any expected support for Stream-K or Split-K?
@IwakuraRein Thank you for your prompt response! I was referring to whether Stream-K or Split-K is suitable for the problem size (M=16, N=2560, K=8192) mentioned above, because of the low SM...
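To spell out the low-utilization point, here is a rough back-of-the-envelope sketch; the CTA tile size and SM count are assumptions for illustration, not the kernel's actual configuration:

```cpp
#include <cstdio>

int main() {
    int M = 16, N = 2560, K = 8192;   // problem size from the discussion above
    int tileM = 128, tileN = 128;     // assumed CTA tile
    int num_sms = 132;                // e.g. H100 SXM; adjust for your GPU

    // Without Split-K/Stream-K, parallelism is limited to the number of output tiles.
    int tiles = ((M + tileM - 1) / tileM) * ((N + tileN - 1) / tileN);
    printf("output tiles (CTAs) without Split-K: %d of %d SMs busy\n", tiles, num_sms);

    // Split-K slices each output tile into S partial sums along K, multiplying the
    // CTA count at the cost of a final reduction; Stream-K balances the same work
    // across SMs without a fixed split factor.
    for (int split = 2; split <= 8; split *= 2)
        printf("Split-K = %d -> %d CTAs (each reduces K/%d = %d)\n",
               split, tiles * split, split, K / split);
    return 0;
}
```

With a 128x128 tile, only about 20 CTAs exist for this shape, so most SMs idle unless the K=8192 dimension is also parallelized.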
Thank you for your reply; I now fully understand the fp8 scale optimization approach. > "Is this still a blocker for you?" Yes, I am in great need of...
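For anyone else reading the thread, here is a generic sketch of what per-tensor fp8 (E4M3) scaling typically looks like; it reflects my own understanding of the common recipe, not the specific optimization referenced above:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> x = {0.01f, -3.2f, 7.5f, 0.4f};

    // Per-tensor scale: map the absolute maximum onto the E4M3 max finite value (448).
    float amax = 0.f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    float scale = amax / 448.f;

    // Operands are stored scaled down into fp8 range; the GEMM runs on the scaled
    // values and the epilogue multiplies the accumulator by the scale(s) to
    // recover the original magnitude.
    for (float v : x) {
        float q = std::clamp(v / scale, -448.f, 448.f);  // value that would be cast to fp8
        printf("%7.3f -> %8.3f (dequant %7.3f)\n", v, q, q * scale);
    }
    return 0;
}
```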
> Why do you introduce `minimax_cache.py` instead of reusing `mamba_cache.py`? Because the internal data structure `self.mamba_cache` in `mamba_cache.py` is not suitable for the linear-attention cache of the MiniMaxText01 model, and...
> Could you please support the MiniMax VL model as well? I would greatly appreciate it @zwc163 Thank you for your attention. We do not have such a plan in...
> Sorry, this may be a silly question, but is the model int8-quantized to achieve the 2-million-token context with H800 TP8 inference? @zifengdexiatian Two million tokens is not...
> `cutlass::half_t` is the fp16 data type implementation in cutlass. It is defined in https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/half.h#L167 Yes. However, [this](https://github.com/NVIDIA/cutlass/blob/main/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_int4_bf16_grouped_gemm.cu#L123) is an example for int4xbf16.
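To spell out the distinction, the aliases below contrast the fp16 type with the int4 x bf16 pairing; they are a sketch for illustration, not lines copied from the linked example:

```cpp
#include "cutlass/numeric_types.h"

// fp16 operand type (what cutlass::half_t denotes).
using ElementFp16 = cutlass::half_t;

// The mixed-dtype int4 x bf16 case instead pairs a packed 4-bit signed integer
// operand with a bfloat16 operand.
using ElementQuant = cutlass::int4b_t;
using ElementMma   = cutlass::bfloat16_t;
```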
@IwakuraRein @azhurkevich Thank you for your prompt response! After adjusting the TileShape to `Shape<…>`, the performance has improved significantly. The processing time has been reduced by approximately 40%. Here are...
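For other readers, the TileShape being tuned is the per-CTA (M, N, K) tile used by the CUTLASS 3.x builders; the sketch below uses placeholder extents, not the values reported above. For a skinny-M problem, a smaller M extent generally produces more output tiles and therefore better SM occupancy:

```cpp
#include "cute/tensor.hpp"

using namespace cute;

// CTA tile extents (M, N, K); placeholder values for illustration.
using TileShape    = Shape<_64, _128, _128>;
// Threadblock cluster shape (Hopper); also a placeholder.
using ClusterShape = Shape<_1, _1, _1>;
```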