Results 5 issues of Gerald

Hi, I see you published the 2-GPU report in the README. Could you please provide a script that uses 2 GPUs? Thanks!

Hello, here are my test logs ``` # command line: CUDA_VISIBLE_DEVICES=7 ./examples/55_hopper_mixed_dtype_gemm/55_hopper_mixed_dtype_gemm --m=16 --n=6144 --k=2048 --g=128 --mode=1 # Running results: Running in group scale mode. Disposition: Passed Problem Size: 16x6144x2048x1...

inactive-30d

# Purpose This PR is intended to support inference for the [MiniMaxText01](https://huggingface.co/MiniMaxAI/MiniMax-Text-01) model. It can run on a single machine with 8xH800 and 8xH20, where a single H800 machine can handle...

needs-rebase

**What is your question?** Great work! Could you please let me know if the word "half" [here](https://github.com/NVIDIA/cutlass/blob/main/examples/69_hopper_mixed_dtype_grouped_gemm/69_hopper_int4_bf16_grouped_gemm.cu#L123) is a typo or intentionally used with special consideration?

question
? - Needs Triage
inactive-30d
inactive-90d

Hi, I tested this file, `examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu`, and it shows about a 10%-20% performance improvement for some input sizes. Is there a plan to support the scale-with-zero-point mode? Thank you.

question
? - Needs Triage
inactive-30d