Manish Gupta issues

Results: 13 issues of Manish Gupta

WIP Breaking math shape 161616 to 2 x mma.16816

[Documentation] Fixes the confusion between concatenated vs. composed layout in CuTe documentation

This doc-only PR fixes the confusion between concatenated vs. composed layout in CuTe documentation in the examples for ComposedLayout in layout_algebra.md. Issue #1497

[DOC] CuTe documentation questions, resolving confusion, and a potential doc-only patch

1. The layout at line 208 [here](https://github.com/NVIDIA/cutlass/compare/main...manishucsd:cutlass:patch-1#diff-eb60548665e48e0dd8d9f65f793a0a7896c84b5b2365a6250e440004cd74b0faL208) does not seem to be a concatenated layout. If that is indeed true, please see the [patch](https://github.com/NVIDIA/cutlass/compare/main...manishucsd:cutlass:patch-1) and we can fix it to avoid the...

documentation

Make mainloop schedule type available as `GemmKernel::Schedule`

This PR makes mainloop schedule uniformly available for mainloop variants. Our usage of this type involves additionally having it as a part of `GemmDescription` object for our kernels and decide...

inactive-30d

[BUG] ElementC=void kernel reads non-void in `GemmDescription`

I am observing gemm_desc.C.element = `bf16` when I set it to `void`. Please use the following [debug_branch](https://github.com/NVIDIA/cutlass/commit/08d11c55797bcd2a1982edf42b04143b3ccb5ead). Please check if the below print is expected. **description_.C.element bf16 // Is this...

bug

Rowwise F8F8BF16 GEMMs - Auto-generate kernel library, auto-generated heuristics cache, add to FBGEMM quantize_bench

Summary: # Summary - Auto-generated F8F8BF16 Rowwise Scaled Kernels. - Auto-generation of Heuristic Cache. - Add to quantize_bench # Performance Improvements ## DisaggBench CUTLASS Prefill B=1 T=2048: Elapsed: 109.13ms FLOPs:...

fb-exported

cla signed

[BUG] Mixed Input H100 Kernel Hangs

# FE4M3 x BF16 Kernel Hangs when run with beta=1 Please compile the kernel `cutlass3x_sm90_tensorop_s64x128x16gemm_e4m3_bf16_f32_bf16_bf16_cvt_64x128x128_8x1x1_0_tnt_align16_warpspecialized_pingpong_epi_tma` in profiler and run it with `beta=0` and `beta=1`. It is not just the profiler,...

bug

? - Needs Triage

[FEA] Complete the cutlass::library::GemmDescription class to cover Hopper GEMM kernels

## Issue CUTLASS GemmDevice `Operator` contains compile-time attributes (functional and performance attributes). The GemmDevice `Operator` is consumed by [GemmOperation[3xBase]](https://github.com/manishucsd/cutlass/blob/3c543146d15d0f58bd8b420da09af7b1c7261963/tools/library/src/gemm_operation_3x.hpp#L74-L128). In the past, I have found some of the values in...

feature request

? - Needs Triage

[BUG] Stream-K kernel breaks for some GEMM Problem-K

## GEMM Problem Shape --m=8 --n=8192 --k=8192 Does NOT Work ``` /tools/profiler/cutlass_profiler --dist=uniform,min:-2.3,max:2.3,scale:-1 --kernels=cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem --m=8 --n=8192 --k=8192 --verification-enabled=false ============================= Problem ID: 1 Provider: CUTLASS OperationKind: gemm Operation: cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_4x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem Status:...

bug

? - Needs Triage

[BUG] Unable to run CUTLASS example 65_distributed_gemm

Using patch https://github.com/NVIDIA/cutlass/pull/2086 to compile with CUDA Toolkit 12.6.3 cmake ``` cmake -B../build -S../cutlass -DCUTLASS_NVCC_ARCHS="90a" -DCUTLASS_ENABLE_GDC_FOR_SM90=1 -- CMake Version: 3.31.4 -- CUTLASS 3.8.0 -- CUDART: /home/manish_magic_dev/sdk/cuda/12.6.3/lib64/libcudart.so -- CUDA Driver: /home/manish_magic_dev/sdk/cuda/12.6.3/lib64/stubs/libcuda.so...

bug

? - Needs Triage