functionstackx

Results 57 issues of functionstackx

### NVIDIA Open GPU Kernel Modules Version 550.90.07 ### Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for...

bug

Since H100s throttle power depending on the kernel, it is important to see how the TFLOPs change over time. I have this patch in my internal codebase and...
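To illustrate what tracking TFLOPs over time could look like, here is a minimal sketch. It is not the patch mentioned above: the helper names `matmul_tflops` and `tflops_over_time` are hypothetical, and the FLOP count assumes a dense matmul costing `2*m*n*k` operations. On a GPU, the timed callable would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before returning, otherwise only the kernel launch is timed.

```python
import time
from typing import Callable, List, Tuple


def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """TFLOP/s for one (m x k) @ (k x n) matmul, counting 2*m*n*k FLOPs."""
    return 2.0 * m * n * k / seconds / 1e12


def tflops_over_time(
    run_once: Callable[[], None], m: int, n: int, k: int, iters: int
) -> List[Tuple[float, float]]:
    """Time each call of run_once and return (seconds_since_start, tflops) pairs.

    Every iteration is recorded (no warmup is discarded), so a drop in
    throughput as the device heats up or throttles stays visible.
    """
    samples = []
    start = time.perf_counter()
    for _ in range(iters):
        t0 = time.perf_counter()
        run_once()  # assumed to block until the work is actually finished
        t1 = time.perf_counter()
        elapsed = max(t1 - t0, 1e-9)  # guard against a zero-length interval
        samples.append((t1 - start, matmul_tflops(m, n, k, elapsed)))
    return samples


if __name__ == "__main__":
    # CPU stand-in workload; replace with a real, synchronized matmul.
    for t, tf in tflops_over_time(lambda: sum(range(100_000)), 64, 64, 64, 5):
        print(f"t={t:.4f}s  {tf:.6f} TFLOP/s")
```

Plotting the returned pairs gives the throughput curve; a sustained decline after the first few seconds would be consistent with throttling, though other causes (clock ramp-up, contention) are possible.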

As discussed on Slack, since we are trying to find the max FLOPs for each accelerator, I changed warmup to `0`. Without any magic flags on NVIDIA drivers...

I am attempting to emit PyTorch code, but unfortunately it does not work for fp8, bf16, and int8. I have tried to patch the converter type dict: https://github.com/OrenLeung/cutlass/commit/6d619c964eb8b9c150a5f97891849d33f6ee8b64 This patch...

bug
? - Needs Triage

### 🚀 The feature, motivation and pitch - multimodal feature to benchmark offline latency, throughput and online serving for multimodal for pixtral ### Alternatives - everyone writes their own script...

feature request

On https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul the example runs fine with the existing small m, n, k, but unfortunately when I change m, n, k to 8192, I get a runtime error. Any pointers or patches...

cuSPARSELt

Hi @hongxiayang @hliuca, it seems like float8 training using `torchao.float8` is not supported at the moment. Is there a different library or code path I should be using for...

module: rocm
float8

- [x] change to `NCCL_CROSS_NIC=2`
- [x] update from very old `nccl==2.19.4` in ngc 24.01 to `nccl==2.23.4` in ngc 24.12
- [x] change to `QPS_PER_CONNECTION=1` when within the same rail...

OCA Required

Porting over a DTensor training codebase to ROCm atm and was reading through the 2D unit tests and noticed a couple of the unit tests already work on ROCm even though...

oncall: distributed
module: rocm
open source
topic: not user facing