whatdhack comments

Results: 16 comments of whatdhack

[Examples] Adding a tensordict and TorchRL version of the PyTorch example

A simple DQN example would be beneficial to someone getting started, given that DQN is probably the very first deep RL algorithm one gets introduced to.

Meta-Llama-3-70B-Instruct running out of memory on 8 A100-40GB

What is the best way to adapt the 8 checkpoints of the 70B model, which target A100-80GB/H100, to, say, 16 A100-40GB GPUs?

Meta-Llama-3-70B-Instruct running out of memory on 8 A100-40GB

@subramen, it looks like there are more fundamental issues in adapting the 8-GPU checkpoint to any GPU count higher than 8. See the following. ` self.n_kv_heads = args.n_heads if...
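
To make the concern concrete, here is a minimal sketch of the per-rank head arithmetic. It assumes the attention layer splits query and KV heads across ranks by integer division with the model-parallel world size, and it assumes the 70B configuration has 64 query heads and 8 KV heads. The function name `local_head_counts` and its parameters are hypothetical and are not taken from the repository.

```python
def local_head_counts(n_heads: int, n_kv_heads: int, model_parallel_size: int) -> tuple[int, int]:
    """Return per-rank (query heads, KV heads), assuming plain integer division."""
    return n_heads // model_parallel_size, n_kv_heads // model_parallel_size


# Assumed 70B configuration: 64 query heads, 8 KV heads (grouped-query attention).
for mp in (8, 16):
    q_heads, kv_heads = local_head_counts(64, 8, mp)
    print(f"model_parallel_size={mp}: local q heads={q_heads}, local kv heads={kv_heads}")
```

Under these assumptions, 8 ranks get one KV head each, while 16 ranks get zero, so going beyond 8 GPUs would need KV heads to be replicated across ranks rather than divided.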

Is there any sgemm example ( e.g. fp32) ?

It looks like it needs to be modified to report some metrics, the way the bf16TensorCoreGemm example does.
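
One metric that could be added is throughput. The sketch below uses the usual convention of counting 2·M·N·K floating-point operations for an M×N×K GEMM; the helper name `gemm_tflops` is hypothetical, and the elapsed time is assumed to come from whatever timer the sample already uses (for example, CUDA events).

```python
def gemm_tflops(m: int, n: int, k: int, elapsed_ms: float) -> float:
    """Throughput in TFLOP/s for one M x N x K GEMM, counting 2*M*N*K operations."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    flops = 2.0 * m * n * k          # one multiply and one add per inner-product term
    seconds = elapsed_ms * 1e-3
    return flops / seconds / 1e12


# Example: a 4096 x 4096 x 4096 GEMM measured at 10 ms per launch.
print(f"{gemm_tflops(4096, 4096, 4096, 10.0):.2f} TFLOP/s")
```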

[BUG] cutlass.cute.nvgpu.common.OpError: OpError: expects arch to be one of ['sm_100a', 'sm_100f'], but got sm_121a

After forcing cute to ignore the architecture checks [1](https://github.com/NVIDIA/cutlass/blob/e67e63c331d6e4b729047c95cf6b92c8454cba89/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py#L166) and [2](https://github.com/NVIDIA/cutlass/blob/e67e63c331d6e4b729047c95cf6b92c8454cba89/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/copy.py#L117), I hit the following MLIR issue. So it looks like tcgen05 is not supported on DGX Spark. Is...

[BUG] cutlass.cute.nvgpu.common.OpError: OpError: expects arch to be one of ['sm_100a', 'sm_100f'], but got sm_121a

Any comment or update on this?