Ahmad Kiswani comments

12 comments by Ahmad Kiswani

train_ngp_nerf_occ.py: RuntimeError: CUDA error: invalid configuration argument

I've encountered the same problem, and after two days of debugging, I believe I've figured it out. The error is related to neither the GPU model nor the CUDA version....

run stable diffusion see no space left on device error

The base Docker image `nvcr.io/nvidia/pytorch:22.12-py3` is over 18 GB. You can use `docker info` to check where Docker stores the images (`/var/lib/docker/overlay2` on Debian-based systems), but I can see you have...
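As a rough pre-check before pulling an image of that size, a short Python sketch can report the free space on the filesystem that holds Docker's data. It assumes the default data root `/var/lib/docker`; if `docker info` reports a different "Docker Root Dir", pass that path instead. The 18 GB figure is only a lower bound, since a pull also needs room for the compressed layers while they are extracted.

```python
import shutil
import sys

# Assumption: default Docker data root on Debian-based hosts.
DEFAULT_DOCKER_ROOT = "/var/lib/docker"


def free_gb(path: str) -> float:
    """Return free space, in decimal GB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for_image(path: str = DEFAULT_DOCKER_ROOT, required_gb: float = 18.0) -> bool:
    """Rough lower-bound check: is at least `required_gb` free where images live?"""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else DEFAULT_DOCKER_ROOT
    print(f"{root}: {free_gb(root):.1f} GB free; room for an ~18 GB image: {has_room_for_image(root)}")
```

Run it with the path from `docker info`, e.g. `python check_space.py /var/lib/docker`.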

Add option distributed_size to MegatronDistributedFusedAdam

Not stale. @gautham-kollu, what are the next steps to merge the PR, given that it's already approved?

[SD] install rclone from upstream (fixes issue #751)

should close https://github.com/mlcommons/training/issues/751

feat: Support qwen3-next, mcore path

With https://github.com/terrykong/Megatron-LM/commit/0d401602bf48046683adfc2542a70613f6e772e6, and after https://github.com/NVIDIA-NeMo/RL/pull/1541 is merged, I'll rebase this PR, which should reduce it to just configs and tests.

feat: Support qwen3-next, mcore path

@terrykong, ready for review. The "Submodule Fast-Forward" failure is probably because the currently used Automodel commit `a2db048383cd54b3fafc928df4c30bf7bbf7c430` is not part of the `nemo-rl-submodule` branch specified in `.gitmodules`. We...
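One way to check this locally is to read the `branch` setting from `.gitmodules` and ask git whether the pinned commit is reachable from that branch. This is a sketch under stated assumptions, not how the "Submodule Fast-Forward" CI job is actually implemented: it assumes the submodule's remote is named `origin` and has already been fetched, and `repo_dir` should point at the submodule checkout. The submodule name in the usage example is hypothetical.

```python
import re
import subprocess
from typing import Dict, Optional

HEADER = re.compile(r'^\[submodule\s+"([^"]+)"\]$')


def submodule_branches(gitmodules_text: str) -> Dict[str, str]:
    """Map each submodule name in a .gitmodules file to its `branch` value."""
    branches: Dict[str, str] = {}
    current: Optional[str] = None
    for raw in gitmodules_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and git-config comments
        match = HEADER.match(line)
        if match:
            current = match.group(1)
            continue
        if line.startswith("["):
            current = None  # some other section; stop attributing keys
            continue
        if current is not None and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key == "branch":
                branches[current] = value
    return branches


def commit_on_branch(repo_dir: str, commit: str, branch: str, remote: str = "origin") -> bool:
    """True if `commit` is reachable from `<remote>/<branch>` (requires a prior fetch)."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "merge-base", "--is-ancestor", commit, f"{remote}/{branch}"],
        capture_output=True,
    )
    # Exit code 0: is an ancestor; 1: is not; anything else: git reported an error.
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.decode(errors="replace").strip())
    return result.returncode == 0
```

For example, with a hypothetical entry `[submodule "automodel"]` containing `branch = nemo-rl-submodule`, `submodule_branches(...)` returns `{"automodel": "nemo-rl-submodule"}`, and `commit_on_branch(path, commit, "nemo-rl-submodule")` then reports whether the pinned commit lives on that branch.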

feat: Support qwen3-next, mcore path

Waiting for https://github.com/NVIDIA-NeMo/RL/pull/1568 before rebasing; this should truly reduce the PR to just configs and test scripts.

[General] Improve Logging

@terrykong, can you assign this to me?

Stable Diffusion Dataset

I genuinely dislike piping scripts from the internet into bash. Not only does it pose a security risk, but we also need to freeze rclone to a specific version. https://github.com/mlcommons/training/pull/757...
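A common alternative to piping an install script into bash is to download a specific, pinned release archive and verify it against a known SHA-256 digest before installing it. The sketch below shows only the verification step; the expected digest is an assumption you must take from a source you trust for the version you freeze, and nothing here reflects the actual change in the linked PR.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the archive does not match the pinned checksum."""
    actual = sha256_of(path)
    if actual != expected_sha256.strip().lower():
        raise ValueError(
            f"checksum mismatch for {path}: got {actual}, expected {expected_sha256}"
        )
```

Failing closed on a mismatch means a tampered or unexpectedly updated archive stops the build instead of being installed silently.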

CLI Architecture Improvement Proposal

A quick note: `ng` might not be a good entrypoint, since it conflicts with the [Angular](https://angular.dev/cli) CLI.
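A quick way to spot this kind of clash is to check whether a candidate command name already resolves to an executable on `PATH`. This minimal sketch uses `shutil.which`; it only sees what is installed on the machine it runs on (for example, `ng` is present only where the Angular CLI is installed), and it does not detect shell aliases or functions.

```python
import shutil
import sys
from typing import Dict, Iterable, Optional


def find_conflicts(names: Iterable[str], search_path: Optional[str] = None) -> Dict[str, str]:
    """Return {name: executable_path} for each name that already exists on PATH."""
    conflicts: Dict[str, str] = {}
    for name in names:
        hit = shutil.which(name, path=search_path)  # None if not found
        if hit is not None:
            conflicts[name] = hit
    return conflicts


if __name__ == "__main__":
    for name, location in find_conflicts(sys.argv[1:]).items():
        print(f"{name!r} is already taken by {location}")
```

For example, `python check_names.py ng` prints the path of any existing `ng` executable, such as the one the Angular CLI installs.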