Ahmad Kiswani
I've encountered the same problem, and after two days of debugging, I believe I've figured it out. The error is related to neither the GPU model nor the CUDA version....
The base Docker image `nvcr.io/nvidia/pytorch:22.12-py3` is over 18 GB. You can use `docker info` to check where Docker stores the images (`/var/lib/docker/overlay2` on Debian-based systems), but I can see you have...
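A minimal sketch for checking whether a filesystem has room for an image of that size, assuming free disk space is the concern here. The `has_room` helper and the path passed to it are illustrative, not part of any Docker tooling; point it at whatever directory `docker info` reports on your machine.

```python
import shutil


def has_room(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9


if __name__ == "__main__":
    # Hypothetical check: "/" stands in for the Docker data directory,
    # and 18 is the approximate image size mentioned above.
    print(has_room("/", 18))
```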
Not stale. @gautham-kollu, what are the next steps to merge the PR, as it's already approved?
should close https://github.com/mlcommons/training/issues/751
With https://github.com/terrykong/Megatron-LM/commit/0d401602bf48046683adfc2542a70613f6e772e6 and after https://github.com/NVIDIA-NeMo/RL/pull/1541 is merged, I'll rebase this PR which should reduce it to just configs and tests.
@terrykong, ready for review. The "Submodule Fast-Forward" failure is probably because the currently used automodel commit `a2db048383cd54b3fafc928df4c30bf7bbf7c430` is not part of the `nemo-rl-submodule` branch as specified in `.gitmodules`. We...
Waiting for https://github.com/NVIDIA-NeMo/RL/pull/1568 before rebasing; this should truly reduce the PR to just configs and test scripts.
@terrykong, can you assign this to me?
I genuinely dislike piping scripts from the internet into bash. Not only does it pose a security risk, but we also need to freeze rclone to a specific version. https://github.com/mlcommons/training/pull/757...
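One way to avoid piping a remote script into bash is to download a pinned release archive and verify its checksum before installing it. A minimal sketch, assuming you already have the expected SHA-256 from the project's release page; the function names `sha256_of` and `verify` are illustrative.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Return True if the file's SHA-256 matches the expected (pinned) digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

An install step would call `verify` on the downloaded archive and abort if it returns `False`.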
A quick note: `ng` might not be a good entrypoint, as it conflicts with the [Angular](https://angular.dev/cli) CLI.