Add option for mutex timeout in distributed optimizer backward hook
What does this PR do ?
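Adds an optional timeout to the mutex acquired in the distributed optimizer's backward hook. This is intended to help debug a hang at https://github.com/NVIDIA/NeMo/blob/f658b6f0445403c338c7371941b1fe644832df48/nemo/core/optim/distributed_adam.py#L131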
Collection: NLP
Changelog
- Add option for mutex timeout in distributed optimizer backward hook
Usage
Run GPT, e.g. with the config at https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml.
Enable the distributed optimizer with model.optim.name=distributed_fused_adam and set the timeout with model.optim.lock_timeout=<seconds>.
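For illustration, a timeout on a mutex can be sketched with the standard library alone. The helper name acquire_with_timeout and the error message are hypothetical and not taken from this PR; the actual change in distributed_adam.py may be structured differently.

```python
import contextlib
import threading
from typing import Iterator, Optional


@contextlib.contextmanager
def acquire_with_timeout(lock: threading.Lock, timeout: Optional[float] = None) -> Iterator[None]:
    """Hold `lock` for the duration of the block, failing loudly on timeout.

    Hypothetical helper: `timeout` is in seconds, None means wait indefinitely.
    """
    # threading.Lock.acquire uses timeout=-1 to mean "block with no time limit".
    acquired = lock.acquire(timeout=-1 if timeout is None else timeout)
    if not acquired:
        # Raising turns a silent hang into an error with a stack trace.
        raise RuntimeError(f"Could not acquire lock within {timeout} seconds")
    try:
        yield
    finally:
        lock.release()
```

A hook body could then be wrapped as `with acquire_with_timeout(self._lock, timeout=lock_timeout): ...`, so a stuck lock surfaces as an exception instead of a hang.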
Jenkins CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
There's no need to comment jenkins on the PR to trigger Jenkins CI.
The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
- [x] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [x] Did you add or update any necessary documentation?
- [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [x] New Feature
- [ ] Bugfix
- [ ] Documentation
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information
Could we replace self._lock itself with the timeout-enabled one? The parent Apex distributed optimizer class also uses self._lock (e.g., here) and we want to catch those as well if they take too long.
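One way this suggestion might look is a drop-in object that keeps the threading.Lock interface but applies a default timeout, so existing `with self._lock:` blocks in a parent class are covered without changes. The class name TimeoutLock and its default_timeout parameter are hypothetical; this is a sketch, not the code from either PR.

```python
import threading
from typing import Optional


class TimeoutLock:
    """Hypothetical stand-in for threading.Lock with a default acquire timeout."""

    def __init__(self, default_timeout: Optional[float] = None) -> None:
        self._lock = threading.Lock()
        self._default_timeout = default_timeout  # seconds; None disables the timeout

    def acquire(self, blocking: bool = True, timeout: float = -1) -> bool:
        # Apply the default only when the caller blocks without its own timeout.
        if blocking and timeout == -1 and self._default_timeout is not None:
            if not self._lock.acquire(timeout=self._default_timeout):
                raise RuntimeError(
                    f"Could not acquire lock within {self._default_timeout} seconds"
                )
            return True
        # Otherwise behave exactly like the wrapped lock.
        return self._lock.acquire(blocking, timeout)

    def release(self) -> None:
        self._lock.release()

    def locked(self) -> bool:
        return self._lock.locked()

    def __enter__(self) -> bool:
        return self.acquire()

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self.release()
```

Assigning `self._lock = TimeoutLock(default_timeout=lock_timeout)` after the parent constructor runs would then route every blocking acquisition through the timeout, assuming the parent only uses the lock via `acquire`, `release`, or `with`.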
Testing a modified version in this draft PR: https://github.com/NVIDIA/NeMo/pull/9087
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.