Add option for mutex timeout in distributed optimizer backward hook

Open timmoon10 opened this issue 1 year ago • 3 comments

What does this PR do?

Adds an option for a mutex timeout in the distributed optimizer's backward hook, to help debug a hang at https://github.com/NVIDIA/NeMo/blob/f658b6f0445403c338c7371941b1fe644832df48/nemo/core/optim/distributed_adam.py#L131

Collection: NLP

Changelog

  • Add option for mutex timeout in distributed optimizer backward hook

Usage

Run GPT, e.g. with the config at https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml.

Enable the distributed optimizer with model.optim.name=distributed_fused_adam and set the timeout with model.optim.lock_timeout=<seconds>.
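
This description does not show how lock_timeout is applied inside the optimizer. As a rough illustration of the general idea only, a lock acquisition with a timeout can raise an error instead of blocking forever, which turns a silent hang into a traceback. The sketch below uses only Python's standard threading module; the helper name lock_with_timeout is hypothetical and is not a NeMo or Apex API.

```python
import threading
from contextlib import contextmanager


@contextmanager
def lock_with_timeout(lock, timeout=None):
    """Hold `lock` for the duration of the block, raising if it cannot be acquired in time.

    Hypothetical helper for illustration; not part of NeMo or Apex.
    """
    # threading.Lock.acquire treats timeout=-1 as "wait indefinitely",
    # so timeout=None here keeps the ordinary blocking behavior.
    acquired = lock.acquire(timeout=-1 if timeout is None else timeout)
    if not acquired:
        raise RuntimeError(f"Could not acquire lock within {timeout} seconds")
    try:
        yield
    finally:
        lock.release()
```

With a finite timeout, a thread stuck waiting on the lock fails with a RuntimeError that points at the contended code path, rather than stalling the whole job.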

Jenkins CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

There's no need to comment jenkins on the PR to trigger Jenkins CI. The GitHub Actions CI will run automatically when the PR is opened. To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • [x] Make sure you read and followed Contributor guidelines
  • [ ] Did you write any new necessary tests?
  • [x] Did you add or update any necessary documentation?
  • [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • [ ] Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [x] New Feature
  • [ ] Bugfix
  • [ ] Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

timmoon10, May 01 '24 23:05

Could we replace self._lock itself with the timeout-enabled one? The parent Apex distributed optimizer class also uses self._lock (e.g., here) and we want to catch those as well if they take too long.
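
One possible reading of this suggestion, sketched under stated assumptions: if the parent class guards its critical sections with `with self._lock:`, then swapping the attribute for an object with the same interface and a built-in timeout covers the parent's uses too. The classes TimeoutLock, ParentOptimizer and ChildOptimizer below are hypothetical stand-ins, not the Apex or NeMo classes, and the way the parent acquires its lock is an assumption.

```python
import threading


class TimeoutLock:
    """Hypothetical stand-in for threading.Lock whose `with` block times out."""

    def __init__(self, timeout=None):
        self._lock = threading.Lock()
        self._timeout = timeout  # seconds; None means wait indefinitely

    def acquire(self, blocking=True, timeout=-1):
        # Explicit acquire calls keep threading.Lock semantics.
        return self._lock.acquire(blocking, timeout)

    def release(self):
        self._lock.release()

    def locked(self):
        return self._lock.locked()

    def __enter__(self):
        timeout = -1 if self._timeout is None else self._timeout
        if not self._lock.acquire(timeout=timeout):
            raise RuntimeError(f"Lock not acquired within {self._timeout} seconds")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._lock.release()
        return False  # never suppress exceptions from the guarded block


class ParentOptimizer:
    """Stand-in for a parent class that owns and uses self._lock."""

    def __init__(self):
        self._lock = threading.Lock()

    def step(self):
        with self._lock:  # assumed usage pattern in the parent
            return "stepped"


class ChildOptimizer(ParentOptimizer):
    """Stand-in for a subclass that replaces the lock after parent init."""

    def __init__(self, lock_timeout=None):
        super().__init__()
        self._lock = TimeoutLock(lock_timeout)
```

Because the replacement happens on the instance attribute, ParentOptimizer.step needs no changes to pick up the timeout.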

minitu, May 01 '24 23:05

Testing a modified version in this draft PR: https://github.com/NVIDIA/NeMo/pull/9087

minitu, May 02 '24 00:05

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

github-actions[bot], May 16 '24 01:05

This PR was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot], May 23 '24 01:05