Fail the job if flag HIP_HOST_UNCACHED_MEMORY is not set on MI350x
As the title says. The check is placed after initTransportsRank because the GPU arch info in comm->topo->nodes is only populated once initTransportsRank has run.
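For readers unfamiliar with the change, the kind of check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `hostUncachedMemoryFlagOk` is hypothetical, the `gfx950` arch prefix for MI350x is an assumption, and "set" is taken to mean that the environment variable is present and non-empty. The real check reads the arch from comm->topo->nodes rather than taking a string.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not RCCL's actual code): returns true when the job may proceed.
// Assumption: MI350x GPUs report an arch name that starts with "gfx950".
bool hostUncachedMemoryFlagOk(const std::string& gcnArchName) {
  const bool isMi350x = gcnArchName.rfind("gfx950", 0) == 0;
  if (!isMi350x) {
    return true;  // other architectures are not subject to this check
  }
  const char* flag = std::getenv("HIP_HOST_UNCACHED_MEMORY");
  // Assumption: "set" means the variable exists and is non-empty.
  return flag != nullptr && flag[0] != '\0';
}
```

A caller would run this only after the topology has been filled in, and return an error from initialization when it yields false.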
Looks good to me. Thanks for the fix @dmwu!
@dmwu, does Meta have any interest in contributing back the buck2 targets files and build infra to RCCL? I think this may reduce future issues for you, as AMD may be able to catch issues like this sooner.
@alex-breslow-amd This is a good idea. Let me discuss this internally and follow up.
@nileshnegi The TheRock CI multi-node tests keep failing. Mind taking a look? Thanks!
there's a credentials issue with TheRock multi-node CI at the moment. while folks resolve that, we can merge this PR.
@nileshnegi I don't see the merge option available on this page. Is it because of the failing CI?