Fail the job if flag HIP_HOST_UNCACHED_MEMORY is not set on MI350x

Open • dmwu opened this issue 4 months ago • 1 comment

As the title says. The check is placed after initTransportsRank because the GPU arch info in comm->topo->nodes is only populated after that call.
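A minimal sketch of what such a check could look like, written as standalone C++ rather than actual RCCL code. It makes several assumptions: `GpuNode` is a hypothetical stand-in for the per-GPU entries in comm->topo->nodes, `gfx950` is taken to be the arch name reported on MI350x, and HIP_HOST_UNCACHED_MEMORY is treated as an environment variable that counts as set when its value is `1`. None of these details are confirmed by this PR.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-GPU data found in comm->topo->nodes.
struct GpuNode {
  std::string gcnArchName;  // e.g. "gfx950:sramecc+:xnack-"
};

enum class CheckResult { Ok, MissingFlag };

// True when any GPU in the topology reports the assumed MI350x arch prefix.
bool hasMi350x(const std::vector<GpuNode>& gpus) {
  for (const GpuNode& gpu : gpus) {
    if (gpu.gcnArchName.rfind("gfx950", 0) == 0) return true;
  }
  return false;
}

// Assumption: the flag is an environment variable and "set" means "1".
bool flagEnabled(const char* value) {
  return value != nullptr && std::string(value) == "1";
}

// Fails only when an MI350x GPU is present and the flag is not enabled.
CheckResult checkUncachedHostMemory(const std::vector<GpuNode>& gpus,
                                    const char* flagValue) {
  if (hasMi350x(gpus) && !flagEnabled(flagValue)) {
    return CheckResult::MissingFlag;
  }
  return CheckResult::Ok;
}

// Convenience wrapper that reads the flag from the process environment.
CheckResult checkUncachedHostMemoryFromEnv(const std::vector<GpuNode>& gpus) {
  return checkUncachedHostMemory(gpus, std::getenv("HIP_HOST_UNCACHED_MEMORY"));
}
```

In a real integration the caller would run this only once the topology is known (here, after initTransportsRank) and turn `MissingFlag` into a job failure with a message naming the flag.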

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: "Internal", or link to GitHub issue (if applicable).

What were the changes?
One sentence describing the work done.

Why were the changes made?
Explain the motivation behind the work. Provide any publicly-available historical context.

How was the outcome achieved?
Technical details behind the work. Explain any publicly-available hardware peculiarities.

Additional Documentation:
What else should the reviewer know?

Approval Checklist

Do not approve until these items are satisfied.

  • [ ] Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

dmwu, Nov 01 '25 19:11

Looks good to me. Thanks for the fix @dmwu!

@dmwu, does Meta have any interest in contributing back the buck2 targets files and build infra to RCCL? I think this may reduce future issues for you, as AMD may be able to catch issues like this sooner.

@alex-breslow-amd This is a good idea. Let me discuss this internally and follow up.

dmwu, Nov 03 '25 18:11

@nileshnegi The TheRock CI multi-node tests keep failing. Mind taking a look? Thanks!

dmwu, Nov 10 '25 17:11

> @nileshnegi The TheRock CI multi-node tests keep failing. Mind taking a look? Thanks!

There's a credentials issue with TheRock multi-node CI at the moment. While folks resolve that, we can merge this PR.

nileshnegi, Nov 10 '25 17:11

@nileshnegi I don't see the option to merge the PR on this page. Is it because of the failing CI?

dmwu, Nov 10 '25 17:11