uvos

Results 201 comments of uvos

So the proximate cause of the errors below Note 1 and Note 3 is ncclGroupEnd() returning ncclUnhandledCudaError here: https://github.com/ROCm/rccl/blob/6d34fb76321600d5693b24f1edc875605c5cc638/src/init.cc#L2428 However, things already go off the rails before that with ncclCalloc...

rccl also appears to have a double free in ncclCommInitAll_impl, as gpuFlags is freed here: https://github.com/ROCm/rccl/blob/6d34fb76321600d5693b24f1edc875605c5cc638/src/init.cc#L2417 and if any of the other calls with NCCLCHECKGOTO( fail it is freed again...

Compiling rccl with "-O1 -fno-strict-aliasing" solves ncclCalloc not allocating any memory; clearly rccl has UB. But this does not solve ncclGroupEnd failing.

Note the GPUs are connected via PCIe only, not xGMI/IF. Further tracing has the failure occurring at: https://github.com/ROCm/rccl/blob/9aa5b9f02e20cc95e465004af728b907ab178fe9/src/group.cc#L331 due to the other thread failing in ncclCommInitRankFunc. I don't think further...

> > and if any of the other calls with NCCLCHECKGOTO( fail it is again freed here: > > That part should be fine, it sets it to nullptr after...

@LunNova Ok, so the issue is actually that in ROCm 6.2.x HSA_ENABLE_IPC_MODE_LEGACY is ineffective: https://github.com/ROCm/ROCR-Runtime/blob/df7549038b458c9387a2c6ea8d9328e3c9e6620c/src/core/util/flag.h#L233 IMO the main problem here is that ROCR simply chooses kfd_ipc on the mainline...

This shows the issues under Note 1 and Note 3 are ROCR's fault and not RCCL's; however, the issue under Note 2 seems wholly unrelated, a different problem, and the UB...

No, that's a terrible idea. MAP_PRIVATE requires the kernel to reserve space for the file in its overcommit total, as the kernel might need to find the RAM/swap to store...
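For anyone unsure why a private mapping needs backing memory at all, here is a minimal sketch (Linux/Unix only, using Python's `mmap` module). It only shows the copy-on-write behaviour: writes to a writable MAP_PRIVATE mapping never reach the file, so the modified pages have to live in RAM or swap, whereas MAP_SHARED writes go back to the file. The helper names and the demo file are made up for illustration, and this does not measure the kernel's overcommit accounting itself.

```python
import mmap
import os
import tempfile


def make_demo_file() -> str:
    """Create a small temporary file (hypothetical demo data) and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"hello world")
    return path


def write_through_mapping(path: str, flags: int) -> bytes:
    """Map `path` read/write with `flags`, overwrite the first 5 bytes in
    memory, then return what the file on disk contains afterwards."""
    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0, flags=flags,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            mm[:5] = b"HELLO"
            if flags == mmap.MAP_SHARED:
                mm.flush()  # push the dirty pages back to the file
        finally:
            mm.close()
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    path = make_demo_file()
    try:
        # Private mapping: the change stays in anonymous copy-on-write pages.
        print(write_through_mapping(path, mmap.MAP_PRIVATE))
        # Shared mapping: the change is written back to the file.
        print(write_through_mapping(path, mmap.MAP_SHARED))
    finally:
        os.remove(path)
```

The private case is the one that costs commit charge: every page that gets written has nowhere to go except RAM or swap.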

I am having the same problem with current dev when trying to profile the vgprbound kernel in the occupancy.hip example in this repo. I am on ROCm 6.3.2 and rocprofiler-compute...

This is in fact a bug with rocprofv1 specifically (which seems to get chosen by default); setting ROCPROF=rocprofv2 makes the supported metrics show up. Understanding what metrics are supported is...
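If you drive the profiler from a script, a rough sketch of applying the workaround is to set ROCPROF=rocprofv2 in the child's environment. The helper names below are hypothetical, and the child command is a stand-in that just echoes the variable; substitute whatever profiler command line you actually use.

```python
import os
import subprocess
import sys


def env_with_rocprofv2(base=None):
    """Return a copy of the environment with ROCPROF forced to rocprofv2."""
    env = dict(os.environ if base is None else base)
    env["ROCPROF"] = "rocprofv2"
    return env


def run_with_rocprofv2(cmd):
    """Run `cmd` (a list of arguments) with ROCPROF=rocprofv2 set for the child only."""
    return subprocess.run(cmd, env=env_with_rocprofv2(), check=True,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in child process: prints the ROCPROF value it received.
    child = [sys.executable, "-c", "import os; print(os.environ['ROCPROF'])"]
    print(run_with_rocprofv2(child).stdout.strip())
```

Setting the variable per child process keeps the rest of the shell session on the default selection.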