Missed synchronization between kernel completion and subsequent dependent data transfer results in an error
With ROCm OpenMP 5.7.1 and earlier, target regions may intermittently compute wrong results. The cause is a missed synchronization between kernel completion and a subsequent dependent data transfer.
If this behavior is observed, run the application with the following environment variable set:
HSA_ENABLE_SDMA=0
Note: Setting this environment variable may reduce performance.
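Because the defect appears randomly, a single passing run says little. The sketch below is one possible way to compare failure counts with and without the workaround. It is not part of ROCm; the helper names `sdma_disabled_env` and `count_failures` and the test binary `./my_openmp_test` are hypothetical, and it assumes the application signals a wrong result through a non-zero exit code.

```python
import os
import subprocess
import sys


def sdma_disabled_env(base=None):
    """Return a copy of the environment with HSA_ENABLE_SDMA=0 set.

    `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["HSA_ENABLE_SDMA"] = "0"
    return env


def count_failures(cmd, runs, env):
    """Run `cmd` `runs` times with `env`; return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(cmd, env=env)
        if result.returncode != 0:
            failures += 1
    return failures


# Hypothetical usage (assumes ./my_openmp_test exits non-zero on wrong results):
#   cmd = ["./my_openmp_test"]
#   baseline = count_failures(cmd, 20, dict(os.environ))
#   mitigated = count_failures(cmd, 20, sdma_disabled_env())
#   print(f"failures without workaround: {baseline}, with: {mitigated}")
```

If the failures disappear only in the runs with `HSA_ENABLE_SDMA=0`, that is consistent with the symptoms described here, though it does not prove this defect is the cause.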
Operating System
Ubuntu 22.04 with AMDGPU 6.2.4 driver
CPU
AMD EPYC 7A53 64-Core Processor, AMD EPYC 7313 16-Core Processor, and others
GPU
MI200, MI100, Radeon Pro W6800
ROCm Version
ROCm 5.7.0, 5.7.1
ROCm Component
No response
Steps to Reproduce
No response
Output of /opt/rocm/bin/rocminfo --support
NA
Can you please confirm if this still exists in 6.0? If it was fixed, where was the fix made?
Judging only from what I believe are symptoms of this issue, it still exists in ROCm 6.0.0 and ROCm 6.0.2 (amdgpu driver version 6.3.6).
Without the mitigation mentioned above, we see spurious failures in OpenMP tests. These failures go away when we set HSA_ENABLE_SDMA=0.