Missed synchronization between kernel completion and subsequent dependent data transfer results in an error

Open • Rmalavally opened this issue 2 years ago • 2 comments

With ROCm OpenMP 5.7.1 and earlier, target regions may intermittently compute wrong results. The cause is a missed synchronization between kernel completion and a subsequent dependent data transfer.

If this behavior is observed, run the application with the following environment variable set:

HSA_ENABLE_SDMA=0

Note: Setting this environment variable may affect performance.
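
As a minimal sketch of how the workaround might be applied from a POSIX shell: the snippet below exports the variable for the current session and prints it back to confirm the value child processes will inherit. The binary name `./my_openmp_app` is a hypothetical placeholder, not something named in this issue.

```shell
# Set the workaround for the current shell session; every process
# launched from this shell inherits it.
export HSA_ENABLE_SDMA=0

# Confirm the value that child processes will see.
printenv HSA_ENABLE_SDMA

# Alternative: set it for a single run only.
# ./my_openmp_app is a hypothetical placeholder for your own binary.
# HSA_ENABLE_SDMA=0 ./my_openmp_app
```

The per-run form may be preferable when only one application shows the wrong results, since it leaves other workloads in the same session unaffected.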

Operating System

Ubuntu 22.04 with AMDGPU 6.2.4 driver

CPU

AMD EPYC 7A53 64-Core Processor, AMD EPYC 7313 16-Core Processor, and others

GPU

MI200, MI100, Radeon Pro W6800

ROCm Version

ROCm 5.7.0, 5.7.1

ROCm Component

No response

Steps to Reproduce

No response

Output of /opt/rocm/bin/rocminfo --support

NA

Rmalavally, Oct 31 '23 15:10

Can you please confirm if this still exists in 6.0? If it was fixed, where was the fix made?

prckent, Dec 20 '23 18:12

Judging only from what I believe are its symptoms, this issue still exists in ROCm 6.0.0 and ROCm 6.0.2 (amdgpu driver version 6.3.6). Without the mitigation mentioned above, we see spurious failures in OpenMP tests. These failures go away when we set HSA_ENABLE_SDMA=0.

jplehr, Feb 15 '24 10:02