
The additional thread for GPU

Open ye-luo opened this issue 4 years ago • 2 comments

It seems that the offload plugin, or something further down in HSA, creates an additional thread to communicate with the GPU. This additional thread seems to float around and compete with the regular OpenMP threads. Its affinity is also hard to control, which makes performance very complicated to reason about. HIP got rid of such an additional thread with "Direct Dispatch", as the ROCm 4.5 release notes mention.

ye-luo avatar Nov 04 '21 19:11 ye-luo

The plugin doesn't spawn additional threads on trunk. AOMP will spawn one if the kernel uses host services, e.g. malloc or printf, which, if I understand correctly, QMCPACK doesn't. I think HSA does create additional threads, but not to launch kernels. I'll see if I can find the release notes in question.

quoting https://rocmdocs.amd.com/en/latest/Current_Release_Notes/Current-Release-Notes.html

In this release, for Direct Dispatch, the runtime directly queues a packet to the AQL queue (user mode queue to GPU) in Dispatch and some of the synchronization. This new functionality indicates the total latency of the HIP Dispatch API and the latency to launch the first wave on the GPU.

OpenMP has done that for ~ a year or so, since dropping the ATMI dependency. So while there may be (probably are) additional threads, that one ^ was never a feature. I'm going to leave this open as it would be a good idea to work out what extra threads HSA spawns and how they can be controlled.

JonChesterfield avatar Nov 08 '21 11:11 JonChesterfield

FYI: From a miniQMC run

OMP_NUM_THREADS=8 rocprof --hsa-trace ./bin/check_spo -n 1

3916422 is not an OpenMP thread, but it only makes many hsa_system_get_info calls during initialization. I'm referring to 3916430, which is not an OpenMP thread and calls hsa_signal_wait_scacquire all the time.

Screenshot from 2021-11-14 13-58-18
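
A rough cross-check that does not need a trace, assuming Linux: count the OS threads of the process and compare the result against OMP_NUM_THREADS. This is only a sketch; `count_process_threads` is a hypothetical helper, not part of OpenMP, HSA, or rocprof, and it reports how many threads exist, not which runtime created them.

```c
#include <ctype.h>
#include <dirent.h>
#include <stddef.h>

/* Hypothetical helper (Linux only): counts the entries in /proc/self/task,
   which holds one directory per OS thread of the calling process.
   Returns -1 if the directory cannot be opened. */
int count_process_threads(void)
{
    DIR *dir = opendir("/proc/self/task");
    if (dir == NULL) {
        return -1;
    }

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* Skip "." and ".."; thread directories are named by numeric TID. */
        if (isdigit((unsigned char)entry->d_name[0])) {
            ++count;
        }
    }
    closedir(dir);
    return count;
}
```

Calling it once after the first target region has run and finding a count larger than the number of OpenMP threads would suggest that some runtime has started helper threads.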

ye-luo avatar Nov 14 '21 20:11 ye-luo

We need a long discussion on this. For now, the thread can be kept from starting by removing printf, malloc, and free from target regions. They need to be removed even if they are never executed.
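
A minimal sketch of one way to do that, assuming the calls are only wanted for debugging: hide them behind a preprocessor macro so they are absent from the compiled target region rather than merely skipped at run time. `DEBUG_DEVICE_PRINT`, `DEVICE_LOG`, and `device_sum` are hypothetical names chosen for illustration.

```c
#include <stdio.h>

/* Hypothetical switch: define DEBUG_DEVICE_PRINT to compile printf into the
   target region. When it is left undefined, the preprocessor drops the call,
   so no printf appears in the region at all. */
#ifdef DEBUG_DEVICE_PRINT
#define DEVICE_LOG(...) printf(__VA_ARGS__)
#else
#define DEVICE_LOG(...) ((void)0)
#endif

/* Sums n doubles inside a target region. When the compiler is invoked
   without OpenMP offloading, the pragma is ignored and the loop runs
   on the host. */
double device_sum(const double *x, int n)
{
    double total = 0.0;
#pragma omp target teams distribute parallel for reduction(+ : total) map(to : x[0:n]) map(tofrom : total)
    for (int i = 0; i < n; ++i) {
        /* A run-time guard such as `if (verbose) printf(...)` would still
           leave the call in the region, which is the case described above. */
        DEVICE_LOG("x[%d] = %f\n", i, x[i]);
        total += x[i];
    }
    return total;
}
```

Building with `-DDEBUG_DEVICE_PRINT` brings the printf back for debugging; the default build keeps the region free of host-service calls.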

gregrodgers avatar Oct 18 '22 20:10 gregrodgers

It is unclear to me whether the additional thread is gone or not, so I will assume it is fixed for now. I will open a new issue if it reappears.

ye-luo avatar Apr 28 '23 15:04 ye-luo