Runtime hangs on DG2 (and Gen12 iGPU maybe?)
I'm running into random hangs in my app during normal use. They began several months ago, roughly September 2023. A stack trace is attached at the end. While digging, I found this related comment with the exact same stack trace, although that report was only on DG2 and on an unsupported kernel, whereas I was able to occasionally reproduce this on Gen12 iGPUs and on a much more modern kernel version.
I'm using whisper.cpp with its OpenCL backend to run arbitrary speech-to-text. If one thread hangs, all other runtime threads hang as well, spinning multiple cores at 100%.
I'm very new to all of this so please let me know if there's any information I can supply :)
Host details:
- GPU: Arc A770
- Arch Linux w/ kernel 6.7.3-arch1-1.1
- intel-compute-runtime-23.48.27912.11-1
Looking at the backtrace:
- 8 "tokio-runtime-w" threads have yielded their execution in NEO::CommandStreamReceiver::baseWaitFunction()
- 1 "scripty_stt_ser" thread is futex-waiting for worker close in NEO::DrmGemCloseWorker::worker()
- 1 "scripty_stt_ser" thread is Tokio Rust code hanging directly in the futex_wait() syscall
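For anyone who wants to cross-check a breakdown like the one above without attaching a debugger, here is a minimal sketch that lists each thread's name and scheduler state from procfs. It is Linux-only and uses just the standard library; `ThreadInfo` and `list_threads` are hypothetical names made up for this example, not part of compute-runtime or Tokio. Linux truncates thread names to 15 bytes, which is why names such as "tokio-runtime-w" appear cut off.

```rust
use std::fs;
use std::io;

/// Name and scheduler state of one thread, as reported by procfs.
/// (Hypothetical helper type for this sketch.)
#[derive(Debug)]
pub struct ThreadInfo {
    pub tid: u32,
    pub name: String,  // truncated by the kernel to 15 bytes
    pub state: String, // e.g. "R (running)" or "S (sleeping)"
}

/// Lists the threads of process `pid` by reading /proc/<pid>/task (Linux only).
pub fn list_threads(pid: u32) -> io::Result<Vec<ThreadInfo>> {
    let mut threads = Vec::new();
    for entry in fs::read_dir(format!("/proc/{pid}/task"))? {
        let entry = entry?;
        // Directory names under task/ are the numeric thread IDs.
        let tid = match entry.file_name().to_string_lossy().parse::<u32>() {
            Ok(tid) => tid,
            Err(_) => continue,
        };
        // A thread can exit between read_dir and this read; skip it if so.
        let status = match fs::read_to_string(entry.path().join("status")) {
            Ok(s) => s,
            Err(_) => continue,
        };
        // Pull a single "Key:\tvalue" field out of the status file.
        let field = |key: &str| {
            status
                .lines()
                .find_map(|l| l.strip_prefix(key))
                .map(|v| v.trim().to_string())
                .unwrap_or_default()
        };
        threads.push(ThreadInfo {
            tid,
            name: field("Name:"),
            state: field("State:"),
        });
    }
    Ok(threads)
}
```

Calling `list_threads` with the PID of the hung process and printing the result gives one line per thread. Threads that are spinning would be expected to show up in state R, and threads blocked on a futex in state S, though this only complements a real backtrace and does not replace it.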
I have the same issue when running OpenVINO Model Server.
Sorry it took me so long to get back to this.
Looking at the backtrace:
- 1 "scripty_stt_ser" thread is Tokio Rust code hanging directly in the futex_wait() syscall
From what I've seen in the code, this runtime worker appears to be waiting for compute runtime code to return, which makes me think that is the issue. Disabling the OpenCL runtime and falling back to CPU makes the issue disappear completely, even after weeks of uptime, whereas with OpenCL enabled it usually runs for at most 1 week before it locks up and starts spinning on the CPU.
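Since the hang can take up to a week to appear, one way to at least detect it is to put a deadline around the blocking call. The following is a minimal sketch using only the standard library; `run_with_deadline` and `Watched` are hypothetical names made up for this example, and the closure stands in for whatever call ends up inside the OpenCL runtime. It does not fix or cancel the hang, it only lets the caller notice it.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Outcome of a call that was given a deadline.
/// (Hypothetical helper type for this sketch.)
#[derive(Debug, PartialEq)]
pub enum Watched<T> {
    /// The call returned in time.
    Finished(T),
    /// The deadline passed; the worker thread is still running (or stuck).
    TimedOut,
    /// The worker thread panicked or exited without sending a result.
    WorkerDied,
}

/// Runs `job` on a dedicated OS thread and waits at most `deadline` for it.
///
/// On timeout the worker thread is NOT killed (Rust has no safe way to do
/// that); it is only left detached so the caller can log, alert, or restart.
pub fn run_with_deadline<T, F>(deadline: Duration, job: F) -> Watched<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore send errors: the receiver is gone if we already timed out.
        let _ = tx.send(job());
    });
    match rx.recv_timeout(deadline) {
        Ok(value) => Watched::Finished(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Watched::TimedOut,
        Err(mpsc::RecvTimeoutError::Disconnected) => Watched::WorkerDied,
    }
}
```

A caller would wrap the transcription call in the closure and treat `Watched::TimedOut` as the signal to capture a backtrace and restart the process. Because a stuck thread may keep spinning a core, restarting the whole process is probably the only reliable recovery; inside a Tokio application this function would itself need to run via `spawn_blocking` so it does not block an async worker.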