C-Thread-Pool icon indicating copy to clipboard operation
C-Thread-Pool copied to clipboard

[BUGFIX] Fix thpool_destroy() hang: unlock all workers at once

Open nil0x42 opened this issue 8 months ago • 1 comments

Hi, i'm nil0x42, the creator of duplicut I wrote this patch for duplicut's copy of thpool.c, as i had a strange timeout in some unit tests that arised only once per ~1000 runs.

unblock all workers in thpool_destroy(), fix hang with medium-to-large pools

thpool_destroy() is supposed to wake every worker, wait until the pool is empty, and return quickly. In practice it can block for dozens of seconds as soon as the pool size out-grows its one-second “fast-exit” window.

Root cause

  • bsem_post_all() sets v = 1 and broadcasts.
  • The first worker that wakes executes bsem_wait() → while (v==0) … ; v = 0; … → the single ticket is consumed, all other awakened threads fall straight back into pthread_cond_wait().
  • After that first second, the destroy code slow-polls: one bsem_post_all() + sleep(1) per loop → one ticket per second.

With 4000 threads the destructor needs ~4000 s; even with only 16 threads (real-world duplicut run) it can exceed a watchdog such as timeout 5, hanging about 1 ‰ of the executions.

Fix (pure ANSI C / POSIX, no API change)

  1. bsem_wait() now consumes exactly one ticket:
/* take one ticket */
bsem_p->v--;
  1. bsem_post_all() grants “infinite” tickets so every waiter can pass:
bsem_p->v = INT_MAX;
pthread_cond_broadcast(&bsem_p->cond);

The single-post path (bsem_post()) is unchanged: one post still wakes one thread. All accesses to v remain protected by the semaphore mutex, so no new data races or lock-order inversions are introduced.

Reproducer added

tests/thpool_destroy_hang.sh tests/src/thpool_destroy_hang.c

cd tests/
./destroy_hang_deterministic.sh # hangs with old code, passes (<0.2 s) with patch
  • 4000-thread pool hangs 100 % of the time without the patch.
  • 16-thread pool matches the sporadic timeout seen in duplicut.

Impact

Scenario before after
thpool_destroy() – 4000 threads 60 s+ 170 ms
duplicut (16 threads) 1 ‰ hangs (timeout 5) 0

The change makes pool shutdown reliable regardless of the thread count and keeps behaviour identical in every other respect.

nil0x42 avatar May 16 '25 16:05 nil0x42

Good, this solves #134 I try to solve this by remove the all bsem_* functions and use the cond. This solve change the code little.

npc1054657282 avatar May 18 '25 11:05 npc1054657282