[BUGFIX] Fix thpool_destroy() hang: unlock all workers at once
Hi, i'm nil0x42, the creator of duplicut I wrote this patch for duplicut's copy of thpool.c, as i had a strange timeout in some unit tests that arised only once per ~1000 runs.
unblock all workers in thpool_destroy(), fix hang with medium-to-large pools
thpool_destroy() is supposed to wake every worker, wait until the pool
is empty, and return quickly.
In practice it can block for dozens of seconds as
soon as the pool size out-grows its one-second “fast-exit” window.
Root cause
-
bsem_post_all()setsv = 1and broadcasts. - The first worker that wakes executes
bsem_wait() → while (v==0) … ; v = 0; …→ the single ticket is consumed, all other awakened threads fall straight back intopthread_cond_wait(). - After that first second, the destroy code slow-polls:
one
bsem_post_all()+sleep(1)per loop → one ticket per second.
With 4000 threads the destructor needs ~4000 s; even with only 16
threads (real-world duplicut run) it can exceed a watchdog such as
timeout 5, hanging about 1 ‰ of the executions.
Fix (pure ANSI C / POSIX, no API change)
-
bsem_wait()now consumes exactly one ticket:
/* take one ticket */
bsem_p->v--;
-
bsem_post_all()grants “infinite” tickets so every waiter can pass:
bsem_p->v = INT_MAX;
pthread_cond_broadcast(&bsem_p->cond);
The single-post path (bsem_post()) is unchanged: one post still wakes
one thread.
All accesses to v remain protected by the semaphore mutex, so no new
data races or lock-order inversions are introduced.
Reproducer added
tests/thpool_destroy_hang.sh
tests/src/thpool_destroy_hang.c
cd tests/
./destroy_hang_deterministic.sh # hangs with old code, passes (<0.2 s) with patch
- 4000-thread pool hangs 100 % of the time without the patch.
- 16-thread pool matches the sporadic timeout seen in
duplicut.
Impact
| Scenario | before | after |
|---|---|---|
thpool_destroy() – 4000 threads |
60 s+ | 170 ms |
duplicut (16 threads) |
1 ‰ hangs (timeout 5) | 0 |
The change makes pool shutdown reliable regardless of the thread count and keeps behaviour identical in every other respect.
Good, this solves #134
I try to solve this by remove the all bsem_* functions and use the cond.
This solve change the code little.