Thread Pool occasionally crashes
Zig Version
0.14.0
(Running on Ubuntu 22.04)
Steps to Reproduce and Observed Behavior
I'm using a Thread Pool to execute tasks in the background. When spawning many tasks, the program eventually crashes with a stack trace similar to the one below. When checking with a debugger, one thread fails in this region: https://github.com/ziglang/zig/blob/5ad91a646a753cc3eecd8751e61cf458dadd9ac4/lib/std/Thread/Pool.zig#L290-L295
Stacktrace:
thread 44840 panic: reached unreachable code
/home/h4kor/zig/lib/std/posix.zig:4813:19: 0x10848c2 in munmap (zig_debug)
.NOMEM => unreachable, // Attempted to unmap a region in the middle of an existing mapping.
^
/home/h4kor/zig/lib/std/heap/PageAllocator.zig:145:21: 0x10eb4e9 in unmap (zig_debug)
posix.munmap(memory.ptr[0..page_aligned_len]);
^
/home/h4kor/zig/lib/std/heap/PageAllocator.zig:137:17: 0x10eb435 in free (zig_debug)
return unmap(@alignCast(memory));
^
/home/h4kor/zig/lib/std/mem/Allocator.zig:147:25: 0x10e1747 in free (zig_debug)
return a.vtable.free(a.ptr, memory, alignment, ret_addr);
^
/home/h4kor/zig/lib/std/mem/Allocator.zig:147:25: 0x10e3122 in destroy__anon_24255 (zig_debug)
return a.vtable.free(a.ptr, memory, alignment, ret_addr);
^
/home/h4kor/zig/lib/std/Thread/Pool.zig:240:43: 0x10e0250 in runFn (zig_debug)
closure.pool.allocator.destroy(closure);
^
/home/h4kor/zig/lib/std/Thread/Pool.zig:295:32: 0x10e60b4 in worker (zig_debug)
run_node.data.runFn(&run_node.data, id);
^
/home/h4kor/zig/lib/std/Thread.zig:488:13: 0x10e371d in callFn__anon_24422 (zig_debug)
@call(.auto, f, args);
^
/home/h4kor/zig/lib/std/Thread.zig:1378:30: 0x10e2cd1 in entryFn (zig_debug)
return callFn(f, self.fn_args);
^
/home/h4kor/zig/lib/std/os/linux/x86_64.zig:126:5: 0x10e37a1 in clone (zig_debug)
asm volatile (
^
???:?:?: 0x0 in ??? (???)
run
└─ run zig_debug failure
error: the following command terminated unexpectedly:
/home/h4kor/code/zig_debug/zig-out/bin/zig_debug
Build Summary: 5/7 steps succeeded; 1 failed
run transitive failure
└─ run zig_debug failure
Example Code: This only crashes occasionally; run it multiple times.
const std = @import("std");

const Foo = struct {
    fn worker_fn(self: *Foo, i: usize) void {
        _ = self;
        std.time.sleep(i % 1000);
    }
};

pub fn main() !void {
    var ts_allocator = std.heap.ThreadSafeAllocator{
        .child_allocator = std.heap.page_allocator,
        .mutex = std.Thread.Mutex{},
    };
    const allocator = ts_allocator.allocator();

    var pool: std.Thread.Pool = undefined;
    try pool.init(.{ .allocator = allocator, .n_jobs = 4 });

    const foo = try allocator.create(Foo);
    var i: usize = 0;
    while (i < 1_000_000) : (i += 1) {
        try pool.spawn(Foo.worker_fn, .{ foo, i });
    }
}
Expected Behavior
not crashing
I couldn't reproduce the crash you described, but I suspect you're encountering this because you use std.heap.page_allocator (which allocates at least one entire page, i.e. one mapping, per allocation), and munmap returns ENOMEM:
ENOMEM No memory is available, or the process's maximum number of mappings would have been exceeded.
I'd guess you're running out of mappings. What's your vm.max_map_count (check with `sysctl vm.max_map_count`)?
I don't think this is a bug; you should use a different allocator when you don't benefit from whole-page allocations.
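For illustration, a minimal sketch of that workaround (assuming the Zig 0.14.0 std.heap API): back the pool with std.heap.GeneralPurposeAllocator instead of page_allocator. It serves small allocations from shared buckets rather than one mapping each, and is thread-safe by default in multi-threaded builds, so the ThreadSafeAllocator wrapper isn't needed either:

```zig
const std = @import("std");

pub fn main() !void {
    // GeneralPurposeAllocator batches small allocations into shared
    // buckets, so each pool.spawn closure no longer costs one mmap
    // mapping the way page_allocator does.
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit(); // reports leaks in debug builds
    const allocator = gpa.allocator();

    var pool: std.Thread.Pool = undefined;
    try pool.init(.{ .allocator = allocator, .n_jobs = 4 });
    defer pool.deinit();

    // ... spawn tasks as in the original example ...
}
```

This is a sketch, not a tested fix; the point is only that the per-allocation mapping count stays bounded.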
vm.max_map_count = 65530
I'm not 100% sure the example reproduces the problem I encounter in my project. There it crashes with:
signal SIGSEGV: invalid address (fault address: 0x0)
The run_queue has an invalid address as its first node.
I can't provide a condensed example for this yet; I'm trying to create a minimal reproducing example:
pub const PageDirectory = struct {
    pub fn create(allocator: Allocator, fm: *FileManager) !*PageDirectory {
        ...
        // Warm Up
        try self.warm_up_pool.init(.{ .allocator = allocator, .n_jobs = 4 });
        self.warm_up_latch = std.Thread.Mutex{};
        ...
        return self;
    }

    ...

    fn warm_page(self: *PageDirectory, page_id: PageId) void {
        const slot_idx = self.page_map.get(page_id);
        if (slot_idx == null) {
            if (self.load_page(page_id, AccessMode.Read)) |hdl| {
                hdl.latch.unlockShared();
            } else |e| {
                std.log.err("error loading page for warm up, err={}", .{e});
            }
        }
    }

    pub fn warm_up_page(self: *PageDirectory, page_id: PageId) void {
        self.warm_up_latch.lock();
        defer self.warm_up_latch.unlock();
        self.warm_up_pool.spawn(warm_page, .{ self, page_id }) catch |e| {
            std.log.err("error spawning warm up job, err={}", .{e});
        };
    }

    ...
};