Thread Pool occasionally crashes

Open H4kor opened this issue 8 months ago • 2 comments

Zig Version

0.14.0

(Running on Ubuntu 22.04)

Steps to Reproduce and Observed Behavior

I'm trying to use a Thread Pool to execute tasks in the background. When spawning many tasks, the program eventually crashes with a stack trace similar to the one below. In a debugger, one thread fails in this region: https://github.com/ziglang/zig/blob/5ad91a646a753cc3eecd8751e61cf458dadd9ac4/lib/std/Thread/Pool.zig#L290-L295

Stack trace:

thread 44840 panic: reached unreachable code
/home/h4kor/zig/lib/std/posix.zig:4813:19: 0x10848c2 in munmap (zig_debug)
        .NOMEM => unreachable, // Attempted to unmap a region in the middle of an existing mapping.
                  ^
/home/h4kor/zig/lib/std/heap/PageAllocator.zig:145:21: 0x10eb4e9 in unmap (zig_debug)
        posix.munmap(memory.ptr[0..page_aligned_len]);
                    ^
/home/h4kor/zig/lib/std/heap/PageAllocator.zig:137:17: 0x10eb435 in free (zig_debug)
    return unmap(@alignCast(memory));
                ^
/home/h4kor/zig/lib/std/mem/Allocator.zig:147:25: 0x10e1747 in free (zig_debug)
    return a.vtable.free(a.ptr, memory, alignment, ret_addr);
                        ^
/home/h4kor/zig/lib/std/mem/Allocator.zig:147:25: 0x10e3122 in destroy__anon_24255 (zig_debug)
    return a.vtable.free(a.ptr, memory, alignment, ret_addr);
                        ^
/home/h4kor/zig/lib/std/Thread/Pool.zig:240:43: 0x10e0250 in runFn (zig_debug)
            closure.pool.allocator.destroy(closure);
                                          ^
/home/h4kor/zig/lib/std/Thread/Pool.zig:295:32: 0x10e60b4 in worker (zig_debug)
            run_node.data.runFn(&run_node.data, id);
                               ^
/home/h4kor/zig/lib/std/Thread.zig:488:13: 0x10e371d in callFn__anon_24422 (zig_debug)
            @call(.auto, f, args);
            ^
/home/h4kor/zig/lib/std/Thread.zig:1378:30: 0x10e2cd1 in entryFn (zig_debug)
                return callFn(f, self.fn_args);
                             ^
/home/h4kor/zig/lib/std/os/linux/x86_64.zig:126:5: 0x10e37a1 in clone (zig_debug)
    asm volatile (
    ^
???:?:?: 0x0 in ??? (???)
run
└─ run zig_debug failure
error: the following command terminated unexpectedly:
/home/h4kor/code/zig_debug/zig-out/bin/zig_debug 
Build Summary: 5/7 steps succeeded; 1 failed
run transitive failure
└─ run zig_debug failure

Example code (it only crashes occasionally, so run it multiple times):

const std = @import("std");

const Foo = struct {
    fn worker_fn(self: *Foo, i: usize) void {
        _ = self;
        std.time.sleep(i % 1000);
    }
};

pub fn main() !void {
    var ts_allocator = std.heap.ThreadSafeAllocator{
        .child_allocator = std.heap.page_allocator,
        .mutex = std.Thread.Mutex{},
    };
    const allocator = ts_allocator.allocator();

    var pool: std.Thread.Pool = undefined;
    try pool.init(.{ .allocator = allocator, .n_jobs = 4 });

    const foo = try allocator.create(Foo);
    var i: usize = 0;
    while (i < 1_000_000) : (i += 1) {
        try pool.spawn(Foo.worker_fn, .{ foo, i });
    }
}

Expected Behavior

not crashing

H4kor, May 16 '25 08:05

I couldn't reproduce the crash you described, but I suspect you hit it because you use std.heap.page_allocator, which maps at least one whole page per allocation, and munmap then fails with ENOMEM:

  ENOMEM No memory is available, or the process's maximum number of
         mappings would have been exceeded.

I guess you're running out of mappings. What's your vm.max_map_count? I don't think this is a bug; you should use a different allocator when you don't benefit from whole-page allocations.
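
A minimal sketch of that suggestion, assuming Zig 0.14.0: it swaps the page allocator for std.heap.DebugAllocator, which serves small allocations from shared pages and is thread-safe by default, so each spawned closure no longer needs its own mapping. The helper names workerFn and runTasks and the completion counter are hypothetical and only serve the illustration. The sketch also waits on a WaitGroup and calls pool.deinit() before returning, which the reported example does not do. Whether this avoids the crash reported above is untested.

```zig
const std = @import("std");

// Hypothetical task: sleep briefly, then record that it finished.
fn workerFn(counter: *std.atomic.Value(usize), i: usize) void {
    std.time.sleep(i % 1000);
    _ = counter.fetchAdd(1, .monotonic);
}

/// Runs `n` tasks on a 4-thread pool and returns how many completed.
fn runTasks(allocator: std.mem.Allocator, n: usize) !usize {
    var pool: std.Thread.Pool = undefined;
    try pool.init(.{ .allocator = allocator, .n_jobs = 4 });
    // Join the worker threads before the pool goes out of scope.
    defer pool.deinit();

    var counter = std.atomic.Value(usize).init(0);
    var wg: std.Thread.WaitGroup = .{};
    for (0..n) |i| {
        pool.spawnWg(&wg, workerFn, .{ &counter, i });
    }
    // Block until every spawned task has run; this thread helps with the work.
    pool.waitAndWork(&wg);
    return counter.load(.monotonic);
}

pub fn main() !void {
    // Assumption: DebugAllocator stands in for "a different allocator".
    var gpa: std.heap.DebugAllocator(.{}) = .init;
    defer _ = gpa.deinit();

    const done = try runTasks(gpa.allocator(), 10_000);
    std.debug.print("completed {d} tasks\n", .{done});
}
```

Because the allocator is passed in as a parameter, the same runTasks can be called with std.heap.page_allocator to compare behavior.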

aikawayataro, May 16 '25 11:05

vm.max_map_count = 65530
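
To compare a running process against this limit, the number of mappings it currently holds can be approximated by counting the lines of /proc/self/maps. The following is a rough, Linux-only sketch assuming Zig 0.14.0; countMappings and readMaxMapCount are hypothetical helper names, and the line count is only an approximation of what the kernel counts toward the limit.

```zig
const std = @import("std");

/// Approximates the number of mappings of the current process by counting
/// the lines of /proc/self/maps (one line per mapping). Linux only.
fn countMappings(allocator: std.mem.Allocator) !usize {
    const maps = try std.fs.cwd().readFileAlloc(allocator, "/proc/self/maps", 64 * 1024 * 1024);
    defer allocator.free(maps);
    return std.mem.count(u8, maps, "\n");
}

/// Reads the system-wide per-process mapping limit (vm.max_map_count).
fn readMaxMapCount(allocator: std.mem.Allocator) !usize {
    const raw = try std.fs.cwd().readFileAlloc(allocator, "/proc/sys/vm/max_map_count", 64);
    defer allocator.free(raw);
    return std.fmt.parseInt(usize, std.mem.trim(u8, raw, " \n"), 10);
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const used = try countMappings(allocator);
    const limit = try readMaxMapCount(allocator);
    std.debug.print("mappings: {d} / {d}\n", .{ used, limit });
}
```

Calling countMappings periodically while tasks are being spawned would show whether the count approaches the limit before the crash.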

I'm not 100% sure the example reproduces the problem I encounter in my project. There, I get crashes with:

signal SIGSEGV: invalid address (fault address: 0x0)

The run_queue's first pointer is an invalid address.

I can't provide a condensed example for this yet; I'm trying to create a minimal reproducing example:

pub const PageDirectory = struct {
    pub fn create(allocator: Allocator, fm: *FileManager) !*PageDirectory {
        ...
        // Warm Up
        try self.warm_up_pool.init(.{ .allocator = allocator, .n_jobs = 4 });
        self.warm_up_latch = std.Thread.Mutex{};
        ...
        return self;
    }
    ...
    fn warm_page(self: *PageDirectory, page_id: PageId) void {
        const slot_idx = self.page_map.get(page_id);
        if (slot_idx == null) {
            if (self.load_page(page_id, AccessMode.Read)) |hdl| {
                hdl.latch.unlockShared();
            } else |e| {
                std.log.err("error loading page for warm up , err={}", .{e});
            }
        }
    }


    pub fn warm_up_page(self: *PageDirectory, page_id: PageId) void {
        self.warm_up_latch.lock();
        defer self.warm_up_latch.unlock();
        self.warm_up_pool.spawn(warm_page, .{ self, page_id }) catch |e| {
            std.log.err("error spawning warm up job, err={}", .{e});
        };
    }
    ...
}

H4kor, May 16 '25 13:05