
`GeneralPurposeAllocator` reports `error.OutOfMemory` despite having more than enough free memory.

Open IntegratedQuantum opened this issue 2 years ago

Zig Version

0.12.0-dev.2150+63de8a598 (linux)

Steps to Reproduce and Observed Behavior

I observed that my application would sometimes crash with an OutOfMemory error. This didn't make any sense to me. In total my system was reporting just around 40% memory usage, and additionally the error would happen during a phase where a lot of memory was freed. There were also no large allocations happening and the allocation where it crashed was a mere 24 bytes.

I could trace the error back to the mmap call in the page allocator. From the Linux man page for mmap(2), one of the causes of ENOMEM is this:

       ENOMEM The process's maximum number of mappings would have been
              exceeded.  This error can also occur for munmap(), when
              unmapping a region in the middle of an existing mapping,
              since this results in two smaller mappings on either side
              of the region being unmapped.
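The limit in question is the vm.max_map_count sysctl (65530 by default on many distributions). Since it lives in plain procfs, it can be checked programmatically; a small C sketch (the helper name is illustrative):

```c
#include <stdio.h>

/* Reads Linux's per-process mapping limit from procfs.
 * Returns -1 if the file is unavailable (non-Linux, restricted /proc). */
static long max_map_count(void) {
    long v = -1;
    FILE *f = fopen("/proc/sys/vm/max_map_count", "r");
    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}
```

A process whose VMA count approaches this value will start seeing ENOMEM from mmap() and, as below, from munmap().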

This gave me some clues to build a simple reproduction:

const std = @import("std");

var global_gpa = std.heap.GeneralPurposeAllocator(.{.thread_safe=true}){};
const allocator = global_gpa.allocator();
var allocations: [200000][]u8 = undefined;

pub fn main() !void {
	for(0..allocations.len) |i| {
		allocations[i] = try allocator.alloc(u8, 8192);
	}
	std.log.err("Allocations done", .{});
	for(0..allocations.len) |i| { // Freeing every second allocation, to maximize the number of individual mappings
		if(i % 2 == 0) {
			allocator.free(allocations[i]);
		}
	}
	_ = try allocator.alloc(u8, 1); // Allocating anything causes OutOfMemory
}

Output:

$ zig run test.zig
info: Allocations done
thread 39168 panic: reached unreachable code
Unwind error at address `exe:0x1061799` (error.OutOfMemory), trace may be incomplete

Unable to dump stack trace: OutOfMemory
Aborted (core dumped)
$ zig run test.zig -OReleaseFast
error: Allocations done
error: OutOfMemory

Expected Behavior

From a general-purpose allocator I expect to be able to fully use the memory the system can provide (minus internal fragmentation, of course). The c_allocator doesn't have this problem, presumably because it doesn't unmap pages as aggressively as the GPA.

IntegratedQuantum avatar Feb 01 '24 16:02 IntegratedQuantum

This replicates the issue directly on std.heap.page_allocator:

const std = @import("std");
const page_allocator = std.heap.page_allocator;

var pages: [128 * 1024]*[4096]u8 = undefined;

pub fn main() !void {
    for (&pages) |*page| {
        page.* = try page_allocator.create([4096]u8);
    }

    for (pages, 0..) |page, i| {
        if (i & 1 == 0) {
                continue;
        }
        page_allocator.destroy(page);
    }
}

(The exact number of pages may need to be tweaked based on your system configuration, in particular vm.max_map_count.)

The problem is with Zig's posix implementation of munmap():

/// Deletes the mappings for the specified address range, causing
/// further references to addresses within the range to generate invalid memory references.
/// Note that while POSIX allows unmapping a region in the middle of an existing mapping,
/// Zig's munmap function does not, for two reasons:
/// * It violates the Zig principle that resource deallocation must succeed.
/// * The Windows function, VirtualFree, has this restriction.
pub fn munmap(memory: []align(mem.page_size) const u8) void {
    switch (errno(system.munmap(memory.ptr, memory.len))) {
        .SUCCESS => return,
        .INVAL => unreachable, // Invalid parameters.
        .NOMEM => unreachable, // Attempted to unmap a region in the middle of an existing mapping.
        else => unreachable,
    }
}

The documentation clearly suggests a model where each call to mmap() creates a new mapping, and so as long as you call munmap() with the same bounds as each mmap(), it cannot fail. Unfortunately, this is incorrect, at least on Linux: when allocating anonymous memory with mmap(), the kernel tries to allocate a region of address space adjacent to an existing mapping, and will opportunistically merge with that mapping wherever possible. The result is that most calls to mmap() only extend an existing mapping rather than create a new one, and thus most paired calls to munmap() are in fact unmapping part of a mapping.

Unfortunately this means that, in general, memory deallocation on Linux can fail, and this needs to be worked around in userspace. I believe the usual approach is to keep track of regions that you've failed to unmap so you can coalesce them with new unmap requests, until you either reach a big enough region that it can be unmapped without splitting a mapping, or unrelated unmap requests get the system away from the vm.max_map_count limit.

The problem, of course, is that coalescing is only robust as a solution if regions are being coalesced in a single place for the whole process, and not in separate places for Zig, for libc's allocator, and for whatever other libraries are in play that might be directly performing munmap() calls (and probably just permanently leaking their regions if an error occurs).
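The coalescing approach could be sketched roughly like this in C (names are illustrative; a stand-in size-threshold policy replaces the real munmap()-and-check-ENOMEM call so the behaviour is deterministic, and a real version would need locking and a better data structure):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64

typedef struct { uintptr_t start, end; } Region;   /* half-open [start, end) */
static Region pending[MAX_PENDING];                /* regions we failed to unmap */
static size_t npending = 0;

/* Stand-in for munmap(): pretend the kernel only accepts sufficiently large
 * unmaps. A real version would call munmap() and treat ENOMEM as failure. */
static size_t min_unmap_bytes = 3 * 4096;
static bool try_unmap(uintptr_t start, uintptr_t end) {
    return end - start >= min_unmap_bytes;
}

/* Free [start, end): absorb any pending regions that touch it, then retry the
 * (possibly larger) combined range; park it again if the kernel still refuses. */
static void release(uintptr_t start, uintptr_t end) {
    for (size_t i = 0; i < npending; ) {
        if (pending[i].end >= start && pending[i].start <= end) {
            if (pending[i].start < start) start = pending[i].start;
            if (pending[i].end > end) end = pending[i].end;
            pending[i] = pending[--npending];      /* swap-remove */
        } else {
            i++;
        }
    }
    if (try_unmap(start, end))
        return;                                    /* kernel took it back */
    assert(npending < MAX_PENDING);
    pending[npending++] = (Region){start, end};    /* park it for later */
}
```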

klkblake avatar Mar 29 '24 10:03 klkblake

@klkblake thank you for this breakdown and analysis. This is maddening, and I will need to go through the 5 stages of grief before suggesting a course of action.

andrewrk avatar Aug 16 '24 07:08 andrewrk

It's worth noting that madvise(MADV_DONTNEED_LOCKED) will discard the underlying memory backing the region passed to it, without affecting the VMA, and so will succeed as long as the arguments are aligned correctly and the memory is neither VM_PFNMAP (special kernel memory, e.g. the perf event ringbuffer, possibly io_uring structures?) nor sealed read-only (sealed memory can never be munmap()ed, so it's irrelevant here). (It will also return ENOMEM if not all of the given range is mapped, but that's informational and does not prevent it from working on the memory that is there.)

So, for any ordinary allocation or mmap()ed file, we can reliably discard the memory backing the region even if we fail to munmap() it. I'm not sure how much it helps to trade being unable to free memory generally for being unable to reliably free virtual address space (and the associated memory used for page tables, etc.), since you still need to either report errors on deallocation or have complex handling. It at least helps prevent out-of-memory conditions caused by coalescing holding onto large amounts of memory.

(MADV_DONTNEED_LOCKED is identical to the older MADV_DONTNEED, except the former works on pages locked by mlock() (and possibly other similar things?), which seems like the right behaviour in a context where you were trying to unmap the region anyway. MADV_FREE also exists, but somewhat counterintuitively, does not free memory immediately, instead marking the pages as reclaimable by the kernel whenever it feels like it, with this mark being cleared if the memory is written to before then. (MADV_FREE does not reduce the reported RSS of the process, for some reason, until the reclaim actually happens). MADV_FREE also only works on private anonymous mappings.)

klkblake avatar Dec 04 '24 04:12 klkblake

That's good to know we can still release the memory via madvise. So the question then becomes what to do about the unbounded growth of zombie virtual address space?

Since the kernel is unable to guarantee it can track freed virtual address space, it seems like we could implement this ourselves in user space. A relatively simple thread-safe "free list" consulted on corresponding calls to mmap seems like it could work. You can even use some of the "unfreeable pages" to back this list. It won't solve it for code that invokes mmap via libc or the syscall directly, but it will work in the Zig universe.

However, I wonder if even this is overkill...assuming we can still free the memory via madvise, are we ever at risk of exhausting the virtual address space? Surely not on 64 bit systems?

marler8997 avatar Dec 04 '24 05:12 marler8997

I'd be surprised if you could cause 64-bit address space exhaustion from this naturally; I'd expect you'd only really see it in programs deliberately trying to trigger it. For 32-bit, however, I'd expect it's a much more realistic concern -- 4GB of address space just isn't very much to work with. That said, it's not just address space exhaustion that is a concern -- page tables and the kernel data structures tracking pages take memory themselves. It's not necessarily a huge amount, but if it builds up over time as more memory gets allocated and can't be unmapped, a program that theoretically ought to have relatively low, bounded memory usage may see its actual usage grow arbitrarily large.

In practice, for normal programs, if you coalesce regions that have been freed but can't be unmapped, try to unmap them when you can, and use them for new allocations when possible, things will mostly just work, which is presumably why the issue persists. It's just that this isn't reliable; it mitigates the issue, but you can construct examples that will defeat any approach like this.

It may be worth considering the case of a minimal setting of vm.max_map_count, that permits us to have our program, the stack, and exactly one mmap() mapping. In a situation like this, we can only map and unmap memory on the edges of our mapping, never in the middle. Functionally, in this context, mmap()/munmap() is a stack allocator, not a GPA, and the problems we have in this setting are mostly the same issues you have if you try to treat an ordinary stack allocator like a GPA. A program in this situation is fine if it's using a stack allocator for memory management -- the issue only arises when we are trying to provide a general purpose allocator layered over a system that fundamentally isn't one.

(Obviously such a setting for vm.max_map_count is unreasonable and unlikely to occur, but in practice the above scenario is what you're in if e.g. a previous phase of your program has driven you up to the map limit with allocations that won't be freed until a later phase.)

I don't know what the right approach is here, but thinking about this example makes me think it might look like splitting virtual address space allocation from memory allocation. The current approach treats the allocation and deallocation of virtual address space as an implementation detail handled implicitly by the memory allocator you use, and the problem arises from that implementation detail leaking out. And the whole approach with coalescing and reusing address space is nearly identical to a memory allocator in implementation (indeed, writing to any of the discarded pages will cause memory to be attached). So maybe you have a virtual address space allocator that handles unused regions we can't unmap yet, with memory allocators running on top of it, and deallocation of memory is very clearly not the same thing as deallocation of virtual memory?

This doesn't actually solve the problem of deallocating virtual address space being something that can fail (I don't think this can be properly solved without convincing the Linux kernel devs to change things), but it at least pulls the badness out into its own place and brings it to the programmer's attention as something they have to choose a policy about how to handle, and lets us have memory deallocation always succeed even if the virtual address space that may have been used for that is only opportunistically freed.

klkblake avatar Dec 04 '24 08:12 klkblake

I had hoped that it would be fixed after the debug allocator changes (as far as I've seen the size of the memory mappings got increased?) in 0.14.0, but it still breaks in my game, even with the SmpAllocator.

IntegratedQuantum avatar Mar 05 '25 21:03 IntegratedQuantum