Add initial support for rsvd accounting hugetlb cgroup

Open odinuge opened this issue 5 years ago • 1 comments

The previous non-rsvd max/limit_in_bytes does not account for reserved huge page memory, making it possible for a processes to reserve all the huge page memory, without being able to allocate it (due to cgroup restrictions).

In practice this makes it possible to successfully mmap more huge page memory than allowed via the cgroup settings, but when using the memory the process will get a SIGBUS and crash. This is bad for applications trying to mmap at startup (and it succeeds), but the program crashes when starting to use the memory. eg. postgres is doing this by default.

This has lead to strange segfaults like these: https://github.com/zalando/patroni/issues/1393

More info can be found here: https://lkml.org/lkml/2020/2/3/1153

In order to solve this, I think we have to main ways to do it:

Add writes (when supported) to rsvd for the current hugepageLimits found here. Silently ignore when rsvd is not supported.
Add another element called something like hugepageLimitsRsvd to enforce the rsvd. value, silently fail or return error when rsvd is not supported.

I lean toward the first approach, since adding a new item makes it harder to understand, and may lead into "bad" implementations, but am an not sure at all. The pro for the last one, for having a separate entity is that it is then up to the user of the runtime to decide, giving the "user" a full choice, even tho. i see no real reason to enforce the "old" value and not the reserved one. The current behavior makes a cgroup limited process able to reserve all the huge page memory available on a node, making it inaccessible to others.

No matter the decition, we should then update the config-linux.md docs to clarify how it should work.

Any thoughts?

Simple WIP in runc to add support for enforcing it using the hugepageLimits is here: https://github.com/opencontainers/runc/pull/2360/files

Jun 07 '20 13:06 odinuge

@odinuge I think it's reasonable to go w/ way#1 since

in general reservation limits are superior to page fault limits as the limits are enforced at reservation time and never causes the application to get SIGBUS signal if the memory was reserved before hand which is more friendly to applications, so it seems fine to default it to reserved limits and fallback when unsupported;
it reduces implementation efforts and risks by not introducing a new field/struct.

There might be concerns for compatibility. Picking this up by firstly drafting an update on the spec. Let's see if any feedbacks from the community. Any comments are much appreciated, thanks!

Aug 06 '21 09:08 kailun-qin