sbd cgroup-v2 considerations

prevent possible lockup when the format in /proc changes
properly get and handle scheduler policy & priority
recognize cgroup-v2 and try to handle it similarly
if switching to SCHED_RR fails, push priority to the max within SCHED_OTHER

Just as a preview ... Needs splitting probably. And the cgroup-v2 stuff is ugly:

scanning /proc/sched_debug seems to be the only easy way to find out whether CONFIG_RT_GROUP_SCHED is enabled with cgroup-v2
currently (as of kernel 5.4.20) there is no hierarchical rt-budget, so sbd moves to the root-slice in all cases, with all consequences
when moving to the root-slice journal stops working
auto and yes for SBD_MOVE_TO_ROOT_CGROUP behave the same
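
To illustrate the /proc/sched_debug scan mentioned above, here is a minimal sketch in Python (not the C code from this PR). It rests on an assumption about the file format: that with CONFIG_RT_GROUP_SCHED the per-CPU `rt_rq[N]:` header lines carry a task-group path such as `/`, and without it they end at the colon. The function names are hypothetical.

```python
import re

# ASSUMPTION: with CONFIG_RT_GROUP_SCHED the rt runqueue headers look like
# "rt_rq[0]:/" (group path after the colon); without it, just "rt_rq[0]:".
RT_GROUP_HEADER = re.compile(r"^rt_rq\[\d+\]:/")


def rt_group_sched_visible(sched_debug_text):
    """Return True if any rt_rq header line carries a task-group path."""
    return any(RT_GROUP_HEADER.match(line.strip())
               for line in sched_debug_text.splitlines())


def rt_group_sched_enabled(path="/proc/sched_debug"):
    """Return True/False, or None if the file cannot be read at all."""
    try:
        with open(path, "r") as f:
            return rt_group_sched_visible(f.read())
    except OSError:
        return None
```

Returning None for an unreadable file keeps "could not find out" separate from "not enabled", so a caller can choose a conservative default.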

Feb 27 '20 18:02 wenningerk

Code-wise it looks reasonable, though I'm not familiar with either cgroup implementation and didn't do any testing. Spelling: "budged" in a couple of places.

It's probably worthwhile to comment, either in the sysconfig file or the code, the conditions under which cgroup v2 will be effective. I.e. what kernel version made it available and what has to be done to switch to it, and how a user could tell what an existing system uses.

Feb 27 '20 23:02 kgaillot

> It's probably worthwhile to comment, either in the sysconfig file or the code, the conditions under which cgroup v2 will be effective. I.e. what kernel version made it available and what has to be done to switch to it, and how a user could tell what an existing system uses.

I tried to be a bit more descriptive in the comment before the code that actually does the check. As cgroup-v2 has been in the kernel for a while and both versions can be configured, I guess going into which kernel versions provide some version of cgroup-v2 doesn't make much sense. Fedora 31 seems to be the first distribution using cgroup-v2 by default, and although it should be possible, I didn't play with switching back and forth. Asking for trouble, probably. The effort here is more about living with it if it is there. Even with cgroup-v2 enabled, as in Fedora 31, the approaches used up to now shouldn't run into issues as long as CONFIG_RT_GROUP_SCHED isn't enabled, since moving to the root-slice is then not needed. Both sbd and corosync will first check for /sys/fs/cgroup/cpu/cpu.rt_runtime_us, find it missing, and be happy. To play with this, an otherwise standard Fedora 31 kernel with CONFIG_RT_GROUP_SCHED enabled can be found under https://koji.fedoraproject.org/koji/taskinfo?taskID=41654832 (I don't know when it will be cleaned up).
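
As a rough answer to "how a user could tell what an existing system uses", here is a minimal Python sketch (not sbd code). It classifies the setup from /proc/mounts-style text and separately checks for the cgroup-v1 rt-budget file named above. The function names and the "v1"/"v2"/"none" labels are hypothetical, and a hybrid setup is lumped in with "v1" for simplicity.

```python
import os


def cgroup_layout(mounts_text, root="/sys/fs/cgroup"):
    """Classify the cgroup setup from /proc/mounts-style text."""
    fstypes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            fstypes[fields[1]] = fields[2]  # mount point -> filesystem type
    if fstypes.get(root) == "cgroup2":
        return "v2"    # unified hierarchy mounted at the root
    if "cgroup" in fstypes.values():
        return "v1"    # legacy (or hybrid) controller mounts
    return "none"


def legacy_rt_budget_present(path="/sys/fs/cgroup/cpu/cpu.rt_runtime_us"):
    """The cgroup-v1 check: does the rt-budget file exist?"""
    return os.path.exists(path)
```

On a live system the text would come from reading /proc/mounts, for example `cgroup_layout(open("/proc/mounts").read())`.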

Feb 28 '20 07:02 wenningerk

Looks reasonable (a bit scary though), but I have a question. What do you mean by "when moving to the root-slice journal stops working"? Is it logging to journald or some other journal (sbd, fs, ...)?

Feb 28 '20 07:02 jfriesse

> Looks reasonable (a bit scary though), but I have a question. What do you mean by "when moving to the root-slice journal stops working"? Is it logging to journald or some sbd journal?

Logging stops working, unfortunately. If it were something sbd-internal I would have tried to make it work ;-) No idea if it is just that (bad enough, but we would have logging in a file as well) or if there are other issues. Anyway, stopping via the cgroup probably doesn't work with all that root-slice switching, which is why I try to prevent it whenever possible.

Feb 28 '20 07:02 wenningerk

> Looks reasonable (a bit scary though), but I have a question. What do you mean by "when moving to the root-slice journal stops working"? Is it logging to journald or some sbd journal?

> Logging stops working, unfortunately. If it were something sbd-internal I would have tried to make it work ;-) No idea if it is just that (bad enough, but we would have logging in a file as well) or if there are other issues. Anyway, stopping via the cgroup probably doesn't work with all that root-slice switching, which is why I try to prevent it whenever possible.

Ok, thanks for the info.

Feb 28 '20 07:02 jfriesse

Cherry-picked the travis-config changes needed for mock 2.0 (updated in Fedora 31), as they are not really related to the topic of this PR. Split off the scheduler-config stuff that isn't actually cgroup-v2 related. I guess it should be OK to cherry-pick that into master as well, since it should fix a possible hang when the /proc content changes with some kernel version, and it makes the behavior more similar to what corosync does (falling back to raising the priority to the max within SCHED_OTHER if the switch to SCHED_RR fails).
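
The fallback described above can be sketched roughly as follows, in Python rather than the C used by sbd. The function names are hypothetical, and using the maximum SCHED_RR priority is an assumption made for illustration, not necessarily the value sbd picks.

```python
import os


def raise_priority(try_rr, try_nice):
    """Prefer SCHED_RR; if that fails, go to max priority in SCHED_OTHER."""
    try:
        try_rr()
        return "SCHED_RR"
    except OSError:
        pass  # e.g. no permission or no rt-budget in the current cgroup
    try:
        try_nice()
        return "SCHED_OTHER"
    except OSError:
        return None  # neither attempt succeeded


def set_sched_rr():
    # ASSUMPTION: the highest SCHED_RR priority is used here for illustration.
    prio = os.sched_get_priority_max(os.SCHED_RR)
    os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(prio))


def set_max_nice():
    # A nice value of -20 is the highest priority within SCHED_OTHER.
    os.setpriority(os.PRIO_PROCESS, 0, -20)
```

A caller would use `raise_priority(set_sched_rr, set_max_nice)` and log the returned policy name. Passing the two steps in as callables keeps the fallback logic testable without root privileges.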

Mar 02 '20 13:03 wenningerk