build_runner: add --skip-oom-steps
Based on this discussion: https://discord.com/channels/605571803288698900/785499283368706060/1100143149101371502
The idea is to allow the CI to skip jobs that have a max_rss that would exceed the total memory of the machine.
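For illustration, the skip decision described above could be sketched roughly as follows. This is not the build_runner code itself; `Step`, `partition_steps`, and the byte figures are hypothetical names and values chosen only to show the comparison of a declared max_rss against total machine memory.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    max_rss: int  # declared upper bound on peak memory, in bytes (hypothetical field)


def partition_steps(steps, total_memory):
    """Split steps into (runnable, skipped) by comparing max_rss to total memory."""
    runnable, skipped = [], []
    for step in steps:
        # A step whose declared bound exceeds the machine's memory is skipped
        # instead of being allowed to run out of memory.
        if step.max_rss > total_memory:
            skipped.append(step)
        else:
            runnable.append(step)
    return runnable, skipped


# Made-up steps on a made-up 16 GiB machine.
steps = [Step("small", 512 * 1024**2), Step("huge", 64 * 1024**3)]
runnable, skipped = partition_steps(steps, total_memory=16 * 1024**3)
print([s.name for s in runnable], [s.name for s in skipped])
```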
Example output:

Using --skip-steps=condition,.. and letting users provide their own skip conditions sounds more future-proof to me, and would enable users to define their own skip logic in the build system.
This would go in the direction of https://github.com/ziglang/zig/pull/11472#pullrequestreview-974637748 and could be tailored via a "project- + step-local skip name", with the CLI establishing the user mapping in the build script to "project-local".
Example: A user may want to define their own OOM skip logic, for example to keep a reserved memory quota.
- The number of skipped tests should also be reported in the summary, if a summary is requested, because skipping is dynamic and possibly unexpected system behavior.
- There should be an easy, machine-readable way (i.e. from a shell etc.) to find out whether an OOM was prevented, for diagnosis, because it is dynamic behavior. Is the -fsummary format usable for this? (This could be deferred until use cases come up and it is added as user-customizable result output to the server.)
Using --skip-steps=condition,.. and letting users provide their own skip conditions sounds more future-proof to me, and would enable users to define their own skip logic in the build system. This would go in the direction of #11472 (review) and could be tailored via a "project- + step-local skip name", with the CLI establishing the user mapping in the build script to "project-local".
Do you have an example of another condition that would make sense as a build_runner argument, vs the user implementing this custom logic in their Step itself?
Example: A user may want to define their own OOM skip logic, for example to keep a reserved memory quota.
This should be possible with the --maxrss argument to the build_runner already?
1. The number of skipped tests should also be reported in the summary, if a summary is requested, because skipping is dynamic and possibly unexpected system behavior.
This is already shown. Do you mean a separate total for tests skipped due to --skip-oom-steps?
2. There should be an easy, machine-readable way (i.e. from a shell etc.) to find out whether an OOM was prevented, for diagnosis, because it is dynamic behavior. Is the -fsummary format usable for this? (This could be deferred until use cases come up and it is added as user-customizable result output to the server.)
A consumer of -fsummary could grep for skipped (not enough memory) to discover this (vs just grepping for skipped).
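A rough sketch of what such a consumer might do, assuming the summary is available as plain text. Only the marker string "skipped (not enough memory)" comes from this thread; the sample lines and the `count_skips` helper are invented for illustration and do not reflect the real -fsummary layout.

```python
OOM_MARKER = "skipped (not enough memory)"


def count_skips(summary_text):
    """Return (all skipped lines, lines skipped for lack of memory)."""
    total = 0
    oom = 0
    for line in summary_text.splitlines():
        if "skipped" in line:
            total += 1
            # The more specific marker separates OOM skips from ordinary skips.
            if OOM_MARKER in line:
                oom += 1
    return total, oom


# Invented sample text, not real build_runner output.
sample = """\
step-a success
step-b skipped
step-c skipped (not enough memory)
"""
total, oom = count_skips(sample)
print(total, oom)
```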
The failure from the CI looks related to these changes:
zig build-exe std-native-Debug-libc-cbe Debug native: error: memory usage peaked at 9985744896 bytes, exceeding the declared upper bound of 9126805504
Do you have an example of another condition that would make sense as a build_runner argument, vs the user implementing this custom logic in their Step itself?
Nothing concrete, so I think YAGNI applies.
A consumer of -fsummary could grep for skipped (not enough memory) to discover this (vs just grepping for skipped).
Consider: 1. the user does not have sufficient memory on the system due to memory bursts and heavy tasks (compiling glibc for 20+ targets, for example); 2. the user wants to work on a thing which looks innocent but introduces an OOM; 3. due to a false quota it never gets run in CI; 4. a broken commit lands on master (the CI for that code path also deals with memory burst behavior, because of saving money etc.); 5. (being overly dramatic here) production failure.
Also "* Reduce the amount one must remember". I don't want to remember looking for stuff with grep.
This should be possible with the --maxrss argument to the build_runner already?
Same YAGNI, my bad.
I increased the specified max_rss for that test to 10% more than the highest amount observed in the CI failures.
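As a worked example of that arithmetic: the thread does not state the new value, so this assumes the peak from the error message quoted earlier (9985744896 bytes) is the highest observed, and `with_headroom` is a hypothetical helper.

```python
def with_headroom(observed_peak_bytes, percent=10):
    """Return the observed peak plus `percent` headroom, rounded down to whole bytes."""
    return observed_peak_bytes * (100 + percent) // 100


observed = 9985744896   # peak reported in the CI error message
old_bound = 9126805504  # previously declared upper bound from the same message
new_bound = with_headroom(observed)
print(new_bound)
```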
Main test suite...
zig build-exe test Debug native: error: error: unable to open 'zig-local-cache\o\c6355d1e884e1becc8cd3c1b7fcb3bc5': FileNotFound
zig build-exe test Debug native: error: the following command exited with error code 1:
C:\actions-runner1\_work\zig\zig\build-release\stage3-release\bin\zig.exe build-exe C:\actions-runner1\_work\zig\zig\build-release\zig-local-cache\o\c6355d1e884e1becc8cd3c1b7fcb3bc5\source.zig --cache-dir C:\actions-runner1\_work\zig\zig\build-release\zig-local-cache --global-cache-dir C:\actions-runner1\_work\zig\zig\build-release\zig-global-cache --name test -L C:\actions-runner1\_work\zig\zig\..\zig+llvm+lld+clang-aarch64-windows-gnu-0.11.0-dev.1869+df4cfc2ec\lib -I C:\actions-runner1\_work\zig\zig\..\zig+llvm+lld+clang-aarch64-windows-gnu-0.11.0-dev.1869+df4cfc2ec\include --zig-lib-dir C:\actions-runner1\_work\zig\zig\lib --listen=-
zig build-exe test ReleaseFast native: error: error: unable to open 'zig-local-cache\o\c6355d1e884e1becc8cd3c1b7fcb3bc5': FileNotFound
zig build-exe test ReleaseFast native: error: the following command exited with error code 1:
C:\actions-runner1\_work\zig\zig\build-release\stage3-release\bin\zig.exe build-exe C:\actions-runner1\_work\zig\zig\build-release\zig-local-cache\o\c6355d1e884e1becc8cd3c1b7fcb3bc5\source.zig -OReleaseFast --cache-dir C:\actions-runner1\_work\zig\zig\build-release\zig-local-cache --global-cache-dir C:\actions-runner1\_work\zig\zig\build-release\zig-global-cache --name test -L C:\actions-runner1\_work\zig\zig\..\zig+llvm+lld+clang-aarch64-windows-gnu-0.11.0-dev.1869+df4cfc2ec\lib -I C:\actions-runner1\_work\zig\zig\..\zig+llvm+lld+clang-aarch64-windows-gnu-0.11.0-dev.1869+df4cfc2ec\include --zig-lib-dir C:\actions-runner1\_work\zig\zig\lib --listen=-
zig build-exe test ReleaseSmall native: error: error: unable to open 'zig-local-cache\o\c6355d1e884e1becc8cd3c1b7fcb3bc5': FileNotFound
zig build-exe test ReleaseSmall native: error: the following command exited with error code 1:
C:\actions-runner1\_work\zig\zig\build-release\stage3-release\bin\zig.exe build-exe C:\actions-runner1\_work\zig\zig\build-release\zig-local-cache\o\c6355d1e884e1becc8cd3c1b7fcb3bc5\source.zig -OReleaseSmall --cache-dir C:\actions-runner1\_work\zig\zig\build-release\zig-local-cache --global-cache-dir C:\actions-runner1\_work\zig\zig\build-release\zig-global-cache --name test -L C:\actions-runner1\_work\zig\zig\..\zig+llvm+lld+clang-aarch64-windows-gnu-0.11.0-dev.1869+df4cfc2ec\lib -I C:\actions-runner1\_work\zig\zig\..\zig+llvm+lld+clang-aarch64-windows-gnu-0.11.0-dev.1869+df4cfc2ec\include --zig-lib-dir C:\actions-runner1\_work\zig\zig\lib --listen=-
Build Summary: 2730/2884 steps succeeded; 143 skipped; 3 failed; 52440/55298 tests passed; 2858 skipped (disable with -fno-summary)
test transitive failure
Interesting CI failure
Not sure if something similar applies on Windows: https://github.com/ziglang/zig/issues/14815#issuecomment-1517062282 and/or FlushFileBuffers is needed. See also https://learn.microsoft.com/en-us/answers/questions/606529/do-windows-flush-commands-flush-the-disk-write-cac
Noticed that one of the jobs had timed out - rebased.
Rebased, ready for review.