build_runner: add --skip-oom-steps

Open kcbanner opened this issue 2 years ago • 8 comments

Based on this discussion: https://discord.com/channels/605571803288698900/785499283368706060/1100143149101371502

The idea is to allow the CI to skip jobs that have a max_rss that would exceed the total memory of the machine.

Example output:

[image: screenshot of the example output]
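The skip decision described above can be sketched as follows. This is a minimal illustration in Python, not the build runner's actual Zig code: the `Step` class, `partition_steps`, and the byte values are hypothetical, and only the rule itself (skip a step whose declared max_rss exceeds the machine's total memory) comes from the description.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical stand-in for a build step."""
    name: str
    max_rss: int  # declared upper bound on peak memory, in bytes (0 = none declared)


def partition_steps(steps, total_memory):
    """Split steps into (runnable, skipped) by comparing max_rss to total_memory."""
    runnable, skipped = [], []
    for step in steps:
        # A step is skipped only when its declared bound cannot fit on this machine.
        if step.max_rss > total_memory:
            skipped.append(step)
        else:
            runnable.append(step)
    return runnable, skipped


if __name__ == "__main__":
    steps = [
        Step("small-test", max_rss=512 * 1024**2),
        Step("huge-test", max_rss=16 * 1024**3),
        Step("undeclared", max_rss=0),
    ]
    runnable, skipped = partition_steps(steps, total_memory=8 * 1024**3)
    print("run: ", [s.name for s in runnable])
    print("skip:", [s.name for s in skipped])
```

Steps that declare no bound are treated as runnable in this sketch, since there is nothing to compare against.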

kcbanner avatar Apr 28 '23 04:04 kcbanner

Using --skip-steps=condition,.. and letting users provide their own skip conditions sounds more future-proof to me, and would enable users to define their own skip logic in the build system. This would go in the direction of https://github.com/ziglang/zig/pull/11472#pullrequestreview-974637748 and could be tailored via a "project- + step-local skip name", with the CLI establishing the user mapping in the build script to "project-local".

Example: A user may want to define their own OOM-skip logic, for example to keep a reserved memory quota.

  1. The number of skipped tests should also be reported in the summary, if a summary is requested, because skipping is dynamic and possibly unexpected system behavior.
  2. There should be an easy, machine-readable way (e.g. from a shell) to find out whether an OOM was prevented, for diagnosis, because this is dynamic behavior. Is the -fsummary format usable for this? (This could be deferred until use cases come up and it is added as user-customizable result output to the server.)

matu3ba avatar Apr 28 '23 07:04 matu3ba

> Using --skip-steps=condition,.. and letting users provide their own skip conditions sounds more future-proof to me, and would enable users to define their own skip logic in the build system. This would go in the direction of #11472 (review) and could be tailored via a "project- + step-local skip name", with the CLI establishing the user mapping in the build script to "project-local".

Do you have an example of another condition that would make sense as a build_runner argument, vs the user implementing this custom logic in their Step itself?

> Example: A user may want to define their own OOM-skip logic, for example to keep a reserved memory quota.

This should be possible with the --maxrss argument to the build_runner already?

> 1. The number of skipped tests should also be reported in the summary, if a summary is requested, because skipping is dynamic and possibly unexpected system behavior.

This is already shown. Do you mean a separate total for tests skipped due to --skip-oom-steps?

> 2. There should be an easy, machine-readable way (e.g. from a shell) to find out whether an OOM was prevented, for diagnosis, because this is dynamic behavior. Is the -fsummary format usable for this? (This could be deferred until use cases come up and it is added as user-customizable result output to the server.)

A consumer of -fsummary could grep for skipped (not enough memory) to discover this (vs just grepping for skipped).
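As a rough sketch of that approach, a consumer could separate the two kinds of skips by matching the marker text. The `skipped (not enough memory)` string is taken from the sentence above; the sample lines and the `count_skips` name are made up for illustration and do not reproduce real -fsummary output.

```python
OOM_MARKER = "skipped (not enough memory)"


def count_skips(summary_text):
    """Return (oom_skips, other_skips) counted over lines that mention 'skipped'."""
    oom = other = 0
    for line in summary_text.splitlines():
        if OOM_MARKER in line:
            oom += 1
        elif "skipped" in line:
            other += 1
    return oom, other


if __name__ == "__main__":
    # Hypothetical sample lines, not real build runner output.
    sample = "\n".join([
        "step-a success",
        "step-b skipped",
        "step-c skipped (not enough memory)",
    ])
    print(count_skips(sample))
```

The check for the longer marker comes first so that an out-of-memory skip is never counted as an ordinary one.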

kcbanner avatar Apr 28 '23 14:04 kcbanner

The failure from the CI looks related to these changes:

zig build-exe std-native-Debug-libc-cbe Debug native: error: memory usage peaked at 9985744896 bytes, exceeding the declared upper bound of 9126805504

andrewrk avatar Apr 28 '23 18:04 andrewrk

> Do you have an example of another condition that would make sense as a build_runner argument, vs the user implementing this custom logic in their Step itself?

Nothing concrete, so I think YAGNI applies.

> A consumer of -fsummary could grep for skipped (not enough memory) to discover this (vs just grepping for skipped).

Consider: 1. the user does not have sufficient memory on the system due to memory bursts and heavy tasks (compiling glibc for 20+ targets, for example); 2. the user wants to work on something that looks innocent but introduces an OOM; 3. due to a false quota it never gets run in CI; 4. a broken commit lands on master (the CI for that code path also deals with memory burst behavior, because of saving money etc.); 5. (being overly dramatic here) production failure.

Also, "Reduce the amount one must remember": I don't want to have to remember to look for things with grep.

> This should be possible with the --maxrss argument to the build_runner already?

Same YAGNI, my bad.

matu3ba avatar Apr 28 '23 23:04 matu3ba

I increased the specified max_rss for that test to 10% more than the highest amount observed in the CI failures.
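For a sense of scale, the headroom calculation might look like the sketch below. It uses the peak from the failure quoted earlier in the thread (9985744896 bytes) purely as an example input; the thread does not state which observed value was actually the highest, and the `with_headroom` name is hypothetical.

```python
def with_headroom(observed_peak_bytes, percent=10):
    """Return the observed peak plus `percent` percent, using integer arithmetic."""
    return observed_peak_bytes + observed_peak_bytes * percent // 100


if __name__ == "__main__":
    old_bound = 9126805504  # declared upper bound from the CI error message
    peak = 9985744896       # peak usage reported in the same error message
    new_bound = with_headroom(peak)
    print(new_bound)
    # The new bound should sit above the observed peak, which exceeded the old bound.
    print(new_bound > peak > old_bound)
```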

kcbanner avatar Apr 29 '23 16:04 kcbanner

 Main test suite...
zig build-exe test Debug native: error: error: unable to open 'zig-local-cache\o\c6355d1e884e1becc8cd3c1b7fcb3bc5': FileNotFound

zig build-exe test Debug native: error: the following command exited with error code 1:
C:\actions-runner1\_work\zig\zig\build-release\stage3-release\bin\zig.exe build-exe C:\actions-runner1\_work\zig\zig\build-release\zig-local-cache\o\c6355d1e884e1becc8cd3c1b7fcb3bc5\source.zig --cache-dir C:\actions-runner1\_work\zig\zig\build-release\zig-local-cache --global-cache-dir C:\actions-runner1\_work\zig\zig\build-release\zig-global-cache --name test -L C:\actions-runner1\_work\zig\zig\..\zig+llvm+lld+clang-aarch64-windows-gnu-0.11.0-dev.1869+df4cfc2ec\lib -I C:\actions-runner1\_work\zig\zig\..\zig+llvm+lld+clang-aarch64-windows-gnu-0.11.0-dev.1869+df4cfc2ec\include --zig-lib-dir C:\actions-runner1\_work\zig\zig\lib --listen=- 
zig build-exe test ReleaseFast native: error: error: unable to open 'zig-local-cache\o\c6355d1e884e1becc8cd3c1b7fcb3bc5': FileNotFound

zig build-exe test ReleaseFast native: error: the following command exited with error code 1:
C:\actions-runner1\_work\zig\zig\build-release\stage3-release\bin\zig.exe build-exe C:\actions-runner1\_work\zig\zig\build-release\zig-local-cache\o\c6355d1e884e1becc8cd3c1b7fcb3bc5\source.zig -OReleaseFast --cache-dir C:\actions-runner1\_work\zig\zig\build-release\zig-local-cache --global-cache-dir C:\actions-runner1\_work\zig\zig\build-release\zig-global-cache --name test -L C:\actions-runner1\_work\zig\zig\..\zig+llvm+lld+clang-aarch64-windows-gnu-0.11.0-dev.1869+df4cfc2ec\lib -I C:\actions-runner1\_work\zig\zig\..\zig+llvm+lld+clang-aarch64-windows-gnu-0.11.0-dev.1869+df4cfc2ec\include --zig-lib-dir C:\actions-runner1\_work\zig\zig\lib --listen=- 
zig build-exe test ReleaseSmall native: error: error: unable to open 'zig-local-cache\o\c6355d1e884e1becc8cd3c1b7fcb3bc5': FileNotFound

zig build-exe test ReleaseSmall native: error: the following command exited with error code 1:
C:\actions-runner1\_work\zig\zig\build-release\stage3-release\bin\zig.exe build-exe C:\actions-runner1\_work\zig\zig\build-release\zig-local-cache\o\c6355d1e884e1becc8cd3c1b7fcb3bc5\source.zig -OReleaseSmall --cache-dir C:\actions-runner1\_work\zig\zig\build-release\zig-local-cache --global-cache-dir C:\actions-runner1\_work\zig\zig\build-release\zig-global-cache --name test -L C:\actions-runner1\_work\zig\zig\..\zig+llvm+lld+clang-aarch64-windows-gnu-0.11.0-dev.1869+df4cfc2ec\lib -I C:\actions-runner1\_work\zig\zig\..\zig+llvm+lld+clang-aarch64-windows-gnu-0.11.0-dev.1869+df4cfc2ec\include --zig-lib-dir C:\actions-runner1\_work\zig\zig\lib --listen=- 
Build Summary: 2730/2884 steps succeeded; 143 skipped; 3 failed; 52440/55298 tests passed; 2858 skipped (disable with -fno-summary)
test transitive failure

Interesting CI failure

kcbanner avatar Apr 29 '23 20:04 kcbanner

Not sure whether something similar applies on Windows (https://github.com/ziglang/zig/issues/14815#issuecomment-1517062282) and/or whether FlushFileBuffers is needed. See also https://learn.microsoft.com/en-us/answers/questions/606529/do-windows-flush-commands-flush-the-disk-write-cac

matu3ba avatar Apr 29 '23 20:04 matu3ba

Noticed that one of the jobs had timed out - rebased.

kcbanner avatar May 25 '23 14:05 kcbanner

Rebased, ready for review.

kcbanner avatar Aug 08 '23 14:08 kcbanner