Too many open files
I hit OSError: [Errno 24] Too many open files when running a large number of fairly short tests.
Total tests: 1326, each taking about 30 seconds. I'm doing pairwise network tests on a whole rack of new nodes, with max_jobs set to 20.
In the log I see:
[2021-10-22T13:19:37] verbose: reframe: ==> timings: setup: 0.013s compile: 0.017s run: 18.778s sanity: 0.015s performance: 0.024s total: 1625.969s
[2021-10-22T13:19:37] info: reframe: [ ^[[31m FAIL^[[0m ] (1015/1326) 2020a-gompi-osu-mpi-bear-pg0104u07b-bear-pg0104u09a on bluebear:icelake using none [compile: 0.016s run: 46.224s total: 1623.775s]
[2021-10-22T13:19:37] info: reframe: ==> test failed during 'sanity': test staged in '/rds/projects/2017/branfosj-rse/BEAR-git/reframe/stage/bluebear/icelake/none/2020a-gompi-osu-mpi-bear-pg0104u07b-bear-pg0104u09a'
[2021-10-22T13:19:37] verbose: reframe: ==> timings: setup: 0.013s compile: 0.016s run: 46.224s sanity: n/a performance: n/a total: 1623.775s
[2021-10-22T13:19:37] info: reframe: [ ^[[31m FAIL^[[0m ] (1016/1326) 2020a-gompi-osu-mpi-bear-pg0104u07b-bear-pg0104u10a on bluebear:icelake using none [compile: 0.018s run: 47.207s total: 1623.224s]
[2021-10-22T13:19:37] info: reframe: ==> test failed during 'sanity': test staged in '/rds/projects/2017/branfosj-rse/BEAR-git/reframe/stage/bluebear/icelake/none/2020a-gompi-osu-mpi-bear-pg0104u07b-bear-pg0104u10a'
[2021-10-22T13:19:37] verbose: reframe: ==> timings: setup: 0.012s compile: 0.018s run: 47.207s sanity: n/a performance: n/a total: 1623.224s
[2021-10-22T13:19:37] info: reframe: [ ^[[31m FAIL^[[0m ] (1017/1326) 2020a-gompi-osu-mpi-bear-pg0104u07b-bear-pg0104u11b on bluebear:icelake using none [compile: 0.019s run: 48.527s total: 1622.391s]
[2021-10-22T13:19:37] info: reframe: ==> test failed during 'sanity': test staged in '/rds/projects/2017/branfosj-rse/BEAR-git/reframe/stage/bluebear/icelake/none/2020a-gompi-osu-mpi-bear-pg0104u07b-bear-pg0104u11b'
[2021-10-22T13:19:37] verbose: reframe: ==> timings: setup: 0.012s compile: 0.019s run: 48.527s sanity: n/a performance: n/a total: 1622.391s
[2021-10-22T13:19:37] info: reframe: [ ^[[31m FAIL^[[0m ] (1018/1326) 2020a-gompi-osu-mpi-bear-pg0104u07a-bear-pg0104u23a on bluebear:icelake using none [compile: n/a run: n/a total: 0.013s]
[2021-10-22T13:19:37] info: reframe: ==> test failed during 'compile': test staged in '/rds/projects/2017/branfosj-rse/BEAR-git/reframe/stage/bluebear/icelake/none/2020a-gompi-osu-mpi-bear-pg0104u07a-bear-pg0104u23a'
[2021-10-22T13:19:37] verbose: reframe: ==> timings: setup: 0.013s compile: n/a run: n/a sanity: n/a performance: n/a total: 0.013s
[2021-10-22T13:19:37] info: reframe: [ ^[[31m FAIL^[[0m ] (1019/1326) 2020a-gompi-osu-mpi-bear-pg0104u07a-bear-pg0104u20b on bluebear:icelake using none [compile: n/a run: n/a total: 0.013s]
[2021-10-22T13:19:37] info: reframe: ==> test failed during 'compile': test staged in '/rds/projects/2017/branfosj-rse/BEAR-git/reframe/stage/bluebear/icelake/none/2020a-gompi-osu-mpi-bear-pg0104u07a-bear-pg0104u20b'
[2021-10-22T13:19:37] verbose: reframe: ==> timings: setup: 0.013s compile: n/a run: n/a sanity: n/a performance: n/a total: 0.013s
[2021-10-22T13:19:37] info: reframe: [ ^[[31m FAILED ^[[0m ] Ran 1019/1326 test case(s) from 1326 check(s) (5 failure(s), 0 skipped)
[2021-10-22T13:19:37] info: reframe: [==========] Finished on Fri Oct 22 13:19:37 2021
[2021-10-22T13:19:37] info: reframe: ==============================================================================
[2021-10-22T13:19:37] info: reframe: SUMMARY OF FAILURES
[2021-10-22T13:19:38] info: reframe: ------------------------------------------------------------------------------
So the run fails at that point and the remaining 307 tests (1326 - 1019) are not run.
It also fails with max_jobs set to 2000, which is more than the number of jobs being run. The first failure again occurs at the same test count (1015):
[2021-10-22T15:25:26] info: reframe: [ ^[[32m OK^[[0m ] (1014/1326) 2020a-gompi-osu-mpi-bear-pg0104u32a-bear-pg0104u34b on bluebear:icelake using none [compile: 0.007s run: 209.231s total: 209.285s]
[2021-10-22T15:25:26] verbose: reframe: ==> timings: setup: 0.032s compile: 0.007s run: 209.231s sanity: 0.020s performance: 0.022s total: 209.285s
[2021-10-22T15:25:26] info: reframe: [ ^[[31m FAIL^[[0m ] (1015/1326) 2020a-gompi-osu-mpi-bear-pg0104u26b-bear-pg0104u29a on bluebear:icelake using none [compile: 0.007s run: 229.943s total: 229.988s]
[2021-10-22T15:25:26] info: reframe: ==> test failed during 'sanity': test staged in '/rds/projects/2017/branfosj-rse/BEAR-git/reframe/stage/bluebear/icelake/none/2020a-gompi-osu-mpi-bear-pg0104u26b-bear-pg0104u29a'
Could you check what the open-file limits are on your test system?
@branfosj you can get it with ulimit -n. I suspect it's 1024.
@teojgo Yes, it is:
$ ulimit -n
1024
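For anyone wanting to check this from inside Python rather than the shell, here is a minimal sketch. It reads the same soft limit that ulimit -n reports, plus the hard limit, and counts the descriptors the current process has open. The function name fd_usage is made up for illustration, and the count relies on /proc/self/fd, so it assumes Linux; this is not something ReFrame provides.

```python
import os
import resource


def fd_usage():
    """Return (soft_limit, hard_limit, open_fds) for the current process.

    open_fds is None where /proc/self/fd is not available (non-Linux systems).
    """
    # RLIMIT_NOFILE is the limit that `ulimit -n` shows (soft) and
    # `ulimit -H -n` shows (hard).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Each entry in /proc/self/fd is one open descriptor. The count
        # includes the descriptor used to list the directory itself.
        open_fds = len(os.listdir('/proc/self/fd'))
    except OSError:
        open_fds = None
    return soft, hard, open_fds


if __name__ == '__main__':
    soft, hard, open_fds = fd_usage()
    print(f'soft limit: {soft}, hard limit: {hard}, open now: {open_fds}')
```

If the open count climbs steadily towards the soft limit while tests are running, that would be consistent with the failure appearing at roughly the same test number each time.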
@branfosj could you try increasing it? Check your hard limit with ulimit -H -n, then raise the soft limit up to that value with ulimit -n <new_limit>.
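The same adjustment can be sketched in Python with the standard resource module, for cases where the limit has to be raised from within a script rather than the launching shell. The helper name raise_nofile_limit is hypothetical, and the change only applies to the calling process and the children it starts afterwards, just like ulimit -n in a shell.

```python
import resource


def raise_nofile_limit(target):
    """Raise the soft open-file limit to `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards. The limit is never lowered.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        # Already unlimited, nothing to raise.
        return soft
    # An unprivileged process cannot exceed its hard limit.
    if hard == resource.RLIM_INFINITY:
        new_soft = target
    else:
        new_soft = min(target, hard)
    if new_soft <= soft:
        # The current limit is already at least as high as requested.
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft


if __name__ == '__main__':
    print('soft limit is now', raise_nofile_limit(2000))
```

Calling raise_nofile_limit(2000) corresponds to running ulimit -n 2000, except that a request above the hard limit is silently capped instead of rejected.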
@branfosj Do your tests have dependencies?
I've set off a run with an increased ulimit -n.
None of my tests have dependencies.
With ulimit -n 2000 my tests complete successfully.