nhc
LBNL Node Health Check
nhc/scripts/lbnl_job.nhc: function nhc_job_find_users()
```
if [[ "${JOBUSERS[*]//$JOBUSER}" = "${JOBUSERS[*]}" ]]; then
    JOBUSERS[${#JOBUSERS[*]}]="$JOBUSER"
fi
```
I cannot understand why variable substitution is used to check whether an element exists in the array. The trick...
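For readers puzzling over the same snippet, here is a minimal sketch of how the substitution test behaves. The wrapper function `add_unique_user` and the sample user names are hypothetical, added only for illustration; the `if` block itself is the one quoted above. The expansion `${JOBUSERS[*]//$JOBUSER}` deletes every occurrence of `$JOBUSER` from the joined array, so the two strings are equal only when the name does not occur anywhere in it. Note that this is a substring test, not an exact element match.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the quoted check, for illustration only.
JOBUSERS=()

add_unique_user() {
    local JOBUSER="$1"
    # If deleting $JOBUSER from the joined array changes nothing,
    # the name is not present yet, so append it.
    if [[ "${JOBUSERS[*]//$JOBUSER}" = "${JOBUSERS[*]}" ]]; then
        JOBUSERS[${#JOBUSERS[*]}]="$JOBUSER"
    fi
}

add_unique_user alice   # appended
add_unique_user bob     # appended
add_unique_user alice   # already present, skipped
add_unique_user ali     # skipped too: "ali" is a substring of "alice"

echo "${JOBUSERS[*]}"
```

The last call shows the trade-off: the check needs no loop, but a name that is a substring of an existing entry is treated as already present.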
Hi, we recently deployed some dual-socket 64-core EPYC servers with all 256 threads enabled. I was surprised to discover that nhc always times out on these machines. With some poking...
A simple test case, `nhc.conf`:
```
* || TIMEOUT=120
* || check_cmd_status -t 100 -r 0 /bin/true
```
Now if nhc is run on a node, the main process returns...
If a passwd file contains a line like the one below, with a '*' in the second field, you'll get an error like the one below if the CWD's first listed file has a...
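The report above is truncated, so the exact failure inside nhc is not shown here. As a general illustration only, the sketch below shows one way a literal `*` in a passwd field can interact with the contents of the current directory: an unquoted expansion is subject to pathname expansion (globbing). The sample passwd line, the temporary directory, and the file names are all assumptions made up for this example.

```shell
#!/usr/bin/env bash
# Illustration only: sample data and file names are hypothetical.
origdir=$PWD
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch first_file second_file

line='user:*:1000:1000::/home/user:/bin/bash'
# Split the line on ':'; the second field is a literal '*'.
IFS=: read -r name pw rest <<< "$line"

unquoted=( $pw )     # unquoted: '*' is glob-expanded to the files in the CWD
quoted=( "$pw" )     # quoted: '*' stays a literal asterisk

echo "unquoted has ${#unquoted[@]} words: ${unquoted[*]}"
echo "quoted has ${#quoted[@]} word: ${quoted[*]}"

cd "$origdir" && rm -rf "$tmpdir"
```

Quoting the expansion, or disabling globbing with `set -f` around the parsing code, keeps the field literal regardless of what the working directory contains.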
How can I use nhc to check my Lustre file system health? When I set it to use "* || check_cmd_output -t 5 -m '135T' -e '/usr/bin/lfs df -h|grep filesystem|grep T'"...
The release information on the GitHub page for 1.4.3: * https://github.com/mej/nhc/releases/tag/1.4.3 should probably be copied into the RELEASE_NOTES.txt file.
This takes work from #84 and expands on it a bit. Now something like this:
```
scontrol reboot ASAP nextstate=DOWN
```
will reboot a node and when it comes up...
I have only deployed this on one system, one where I knew there were GPFS network issues with nodes not using the RDMA that was configured: ``` [root@p0001 ~]# nhc...
We've used these checks to detect kernel memory leaks that cause a node to consume all physical memory without being detected by conventional means like top, ps or...