LBNL Node Health Check

Results 83 nhc issues

nhc/scripts/lbnl_job.nhc: function nhc_job_find_users()
```
if [[ "${JOBUSERS[*]//$JOBUSER}" = "${JOBUSERS[*]}" ]]; then
    JOBUSERS[${#JOBUSERS[*]}]="$JOBUSER"
fi
```
I cannot understand why variable substitution is used to check whether an element exists in the array. The trick...
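For readers unfamiliar with the idiom quoted above, here is a minimal sketch of how the substitution-based membership test behaves. The wrapper function `add_unique_user` and the user names are hypothetical and are not part of nhc; only the `if` test is taken from the snippet. The sketch also shows one likely reason the question was raised: the test matches substrings, not whole elements.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the test quoted in the issue (not nhc code).
add_unique_user() {
    local JOBUSER="$1"
    # "${JOBUSERS[*]//$JOBUSER}" deletes every match of $JOBUSER from each
    # element, then joins the elements with spaces. The result equals the
    # plain joined array only when $JOBUSER occurs nowhere in it.
    if [[ "${JOBUSERS[*]//$JOBUSER}" = "${JOBUSERS[*]}" ]]; then
        JOBUSERS[${#JOBUSERS[*]}]="$JOBUSER"   # append at the next index
    fi
}

JOBUSERS=()
add_unique_user alice
add_unique_user bob
add_unique_user alice      # already present, so not appended again
echo "${JOBUSERS[*]}"      # alice bob

# Caveat: "al" is a substring of "alice", so it is treated as present.
add_unique_user al
echo "${#JOBUSERS[*]}"     # 2
```

The appeal of the idiom is that it avoids an explicit loop over the array. The cost is the substring behavior shown at the end, plus the fact that the unquoted `$JOBUSER` is interpreted as a glob pattern.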

bug

Hi, we recently deployed some 2x 64c epyc servers with all 256 threads enabled. I was surprised to discover that nhc always times out on these machines. With some poking...

A simple test case: `nhc.conf`:
```
* || TIMEOUT=120
* || check_cmd_status -t 100 -r 0 /bin/true
```
Now if nhc is run on a node, the main process returns...

If a passwd file contains a line like the one below, with a '*' in the second field, you'll get an error like the one below if the CWD's first listed file has a...

How can I use nhc to check my Lustre file system health? When I set it to use "* || check_cmd_output -t 5 -m '135T' -e '/usr/bin/lfs df -h|grep filesystem|grep T'"...

question
need info
portability

The release information on the GitHub page for 1.4.3 (https://github.com/mej/nhc/releases/tag/1.4.3) should probably be copied into the RELEASE_NOTES.txt file.

This takes work from #84 and expands on it a bit. Now something like this: ``` scontrol reboot ASAP nextstate=DOWN ``` Will reboot a node and when it comes up...

I have only deployed this on one system, one where I knew there were GPFS network issues, with nodes not using the RDMA that was configured: ``` [root@p0001 ~]# nhc...

enhancement

We've used these checks to detect kernel memory leaks that cause a node to consume all physical memory without being detected by conventional means like top, ps or...

enhancement