Cluster check

Open dselans opened this issue 9 years ago • 2 comments

Create a 'cluster'/rollup check.

This would allow you to group multiple checks together and expose the 'cluster' check as a single entity. Thresholds should be percentage-based.

Would be nice:

Cluster checks that support the use of 'tags'. I.e., when creating the cluster check, you do not have to list specific checks; instead, you specify one or more tags that other checks use.

example:

monitor:
  exec-cluster-check:
    type: cluster-tags
    description: cluster check for important execs
    interval: 10s
    monitor-tags:
      - very-important
    warning-threshold: 20% # 20 percent of the checks are failing
    critical-threshold: 50% # 50 percent of the checks are failing
    warning-alerter:
      - primary-slack
    critical-alerter:
      - primary-email
    tags:
      - our-cluster-checks

  exec-check1:
    type: exec
    description: exec check test
    timeout: 5s
    command: echo
    args:
      - hello
      - world
    interval: 10s
    return-code: 0
    expect: hello
    warning-threshold: 1
    critical-threshold: 3
    tags:
      - super-exec-checks
      - very-important

  exec-check2:
    type: exec
    description: exec check test
    timeout: 5s
    command: echo
    args:
      - hello
      - world
    interval: 10s
    return-code: 0
    expect: world
    warning-threshold: 1
    critical-threshold: 3
    warning-alerter:
      - primary-slack
    critical-alerter:
      - primary-email
    tags:
      - super-exec-checks
      - very-important

In the above example:

We create an 'exec-cluster-check' that monitors the state of the 2 checks selected through the very-important tag. If 20% of the underlying checks fail, it produces a warning alert; if 50% of the underlying checks fail, it produces a critical alert.
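To make the rollup logic concrete, here is a minimal sketch in Python. It is not 9volt code. The names `Check`, `select_by_tags` and `cluster_state` are hypothetical. It also assumes two things the proposal does not spell out: a check becomes a member if it carries at least one of the `monitor-tags`, and thresholds are compared with "greater than or equal".

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    """Hypothetical stand-in for a monitor entry and its last known result."""
    name: str
    tags: set[str] = field(default_factory=set)
    failing: bool = False


def select_by_tags(checks, monitor_tags):
    """Return the checks carrying at least one of the monitor tags (assumed rule)."""
    wanted = set(monitor_tags)
    return [c for c in checks if c.tags & wanted]


def cluster_state(members, warning_pct, critical_pct):
    """Roll member results up into a single state using percentage thresholds."""
    if not members:
        # No members matched the tags, so there is nothing to evaluate.
        return "unknown"
    failing_pct = 100.0 * sum(c.failing for c in members) / len(members)
    if failing_pct >= critical_pct:
        return "critical"
    if failing_pct >= warning_pct:
        return "warning"
    return "ok"


checks = [
    Check("exec-check1", {"super-exec-checks", "very-important"}),
    Check("exec-check2", {"super-exec-checks", "very-important"}, failing=True),
]
members = select_by_tags(checks, ["very-important"])
print(cluster_state(members, warning_pct=20, critical_pct=50))
```

One consequence of percentage thresholds: with only two members, a single failure is already 50%, so under these assumptions the cluster check would go straight to critical without passing through warning.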

dselans avatar Feb 05 '17 23:02 dselans

Do you anticipate this check running those checks a second time, or re-using the existing check state from the last run?

relistan avatar Mar 04 '17 18:03 relistan

I think this should reuse check state data. Not sure how tricky that could be, though (handling partial state only, etc.).
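One possible way to handle partial state, sketched in Python rather than taken from 9volt: compute the failing percentage only over members that have already reported, and surface how many are still missing. The function name `failing_percentage` and the `last_state` mapping (check name to `True` when failing) are assumptions made for illustration.

```python
def failing_percentage(member_names, last_state):
    """Compute the failing percentage from stored state only.

    last_state maps a check name to True (failing) or False (passing).
    Members absent from last_state have not reported yet and are skipped.
    Returns (percentage or None, number of members with no state).
    """
    known = [last_state[name] for name in member_names if name in last_state]
    missing = len(member_names) - len(known)
    if not known:
        # Nothing has reported yet, so no percentage can be computed.
        return None, missing
    return 100.0 * sum(known) / len(known), missing


pct, missing = failing_percentage(
    ["exec-check1", "exec-check2", "exec-check3"],
    {"exec-check1": True, "exec-check2": False},
)
print(pct, missing)
```

Whether missing members should be skipped, counted as failing, or block evaluation altogether is a design choice this sketch leaves open; it only shows the "skip and report" option.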

dselans avatar Apr 30 '17 02:04 dselans