metrics: zos metrics for system and user containers
Metrics will be available to Prometheus. The node pushes the metrics (no polling) to a configured Prometheus endpoint.
https://app.mindmup.com/map/_v2/90ceb3f0346c11ebb8fbcddb6bd8c75c
pdf download of the mindmup: zosmonitoring.pdf
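
To make the push model above concrete, here is a minimal sketch of how a node could encode and push a batch of values. It assumes the configured endpoint is a Pushgateway-style receiver, since Prometheus itself scrapes targets and pushed metrics usually go through such a gateway. The `Sample` type, the `Encode` and `Push` helpers, and the metric names are hypothetical and not part of zos.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/url"
	"sort"
	"strings"
)

// Sample is a hypothetical container for one untyped metric value.
type Sample struct {
	Name  string // metric name, e.g. "zos_cpu_usage_percent" (illustrative only)
	Value float64
}

// Encode renders samples in the Prometheus text exposition format,
// one "name value" line per sample, sorted by name for stable output.
func Encode(samples []Sample) string {
	sorted := append([]Sample(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	var b strings.Builder
	for _, s := range sorted {
		fmt.Fprintf(&b, "%s %g\n", s.Name, s.Value)
	}
	return b.String()
}

// Push sends the encoded samples with an HTTP PUT to
// <baseURL>/metrics/job/<job>/instance/<instance>, the grouping-key URL
// layout used by the Prometheus Pushgateway (assumed receiver).
func Push(client *http.Client, baseURL, job, instance string, samples []Sample) error {
	target := fmt.Sprintf("%s/metrics/job/%s/instance/%s",
		strings.TrimRight(baseURL, "/"), url.PathEscape(job), url.PathEscape(instance))
	req, err := http.NewRequest(http.MethodPut, target, bytes.NewBufferString(Encode(samples)))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "text/plain; version=0.0.4")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("push failed: %s", resp.Status)
	}
	return nil
}
```

A node would call `Push(http.DefaultClient, endpoint, "zos", nodeID, samples)` on its own schedule, where `endpoint` and `nodeID` come from the node configuration.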
The required metrics are:
- metrics
- cpu
- memory
- disks
- io
- sizes
- subvolumes ?
- actual disk usage
- error rates ?
- number of reservations
- ... ?
Process:
- [x] build collectors for all basic metrics
- [ ] Question: percentage of some values like context switches
- [ ] Question: disk health status
- [x] use aggregation with redis lua script.
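
As an illustration of what the aggregation step does, the sketch below keeps a running average per metric over fixed time windows. It is an in-memory Go stand-in, not the Redis Lua script mentioned in the checklist. The `Aggregator` type, its window length, and the choice of averaging are all assumptions made for the example.

```go
package main

// bucket accumulates the samples of one metric within one time window.
type bucket struct {
	start int64 // window start, Unix seconds
	sum   float64
	count int
}

// Aggregator is a hypothetical helper that averages samples per metric
// over fixed windows. Only the current window of each metric is kept.
type Aggregator struct {
	window  int64 // window length in seconds, must be positive
	buckets map[string]*bucket
}

// NewAggregator creates an aggregator with the given window length.
func NewAggregator(windowSeconds int64) *Aggregator {
	return &Aggregator{window: windowSeconds, buckets: make(map[string]*bucket)}
}

// Add records a sample taken at ts (Unix seconds) and returns the running
// average of the window that contains ts. A sample that falls into a new
// window discards the previous window for that metric.
func (a *Aggregator) Add(name string, ts int64, value float64) float64 {
	start := ts - ts%a.window
	b, ok := a.buckets[name]
	if !ok || b.start != start {
		b = &bucket{start: start}
		a.buckets[name] = b
	}
	b.sum += value
	b.count++
	return b.sum / float64(b.count)
}
```

With a 300 second window, samples at `ts=0` and `ts=60` are averaged together, and a sample at `ts=300` starts a new window.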
Proposal:
Cook up a new daemon that opens a connection on zbus to the other daemons. It periodically fetches all metrics from the different daemons: provisiond for reservations, storaged for disks and usage, etc. Once these metrics are computed, it pushes them to Prometheus. I think sending data every 10-15 minutes is sufficient.
This way we can create dashboards for farmers in Grafana. They can, for example, aggregate data from all their nodes, set alerts for disk failures, check how much is running, ...
I am sure we can already leverage https://github.com/prometheus/node_exporter or something similar. It already does the monitoring for you and is Prometheus compatible.