Node running prometheus becomes unavailable

Open ayellapragada opened this issue 4 years ago • 0 comments

This has been happening on Albaik recently: the node that Prometheus is running on becomes unavailable, which causes issues for monitoring and application-level autoscaling. The current fix is to manually intervene and replace the node, as documented [here](https://github.com/thoughtbot/mission-control-platform/blob/main/aws/src/debug/cluster-errors.md#unreachable-nodes).

This issue tracks debugging the failure, figuring out why it happens, and resolving it automatically.
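
As a possible starting point for the automatic part, here is a rough sketch of detecting unavailable nodes. It assumes the cluster is Kubernetes and that the input has the shape of `kubectl get nodes -o json` output; neither is stated in this issue. The function name `unready_nodes` is made up for illustration. It only reports nodes and does not replace them.

```python
def unready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True".

    Assumption: node_list is a dict shaped like `kubectl get nodes -o json`
    output, i.e. {"items": [{"metadata": {...}, "status": {"conditions": [...]}}]}.
    """
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition, or a status of "False" or "Unknown",
        # is treated as unavailable.
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names
```

Feeding it the parsed JSON from `kubectl get nodes -o json` would give a list of candidate nodes to investigate or replace. Whether replacement should then happen automatically is the open question here.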

Nov 08 '21 14:11 ayellapragada