flightdeck
Node running prometheus becomes unavailable
This has been happening on Albaik recently: the node that Prometheus is running on becomes unavailable, which causes issues for monitoring and application-level autoscaling. The current fix is to manually intervene and replace the node, as documented [here](https://github.com/thoughtbot/mission-control-platform/blob/main/aws/src/debug/cluster-errors.md#unreachable-nodes).
This issue tracks debugging the problem, figuring out why it happens, and resolving it automatically.
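
As a possible starting point for the automatic part, here is a minimal sketch of detecting unavailable nodes. It assumes the affected node shows up with a `Ready` condition other than `True` in the output of `kubectl get nodes -o json`, which may not match what actually happens on Albaik. The function names `unready_nodes` and `fetch_nodes` are hypothetical, and this is not the procedure from the linked doc.

```python
import json
import subprocess


def unready_nodes(node_list: dict) -> list[str]:
    """Return names of nodes whose Ready condition is not "True".

    `node_list` is the parsed JSON of `kubectl get nodes -o json`.
    """
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition, "False", and "Unknown" are all
        # treated as unavailable (assumption for this sketch).
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names


def fetch_nodes() -> dict:
    """Fetch the node list via kubectl; requires access to the cluster."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `unready_nodes(fetch_nodes())` on a schedule would give a list of candidate nodes to investigate or replace. Whether replacement should be triggered from that list is still an open question for this issue.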