Alertmanager heartbeat (webhook) stuck after Opsgenie connection issue

What did you do?

Recently we started getting regular Opsgenie heartbeat-expired alerts. The Alertmanager logs indicated a temporary issue connecting to Opsgenie. Perhaps Opsgenie has become less reliable recently, which revealed a bug that has existed for a long time.

After this issue, Alertmanager stopped sending heartbeats and alerts via the webhook integration entirely. Restarting Alertmanager resolved it. We have also observed the same issue with a webhook on another setup that is unrelated to Opsgenie.

What did you expect to see?

The Alertmanager webhook integration should recover from connection issues.

What did you see instead? Under which circumstances?

The Alertmanager webhook integration did not recover from the issue by itself and required a restart to recover.

Environment:

kube-prometheus-stack and VictoriaMetrics. Alertmanager versions 0.25.0 and 0.26.0 are affected.

System information:

Kubernetes / GKE and Rancher / RKE2

Alertmanager version:

Initially we observed the issue with alertmanager 0.25.0 and upgraded to 0.26.0 hoping to solve the issue.
But 0.26.0 showed the exact same error.
version="(version=0.25.0, branch=HEAD, revision=258fab7cdd551f2cf251ed0348f0ad7289aee789)"
version="(version=0.26.0, branch=HEAD, revision=d7b4f0c7322e7151d6e3b1e31cbc15361e295d8d)"

Prometheus version:

Both Prometheus and VictoriaMetrics setups are affected.

Logs:

ts=2024-01-05T20:52:00.263Z caller=notify.go:757 level=info component=dispatcher receiver=opsgenie.heartbeat integration=webhook[0] aggrGroup="{}/{alertname=~\"Watchdog|InfoInhibitor\"}:{alertname=\"Watchdog\", cluster=\"redacted by me\"}" msg="Notify success" attempts=2
ts=2024-01-05T20:51:59.463Z caller=notify.go:745 level=warn component=dispatcher receiver=opsgenie.heartbeat integration=webhook[0] aggrGroup="{}/{alertname=~\"Watchdog|InfoInhibitor\"}:{alertname=\"Watchdog\", cluster=\"redacted by me\"}" msg="Notify attempt failed, will retry later" attempts=1 err="Post \"<redacted>\": read tcp 172.16.0.20:44786->52.84.251.74:443: read: connection reset by peer"

Sometimes these logs even contain an HTTP status page returned by Opsgenie, but that would be too noisy to post here.

Jan 12 '24 12:01 D3luxee

We are facing a similar issue: we frequently get heartbeat-expired alerts, and they go away after an Alertmanager restart. The problem seems to be that the TCP connection is not closed after the heartbeat check. Instead of opening a new connection, Alertmanager reuses the existing kept-alive connection, and that connection may be the faulty one.

alertmanager version 0.25.0
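
To make the keep-alive behaviour described above concrete, here is a minimal Go sketch using only the standard library. It is not Alertmanager's code: the helper name countConnections and the local test server are assumptions made for illustration. It counts how many TCP connections a server accepts when a client reuses kept-alive connections versus when keep-alives are disabled on the transport.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
)

// countConnections (hypothetical helper) sends n sequential POST requests to a
// local test server and returns how many TCP connections the server accepted.
func countConnections(disableKeepAlives bool, n int) (int, error) {
	var accepted int64

	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.Copy(io.Discard, r.Body) // drain the request body
		fmt.Fprint(w, "ok")
	}))
	// ConnState is called by the server whenever a connection changes state;
	// StateNew marks a freshly accepted TCP connection.
	srv.Config.ConnState = func(_ net.Conn, state http.ConnState) {
		if state == http.StateNew {
			atomic.AddInt64(&accepted, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	transport := &http.Transport{DisableKeepAlives: disableKeepAlives}
	defer transport.CloseIdleConnections()
	client := &http.Client{Transport: transport}

	for i := 0; i < n; i++ {
		resp, err := client.Post(srv.URL, "application/json", strings.NewReader("{}"))
		if err != nil {
			return 0, err
		}
		// Reading the body to EOF and closing it lets the transport
		// return the connection to its idle pool for reuse.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return int(atomic.LoadInt64(&accepted)), nil
}

func main() {
	reused, err := countConnections(false, 3)
	if err != nil {
		panic(err)
	}
	fresh, err := countConnections(true, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println("connections with keep-alive:", reused)
	fmt.Println("connections without keep-alive:", fresh)
}
```

With keep-alives enabled, the sequential requests share a pooled connection, so a connection that has gone bad on one side can be picked up again by the next request. With keep-alives disabled, each request dials a new connection. Whether this is the mechanism behind the stuck webhook is not confirmed here.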

Mar 25 '24 10:03 subhashgehlot

Given the error read: connection reset by peer it does sound like it is using a "dead" connection where the one side is considered open but the other side is closed, hence the TCP RST. I just looked at the code and the default idle timeout is 5 minutes. Does the issue resolve after 5 minutes if the Alertmanager is not restarted?

Apr 17 '24 17:04 grobinson-grafana

No, the issue persists, and an Alertmanager restart is needed to resolve it.


Apr 17 '24 17:04 subhashgehlot


You waited 5 minutes?

Apr 17 '24 18:04 grobinson-grafana

Yes, waited for more than 5 minutes and then restarted.

Apr 18 '24 03:04 subhashgehlot