health monitor keeps triggering scan & fix tasks due to slow nats client

Open fkittelinger opened this issue 5 months ago • 0 comments

Stemcell: bosh-openstack-kvm-ubuntu-jammy-go_agent-raw/1.803
Bosh version: v282.0.0
bosh-openstack-cpi: 55.0.1
Managing 731 deployments, 1273 agents

We ran into the following situation:

The number of bosh scan and fix tasks stays at roughly the number of deployments (500-700 tasks). As soon as a task finishes, a new scan and fix is queued immediately. According to the metrics, the VMs of that director were unresponsive, but when checked with bosh vms or bosh instances, all VMs were reported as healthy.

In the health_monitor logs the following line appears repeatedly:

ERROR : NATS client error: nats: slow consumer, messages dropped

Restarting the health_monitor process resolves the stuck state, and the number of bosh scan & fix tasks decreases. After the restart we now see 1277 NATS connections, checked with netstat -anp | grep 4222
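
For anyone who wants to track that connection count over time rather than eyeballing the netstat output, here is a minimal sketch that counts established TCP connections on port 4222 from captured `netstat -anp` text. The function name `count_nats_connections` and the sample lines are made up for illustration; the column layout assumed is the usual Linux net-tools one (proto, recv-q, send-q, local address, foreign address, state).

```python
def count_nats_connections(netstat_output: str, port: int = 4222) -> int:
    """Count ESTABLISHED TCP connections whose local or foreign port matches."""
    suffix = f":{port}"
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, unix sockets and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if state == "ESTABLISHED" and (
            local.endswith(suffix) or foreign.endswith(suffix)
        ):
            count += 1
    return count


# Hypothetical sample, not taken from the affected director.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address    Foreign Address   State       PID/Program name
tcp        0      0 0.0.0.0:4222     0.0.0.0:*         LISTEN      101/nats-server
tcp        0      0 10.0.0.5:4222    10.0.1.7:51234    ESTABLISHED 101/nats-server
tcp        0      0 10.0.0.5:4222    10.0.1.8:40022    ESTABLISHED 101/nats-server
tcp        0      0 10.0.0.5:14222   10.0.1.9:40100    ESTABLISHED 202/other
"""

nats_connection_count = count_nats_connections(SAMPLE)
```

Unlike a plain `grep 4222`, this skips the listening socket and ports that merely contain the digits 4222, so the number may come out slightly lower than the grep-based count.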

Before and after the huge queue of scan and fix tasks, the health_monitor logs also show numerous lines like

I, [2025-08-08T06:47:27.906126 #7] INFO : [ALERT] Alert @ 2025-08-08 06:47:27 UTC, severity 1: process is not running
I, [2025-08-08T06:47:27.906275 #7] INFO : (Event logger) notifying director about event: Alert @ 2025-08-08 06:47:27 UTC, severity 1: process is not running

fkittelinger avatar Aug 08 '25 11:08 fkittelinger