bosh-nats-sync fails as long as no UAA is available
Describe the bug
With the new NATS version in BOSH 274, deploying BOSH sometimes fails with:
Task 12439 | 13:26:49 | L starting jobs: bosh/ad445f92-5fab-4070-bdd4-1071258ba02d (0) (canary) (00:08:12)
                      L Error: 'bosh/ad445f92-5fab-4070-bdd4-1071258ba02d (0)' is not running after update. Review logs for failed jobs: health_monitor
Task 12439 | 13:32:51 | Error: 'bosh/ad445f92-5fab-4070-bdd4-1071258ba02d (0)' is not running after update. Review logs for failed jobs: health_monitor
Updating deployment:
  Expected task '12439' to succeed but state is 'error'
We found that the bosh-nats-sync job cannot authenticate as long as the co-deployed UAA is not running:
[2022-10-13T14:12:56.206749 #647762] INFO : Nats Sync starting...
[2022-10-13T14:13:06.290402 #647762] INFO : Executing NATS Users Synchronization
[2022-10-13T14:13:06.522845 #647762] ERROR : Failed to obtain token from UAA: #<CF::UAA::BadTarget: error: Failed to open TCP connection to 192.168.1.11:8443 (Connection refused - connect(2) for 192.168.1.11:8443)>
[2022-10-13T14:13:06.602752 #647762] FATAL : 401 Unauthorized
As a result, the health_monitor cannot use NATS. Once UAA had started about 5 minutes later, everything worked fine.
Expected behavior
The bosh-nats-sync job waits until UAA is started. All jobs that depend on NATS, such as the health_monitor, wait until bosh-nats-sync is started.
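To illustrate the kind of waiting we have in mind, here is a minimal sketch in Python. It is not taken from bosh-nats-sync (which is a Ruby job); the function name `wait_for_port` and its parameters are hypothetical. It only polls until a TCP connection to the UAA endpoint succeeds, which is a weaker condition than UAA being ready to issue tokens, so a real fix would probably retry the token request itself.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=5.0, sleep=time.sleep):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration only; not part of bosh-nats-sync.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only shows that something is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Includes ConnectionRefusedError while the service is still starting.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

With the address from the log above, a caller could run `wait_for_port("192.168.1.11", 8443)` before the first synchronization and only attempt to obtain a token once it returns `True`.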
Versions:
- Infrastructure: AWS
- BOSH version 274.4
- Stemcell version [e.g. ubuntu-jammy/1.18]