[BUG] FlagSyncService failure not reflected in Healthchecks
Observed behavior
Hello! I'm not sure if this is one or two bugs (or my lack of understanding about health checks), but they appear to be related.
When using Flagd with an HTTP sync target and the connection to the target times out, requests to the In-process Resolver on port 8013 are refused. However, the /healthz endpoint on 8014 still reports healthy.
- HTTP: 200 OK (no response body)
- gRPC: status: SERVING
Additionally, when Flagd is in this sync failure state, it does not restart gracefully with systemctl restart flagd. I have to kill the process with pkill -9 flagd; otherwise the Flagd process continues to "run".
2025-03-28T07:03:47.479-0500 info cmd/start.go:124 flagd version: v0.12.1 (82dc4e4c6c229e42ecb723f4866ba343be9d2b89), built at: 2025-02-04 {"component": "start"}
2025-03-28T07:03:47.480-0500 info flag-sync/sync_service.go:87 starting flag sync service on port 8015 {"component": "FlagSyncService"}
2025-03-28T07:03:47.482-0500 info ofrep/ofrep_service.go:58 ofrep service listening at 8016 {"component": "OFREPService"}
2025-03-28T07:03:47.483-0500 info flag-evaluation/connect_service.go:249 metrics and probes listening at 8014 {"component": "service"}
2025-03-28T07:03:47.483-0500 info flag-evaluation/connect_service.go:229 Flag IResolver listening at [::]:8013 {"component": "service"}
2025-03-28T07:03:47.484-0500 info flag-sync/sync_service.go:155 shutting down gRPC sync service {"component": "FlagSyncService"}
2025-03-28T07:03:47.484-0500 info ofrep/ofrep_service.go:69 shutting down ofrep service {"component": "OFREPService"}
2025-03-28T07:03:52.486-0500 warn flag-sync/sync_service.go:113 timeout while waiting for all sync sources to complete their initial sync. continuing sync service {"component": "FlagSyncService"}
2025-03-28T07:03:52.486-0500 warn flag-sync/sync_service.go:122 error from sync server start: grpc: the server has been stopped {"component": "FlagSyncService"}
Expected Behavior
When the Sync Service inside Flagd is not running or cannot connect to the sync target, I expect the /healthz endpoint to return a non-200 status for HTTP requests, and NOT_SERVING for gRPC health queries.
Steps to reproduce
Run Flagd with an HTTP sync target that allows the TCP connection to be opened, but times out waiting for an HTTP response.
flagd --uri=http://localhost:8500/v1/kv/config.json?raw=true
In our case, we pull the config from Consul's KV using a local Consul Agent, but the agent isn't connected to the rest of the cluster.
Hey @rgrizzell, I agree that doesn't seem correct. We'll try and take a look as soon as possible. However, most of the team will be traveling next week. In the meantime, if you're able to identify the root cause, feel free to open a PR. Thanks!