
[BUG] FlagSyncService failure not reflected in Healthchecks

Open rgrizzell opened this issue 10 months ago • 1 comments

Observed behavior

Hello! I'm not sure if this is one or two bugs (or my lack of understanding about health checks), but they appear to be related.

When using Flagd with an HTTP sync target and the connection to the target times out, requests to the In-process Resolver on port 8013 are refused. However, the /healthz endpoint on 8014 still reports healthy.

  • HTTP: 200 OK (No response body)
  • gRPC: status: SERVING
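
The mismatch described above can be checked from the outside with a small probe. This is only a sketch using the Python standard library: the function names are made up for illustration, and the ports (8013 for the resolver, 8014 for /healthz) are the ones from this report.

```python
import socket
import urllib.error
import urllib.request


def healthz_status(base_url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of GET <base_url>/healthz.

    Raises urllib.error.URLError if the endpoint cannot be reached at all.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers raise HTTPError; the status code is still what we want.
        return err.code


def port_accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def is_inconsistent(health_code: int, resolver_reachable: bool) -> bool:
    """True when the health endpoint says OK but the resolver port is refused."""
    return health_code == 200 and not resolver_reachable
```

Against a local Flagd in the failure state, `is_inconsistent(healthz_status("http://localhost:8014"), port_accepts_connections("localhost", 8013))` would be expected to return True if the behavior matches this report.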

Additionally, when Flagd is in this sync failure state, it does not restart gracefully with systemctl restart flagd. The process has to be killed with pkill -9 flagd; otherwise the Flagd process will continue to "run".

2025-03-28T07:03:47.479-0500        info        cmd/start.go:124        flagd version: v0.12.1 (82dc4e4c6c229e42ecb723f4866ba343be9d2b89), built at: 2025-02-04        {"component": "start"}
2025-03-28T07:03:47.480-0500        info        flag-sync/sync_service.go:87        starting flag sync service on port 8015        {"component": "FlagSyncService"}
2025-03-28T07:03:47.482-0500        info        ofrep/ofrep_service.go:58        ofrep service listening at 8016        {"component": "OFREPService"}
2025-03-28T07:03:47.483-0500        info        flag-evaluation/connect_service.go:249        metrics and probes listening at 8014        {"component": "service"}
2025-03-28T07:03:47.483-0500        info        flag-evaluation/connect_service.go:229        Flag IResolver listening at [::]:8013        {"component": "service"}
2025-03-28T07:03:47.484-0500        info        flag-sync/sync_service.go:155        shutting down gRPC sync service        {"component": "FlagSyncService"}
2025-03-28T07:03:47.484-0500        info        ofrep/ofrep_service.go:69        shutting down ofrep service        {"component": "OFREPService"}
2025-03-28T07:03:52.486-0500        warn        flag-sync/sync_service.go:113        timeout while waiting for all sync sources to complete their initial sync. continuing sync service        {"component": "FlagSyncService"}
2025-03-28T07:03:52.486-0500        warn        flag-sync/sync_service.go:122        error from sync server start: grpc: the server has been stopped        {"component": "FlagSyncService"}

Expected Behavior

When the Sync Service inside Flagd is not running or cannot connect to the sync target, I expect the /healthz endpoint to return a non-200 status for HTTP requests, and NOT_SERVING for gRPC health queries.
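
To make the expectation concrete, here is a minimal sketch of the aggregation I have in mind. It is not Flagd's code: the HealthState class and the component names are hypothetical, and 503 is just one possible non-200 status.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HealthState:
    """Tracks per-component health. Names are illustrative, not Flagd's."""

    components: Dict[str, bool] = field(default_factory=dict)

    def set(self, name: str, healthy: bool) -> None:
        self.components[name] = healthy

    def healthy(self) -> bool:
        # Healthy only if at least one component is registered and all report OK.
        return bool(self.components) and all(self.components.values())

    def http_status(self) -> int:
        # 503 is an assumed choice for the "non-200" case.
        return 200 if self.healthy() else 503

    def grpc_status(self) -> str:
        # Status names from the standard gRPC health checking protocol.
        return "SERVING" if self.healthy() else "NOT_SERVING"
```

With this model, marking a hypothetical "FlagSyncService" component as unhealthy flips both answers at once, so the HTTP and gRPC probes cannot disagree with the state of the sync service.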

Steps to reproduce

Run Flagd with an HTTP sync target that allows the TCP connection to be opened, but times out waiting for an HTTP response.

flagd --uri=http://localhost:8500/v1/kv/config.json?raw=true

In our case, we pull the config from Consul's KV using a local Consul Agent, but the agent isn't connected to the rest of the cluster.
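
For anyone without a disconnected Consul Agent at hand, a stand-in for the hanging sync target might look like the sketch below. It is an assumption-laden substitute for our setup, not what we actually run: the helper name is made up, and it simply accepts TCP connections on port 8500 and never answers.

```python
import socket
import threading


def start_silent_server(host: str = "127.0.0.1", port: int = 8500) -> socket.socket:
    """Listen on host:port, accept connections, and never send a response."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen()
    held = []  # keep accepted sockets referenced so they stay open

    def accept_loop() -> None:
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return  # listener was closed
            held.append(conn)  # never read from or write to the connection

    # Daemon thread: it ends when the main program exits.
    threading.Thread(target=accept_loop, daemon=True).start()
    return listener
```

Calling `start_silent_server()` and keeping the Python process alive, then starting Flagd with the URI above, should give a target that opens the TCP connection but never returns an HTTP response.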

rgrizzell, Mar 28 '25 17:03

Hey @rgrizzell, I agree that doesn't seem correct. We'll try and take a look as soon as possible. However, most of the team will be traveling next week. In the meantime, if you're able to identify the root cause, feel free to open a PR. Thanks!

beeme1mr, Mar 28 '25 19:03