How can we determine when the replica data across a StatefulSet's (or Deployment's) pods is fully synchronized?
For certain reasons, we need to perform a Blue-Green deployment.
So I'm reaching out to ask: is there a way to detect when the replicas across nodes are fully synchronized and ready before scaling down either the blue or green stage?
One way to confirm is by checking the logs, which will look like this:
Term: 3, pending_queue: 0, last_index: 13, committed: 13, known_applied: 13, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 44941
When queued_writes is 0 and last_index matches known_applied, the node is fully caught up.
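If you want to automate that check before scaling down, here's a rough sketch (not an official tool; the regex just mirrors the fields in the status line above, and in practice you would feed it the most recent raft_server.cpp status line from each pod's logs, e.g. via kubectl logs):

```python
import re

# Fields as they appear in the raft_server.cpp status line shown above.
RAFT_LINE = re.compile(
    r"Term: (?P<term>\d+), pending_queue: (?P<pending_queue>\d+), "
    r"last_index: (?P<last_index>\d+), committed: (?P<committed>\d+), "
    r"known_applied: (?P<known_applied>\d+), applying: (?P<applying>\d+), "
    r"pending_writes: (?P<pending_writes>\d+), queued_writes: (?P<queued_writes>\d+), "
    r"local_sequence: (?P<local_sequence>\d+)"
)

def is_caught_up(log_line: str) -> bool:
    """Return True when the node reports it is fully caught up:
    queued_writes == 0 and last_index == known_applied."""
    m = RAFT_LINE.search(log_line)
    if not m:
        raise ValueError("not a raft_server.cpp status line")
    fields = {name: int(value) for name, value in m.groupdict().items()}
    return fields["queued_writes"] == 0 and fields["last_index"] == fields["known_applied"]

# Example using the status line above.
line = ("Term: 3, pending_queue: 0, last_index: 13, committed: 13, "
        "known_applied: 13, applying: 0, pending_writes: 0, "
        "queued_writes: 0, local_sequence: 44941")
print(is_caught_up(line))  # True
```

You would run this against every replica (blue and green) and only proceed with the scale-down once all of them report caught up.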
@kishorenc ohh that is really nice, thanks a lot!
I have a case where queued_writes is not zero but last_index matches known_applied. What should I assume from this?
search-stg-sts-2 typesense I20250331 10:51:43.592514 167 raft_server.cpp:706] Term: 22, pending_queue: 0, last_index: 8561, committed: 8561, known_applied: 8561, applying: 0, pending_writes: 0, queued_writes: 472, local_sequence: 548470
It has been stuck like that for hours (no OOM situation for any pod, the cluster has declared itself healthy, and I can get read results from the API). I also tried tweaking --healthy-write-lag, but nothing is moving.
That's odd. Something is blocking the index queue from completing. Are you using joins by any chance?
No. I just updated to 28.0.rc37 after I found #2137 and the recommendation from @jasonbosco in https://github.com/typesense/typesense/issues/2137#issuecomment-2644332766, but so far the outcome is the same. (Actually, queued_writes on that node increased, started falling, and then got stuck at 158 instead of the earlier 472; that is still within --healthy-write-lag, so I guess it is not considered lagging.)
One way to figure out whether it's a clustering issue or an indexing/code issue (see the sketch below this list):
- take a backup of the entire data directory of that pod
- restore the data directory on a new machine / pod where it starts as a single-node cluster
If there is an issue with the indexing logic, it should still get stuck in single-node mode.
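For reference, the restore-and-run step could look roughly like this on a scratch machine. The paths and API key here are placeholders, and it assumes you have already copied the data directory off the pod (e.g. with kubectl cp):

```python
import shutil
import subprocess

# Hypothetical locations; adjust to wherever you copied the pod's data directory.
SRC_DATA_DIR = "/backups/search-stg-sts-2/typesense-data"
RESTORE_DIR = "/tmp/typesense-single-node"

# 1. Restore the backed-up data directory to a fresh location.
shutil.copytree(SRC_DATA_DIR, RESTORE_DIR)

# 2. Start Typesense against the restored directory without a --nodes file,
#    so it comes up as a standalone single-node cluster.
subprocess.run([
    "typesense-server",
    "--data-dir", RESTORE_DIR,
    "--api-key", "local-debug-key",
    "--api-port", "8108",
])
# Then watch the raft_server.cpp status lines: if queued_writes stays stuck
# here too, the problem is in the indexing path rather than clustering.
```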
Just ran Typesense as a single-node cluster, and I'm still getting stuck queued writes:
8065 queued writes > healthy read lag of 1000
E20250331 12:30:05.567620 166 raft_server.cpp:783] 8065 queued writes > healthy write lag of 500
I20250331 12:30:06.567806 166 raft_server.cpp:692] Term: 44, pending_queue: 0, last_index: 12331946, committed: 12331946, known_applied: 12331946, applying: 0, pending_writes: 0, queued_writes: 8065, local_sequence: 40417490
I20250331 12:30:06.567953 196 raft_server.h:60] Peer refresh succeeded!
E20250331 12:30:14.568861 166 raft_server.cpp:771] 8065 queued writes > healthy read lag of 1000
E20250331 12:30:14.568931 166 raft_server.cpp:783] 8065 queued writes > healthy write lag of 500
I20250331 12:30:16.569242 166 raft_server.cpp:692] Term: 44, pending_queue: 0, last_index: 12331946, committed: 12331946, known_applied: 12331946, applying: 0, pending_writes: 0, queued_writes: 8065, local_sequence: 40417490
I20250331 12:30:16.569413 200 raft_server.h:60] Peer refresh succeeded!
E20250331 12:30:23.570302 166 raft_server.cpp:771] 8065 queued writes > healthy read lag of 1000
E20250331 12:30:23.570364 166 raft_server.cpp:783] 8065 queued writes > healthy write lag of 500
Version used: 29.0.rc9
Would you be able to share this data directory with me? You can zip it and email it to [email protected] or DM me on the community channel on Slack.
Unfortunately the directory is quite large, so instead we re-migrated from the ground up on version 28.0 on Railway instead of GKE, and it now works.
Sorry I couldn't be of more help.