
How can we determine when replica data between StatefulSets (or Deployments) is fully synchronized?

vctqs1anz opened this issue 10 months ago · 9 comments

For certain reasons, we need to perform a Blue-Green deployment.

So I'm reaching out to ask: is there a way to detect when the replicas across nodes are fully synchronized and ready before scaling down either the blue or the green stage?

vctqs1anz · Mar 21 '25

One way to confirm is by checking the logs, which will contain a status line like this:

 Term: 3, pending_queue: 0, last_index: 13, committed: 13, known_applied: 13, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 44941

When queued_writes is 0 and last_index matches known_applied, the log is fully caught up.
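For a blue-green rollout, this check can be automated by parsing that status line from each pod's logs (e.g. via kubectl logs). A minimal sketch, assuming the raft_server.cpp log format quoted above stays stable (the regex and helper below are illustrative, not an official API):

```python
import re

# Field names taken from the raft status line quoted above; the pattern
# is an assumption based on that single sample, not a stable interface.
RAFT_LINE = re.compile(
    r"Term: (?P<term>\d+), pending_queue: (?P<pending_queue>\d+), "
    r"last_index: (?P<last_index>\d+), committed: (?P<committed>\d+), "
    r"known_applied: (?P<known_applied>\d+), applying: (?P<applying>\d+), "
    r"pending_writes: (?P<pending_writes>\d+), queued_writes: (?P<queued_writes>\d+)"
)

def is_caught_up(log_line: str) -> bool | None:
    """Return True if the node looks fully caught up, False if not,
    and None if the line is not a raft status line."""
    m = RAFT_LINE.search(log_line)
    if m is None:
        return None
    return (int(m["queued_writes"]) == 0
            and int(m["last_index"]) == int(m["known_applied"]))
```

Only scale down the old stage once the latest status line from every node reports caught up.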

kishorenc · Mar 25 '25

Ohh, that is really nice. Thanks a lot, @kishorenc!

vctqs1anz · Mar 28 '25

I have a case where queued_writes is not zero but last_index matches known_applied. What should I assume from this?

search-stg-sts-2 typesense I20250331 10:51:43.592514 167 raft_server.cpp:706] Term: 22, pending_queue: 0, last_index: 8561, committed: 8561, known_applied: 8561, applying: 0, pending_writes: 0, queued_writes: 472, local_sequence: 548470

It has been stuck like that for hours (no OOM situation for any pod, the cluster has declared itself healthy, and I can get read results from the API). I also poked at --healthy-write-lag, but nothing is moving.

akyriako · Mar 31 '25

That's odd. Something is blocking the index queue from completing. Are you using joins by any chance?

kishorenc · Mar 31 '25

No. I just updated to 28.0.rc37 after I found #2137 and the recommendation from @jasonbosco (https://github.com/typesense/typesense/issues/2137#issuecomment-2644332766), but so far the outcome is the same. (Actually, queued_writes on that node increased, then started falling, but got stuck at 158, where it was 472 before; that is still within --healthy-write-lag, so I guess it is not considered lagging.)

akyriako · Mar 31 '25

One way to figure out whether it's a clustering issue or an indexing/code issue:

  1. Take a backup of the entire data directory of that pod.
  2. Restore the data directory on a new machine/pod, where it starts as a single-node cluster.

If there is an issue with the indexing logic, it should still get stuck in single-node mode.
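A minimal sketch of those two steps, assuming the data directory is mounted at /var/lib/typesense (the paths, the placeholder API key, and the --reset-peers-on-error flag are illustrative assumptions, not a prescribed recipe):

```python
import subprocess
import tarfile

DATA_DIR = "/var/lib/typesense"       # assumption: the pod's data-dir mount
ARCHIVE = "/tmp/typesense-data.tar.gz"

def backup_data_dir() -> None:
    """Step 1 (on the affected pod): archive the whole data directory.
    Do this while the server is stopped so the files are consistent."""
    with tarfile.open(ARCHIVE, "w:gz") as tar:
        tar.add(DATA_DIR, arcname="typesense-data")

def restore_single_node() -> None:
    """Step 2 (on a fresh machine/pod): unpack the archive, then start
    typesense-server with no --nodes peer list so it boots as a
    single-node cluster."""
    with tarfile.open(ARCHIVE, "r:gz") as tar:
        tar.extractall("/restored")
    subprocess.run([
        "typesense-server",
        "--data-dir=/restored/typesense-data",
        "--api-key=local-debug-key",     # placeholder key
        "--reset-peers-on-error",        # assumption: discards stale peer state
    ], check=True)
```

If queued_writes stays stuck even in this isolated setup, the problem lies in the indexing path rather than in clustering.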

kishorenc · Mar 31 '25

> One way to figure out whether it's a clustering issue or an indexing/code issue:
>
> 1. Take a backup of the entire data directory of that pod.
> 2. Restore the data directory on a new machine/pod, where it starts as a single-node cluster.
>
> If there is an issue with the indexing logic, it should still get stuck in single-node mode.

I just ran Typesense as a single-node cluster, and I'm getting stuck queued writes:

8065 queued writes > healthy read lag of 1000
E20250331 12:30:05.567620   166 raft_server.cpp:783] 8065 queued writes > healthy write lag of 500
I20250331 12:30:06.567806   166 raft_server.cpp:692] Term: 44, pending_queue: 0, last_index: 12331946, committed: 12331946, known_applied: 12331946, applying: 0, pending_writes: 0, queued_writes: 8065, local_sequence: 40417490
I20250331 12:30:06.567953   196 raft_server.h:60] Peer refresh succeeded!
E20250331 12:30:14.568861   166 raft_server.cpp:771] 8065 queued writes > healthy read lag of 1000
E20250331 12:30:14.568931   166 raft_server.cpp:783] 8065 queued writes > healthy write lag of 500
I20250331 12:30:16.569242   166 raft_server.cpp:692] Term: 44, pending_queue: 0, last_index: 12331946, committed: 12331946, known_applied: 12331946, applying: 0, pending_writes: 0, queued_writes: 8065, local_sequence: 40417490
I20250331 12:30:16.569413   200 raft_server.h:60] Peer refresh succeeded!
E20250331 12:30:23.570302   166 raft_server.cpp:771] 8065 queued writes > healthy read lag of 1000
E20250331 12:30:23.570364   166 raft_server.cpp:783] 8065 queued writes > healthy write lag of 500

Version used: 29.0.rc9
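For context on the error lines above: the 1000 and 500 thresholds are the --healthy-read-lag and --healthy-write-lag server flags, and when queued_writes exceeds them the node reports itself unhealthy. That suggests a complementary readiness check for the blue-green question that started this thread: poll the node's /health endpoint. A minimal sketch, assuming the default API port 8108:

```python
import json
import time
import urllib.request

HEALTH_URL = "http://localhost:8108/health"   # assumption: default port

def wait_until_healthy(timeout_s: float = 300.0) -> bool:
    """Poll /health until the node reports ok: true, or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                if json.load(resp).get("ok") is True:
                    return True
        except OSError:
            pass  # node unreachable or still starting; keep polling
        time.sleep(5)
    return False
```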

Aduomas · Mar 31 '25

Would you be able to share this data directory with me? You can zip it and email it to [email protected], or DM me on the community channel on Slack.

kishorenc · Mar 31 '25

Unfortunately the directory is quite large, so we re-migrated from the ground up on version 28.0 on Railway instead of GKE, and it now works.

Sorry, I was not of much help.

Aduomas · Apr 01 '25