
How can we determine when replica data between StatefulSets (or Deployments) is fully synchronized?

vctqs1anz opened this issue 10 months ago · 9 comments

For certain reasons, we need to perform a Blue-Green deployment.

So I'm reaching out to ask: is there a way to detect when the replicas across nodes are fully synchronized and ready before scaling down either the blue or the green stage?

vctqs1anz · Mar 21 '25

One way to confirm is by checking the logs, which will contain a status line like this:

 Term: 3, pending_queue: 0, last_index: 13, committed: 13, known_applied: 13, applying: 0, pending_writes: 0, queued_writes: 0, local_sequence: 44941

When queued_writes is 0 and last_index matches known_applied, the log is fully caught up.
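For a blue-green rollout, this check can be automated by parsing that status line from each pod's logs (e.g. via kubectl logs). A minimal sketch, assuming the raft_server.cpp log format quoted above stays stable (the regex and helper below are illustrative, not an official API):

```python
import re

# Field names taken from the raft status line quoted above; the pattern
# is an assumption based on that single sample, not a stable interface.
RAFT_LINE = re.compile(
    r"Term: (?P<term>\d+), pending_queue: (?P<pending_queue>\d+), "
    r"last_index: (?P<last_index>\d+), committed: (?P<committed>\d+), "
    r"known_applied: (?P<known_applied>\d+), applying: (?P<applying>\d+), "
    r"pending_writes: (?P<pending_writes>\d+), queued_writes: (?P<queued_writes>\d+)"
)

def is_caught_up(log_line: str) -> bool | None:
    """Return True if the node looks fully caught up, False if not,
    and None if the line is not a raft status line."""
    m = RAFT_LINE.search(log_line)
    if m is None:
        return None
    return (int(m["queued_writes"]) == 0
            and int(m["last_index"]) == int(m["known_applied"]))
```

Only scale down the old stage once the latest status line from every node reports caught up.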

kishorenc · Mar 25 '25

Ohh, that is really nice. Thanks a lot, @kishorenc!

vctqs1anz · Mar 28 '25

I have a case where queued_writes is not zero but last_index matches known_applied. What should I assume from this?

search-stg-sts-2 typesense I20250331 10:51:43.592514 167 raft_server.cpp:706] Term: 22, pending_queue: 0, last_index: 8561, committed: 8561, known_applied: 8561, applying: 0, pending_writes: 0, queued_writes: 472, local_sequence: 548470

It has been stuck like that for hours (no OOM situation for any pod, the cluster has declared itself healthy, and I can get read results from the API). I also poked at --healthy-write-lag, but nothing is moving.

akyriako · Mar 31 '25

That's odd. Something is blocking the index queue from completing. Are you using joins by any chance?

kishorenc · Mar 31 '25

No. I just updated to 28.0.rc37 after I found #2137 and the recommendation from @jasonbosco (https://github.com/typesense/typesense/issues/2137#issuecomment-2644332766), but so far the outcome is the same. (Actually, queued_writes on that node increased, then started falling, but got stuck at 158, where it was 472 before; that is still within --healthy-write-lag, so I guess it is not considered lagging.)

akyriako · Mar 31 '25

One way to figure out whether it's a clustering issue or an indexing/code issue:

  1. Take a backup of the entire data directory of that pod.
  2. Restore the data directory on a new machine/pod, where it starts as a single-node cluster.

If there is an issue with the indexing logic, it should still get stuck in single-node mode.
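A minimal sketch of those two steps, assuming the data directory is mounted at /var/lib/typesense (the paths, the placeholder API key, and the --reset-peers-on-error flag are illustrative assumptions, not a prescribed recipe):

```python
import subprocess
import tarfile

DATA_DIR = "/var/lib/typesense"       # assumption: the pod's data-dir mount
ARCHIVE = "/tmp/typesense-data.tar.gz"

def backup_data_dir() -> None:
    """Step 1 (on the affected pod): archive the whole data directory.
    Do this while the server is stopped so the files are consistent."""
    with tarfile.open(ARCHIVE, "w:gz") as tar:
        tar.add(DATA_DIR, arcname="typesense-data")

def restore_single_node() -> None:
    """Step 2 (on a fresh machine/pod): unpack the archive, then start
    typesense-server with no --nodes peer list so it boots as a
    single-node cluster."""
    with tarfile.open(ARCHIVE, "r:gz") as tar:
        tar.extractall("/restored")
    subprocess.run([
        "typesense-server",
        "--data-dir=/restored/typesense-data",
        "--api-key=local-debug-key",     # placeholder key
        "--reset-peers-on-error",        # assumption: discards stale peer state
    ], check=True)
```

If queued_writes stays stuck even in this isolated setup, the problem lies in the indexing path rather than in clustering.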

kishorenc · Mar 31 '25

> One way to figure out whether it's a clustering issue or an indexing/code issue:
>
> 1. Take a backup of the entire data directory of that pod.
> 2. Restore the data directory on a new machine/pod, where it starts as a single-node cluster.
>
> If there is an issue with the indexing logic, it should still get stuck in single-node mode.

I just ran Typesense as a single-node cluster, and I'm getting stuck queued writes:

8065 queued writes > healthy read lag of 1000
E20250331 12:30:05.567620   166 raft_server.cpp:783] 8065 queued writes > healthy write lag of 500
I20250331 12:30:06.567806   166 raft_server.cpp:692] Term: 44, pending_queue: 0, last_index: 12331946, committed: 12331946, known_applied: 12331946, applying: 0, pending_writes: 0, queued_writes: 8065, local_sequence: 40417490
I20250331 12:30:06.567953   196 raft_server.h:60] Peer refresh succeeded!
E20250331 12:30:14.568861   166 raft_server.cpp:771] 8065 queued writes > healthy read lag of 1000
E20250331 12:30:14.568931   166 raft_server.cpp:783] 8065 queued writes > healthy write lag of 500
I20250331 12:30:16.569242   166 raft_server.cpp:692] Term: 44, pending_queue: 0, last_index: 12331946, committed: 12331946, known_applied: 12331946, applying: 0, pending_writes: 0, queued_writes: 8065, local_sequence: 40417490
I20250331 12:30:16.569413   200 raft_server.h:60] Peer refresh succeeded!
E20250331 12:30:23.570302   166 raft_server.cpp:771] 8065 queued writes > healthy read lag of 1000
E20250331 12:30:23.570364   166 raft_server.cpp:783] 8065 queued writes > healthy write lag of 500

Version used: 29.0.rc9
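For context on the error lines above: the 1000 and 500 thresholds are the --healthy-read-lag and --healthy-write-lag server flags, and when queued_writes exceeds them the node reports itself unhealthy. That suggests a complementary readiness check for the blue-green question that started this thread: poll the node's /health endpoint. A minimal sketch, assuming the default API port 8108:

```python
import json
import time
import urllib.request

HEALTH_URL = "http://localhost:8108/health"   # assumption: default port

def wait_until_healthy(timeout_s: float = 300.0) -> bool:
    """Poll /health until the node reports ok: true, or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                if json.load(resp).get("ok") is True:
                    return True
        except OSError:
            pass  # node unreachable or still starting; keep polling
        time.sleep(5)
    return False
```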

Aduomas · Mar 31 '25

Would you be able to share this data directory with me? You can zip it and email it to [email protected], or DM me on the community channel on Slack.

kishorenc · Mar 31 '25

Unfortunately the directory is quite large, so we re-migrated from the ground up on version 28.0 on Railway instead of GKE, and it now works.

Sorry, I was not of much help.

Aduomas · Apr 01 '25