[CF-557] Changelog updates should wait for transaction commit

Open erikdw opened this issue 2 years ago • 0 comments

Description

Ensure that we don't propagate changelog updates until after a transaction is committed.

Prior to this fix, every DML statement processed (even within a transaction) would immediately lead to a changelog event being written out. That's problematic when client applications see the changelog event occur, since the clients are racing the transaction commit when that event prompts them to read the LDB for the associated row.

We avoid this by detecting that a transaction has begun. When in a transaction we buffer the changes and wait for the transaction to end before propagating them to the changelog.

Details

We heavily rely on the synchronous nature of this code for the buffering that we are adding. i.e., we don't need any locking since there is no concurrency.
Add comments about how the callback change propagation works (see ApplyDMLStatement in ldb_callback_writer.go).
- On a related note, add a comment above ApplyDMLStatement in ldb_writer_with_changelog.go about how that code isn't actually used.
Pin the Dockerfile's base image to avoid upstream libmusl issues that broke the build.
Add 2 fields to changelog entries, to help with debugging if we ever see any issues around the changelog:
- ledgerSeq: ledger sequence corresponding to the DML statement that led to this changelog entry
- tx: boolean for whether the changelog entry is from a transaction
Don't propagate changes for internal bookkeeping tables that don't appear in the changelog -- we used to prevent the changes from getting to the changelog by only propagating changes for tables with names like family___table.
Add a new unit test that ensures we are getting the number of callbacks we expect, and the number of updates in each callback we expect.

Minor fixes

Remove unused NewChangelogEntry function
Fix typo in config knob name max-healty-latency -> max-healthy-latency
Fix comment typos

TODO

[ ] Figure out whether the gauge metric makes sense for accumulated changes on each reflector. We might just want an ever-increasing counter?
[ ] Update unit-tests to have tables where REPLACE INTO results in 2 SQLite changes like it does for so many of the real ledger statements we have.
- i.e., create a 2nd column in the table, and have the first one be a key, and "replace into" using the same key with a different value.
[ ] Subsequent PR to propagate the change operation (insert or delete). Code was already written but excised to keep this a bit tighter.

Testing

Testing completed successfully if all of these are checked:

[x] Tested locally by running the reflector connected to stage's Ctldb, and running both a tail -F /var/spool/ctlstore/changelog and a simple program that uses the Event Iterator.
[ ] Test in stage by running the update reflector on the node where debezium-service runs, since that will mean the QARows checker is using an LDB being updated by the new code, and that should surface if there's any fundamental bug with updating the LDB since it compares the local LDB against the SoR.
[ ] Test in stage by running the updated reflector on all nodes of the core pool
[ ] Creating some charts in Datadog and check that the metrics we are issuing are what we want.
[ ] Test in stage in centrifuge-destinations pool, so that ctlgate that uses the changelog will be tested.

Jan 11 '24 09:01 erikdw