cockroach icon indicating copy to clipboard operation
cockroach copied to clipboard

kv/bulk: enable async flushing

Open jeffswenson opened this issue 7 months ago • 3 comments

This change re-enables async flushing in the SST batcher. Async flushing is a throughput win for large (1 Tib+) out of order writers. This occurs during index builds, out of order imports, and PCR.

Previously, async flushing was a special case in the batcher. This change reworks async flushing so that all flushes are async. This ensures that context handling and error handling are consistent. This should reduce the probability of an async only bug in the sst writer that is hard to exercise.

Fixes: https://github.com/cockroachdb/cockroach/issues/146828

jeffswenson avatar Jun 16 '25 17:06 jeffswenson

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] avatar Jun 16 '25 17:06 blathers-crl[bot]

This change is Reviewable

cockroach-teamcity avatar Jun 16 '25 17:06 cockroach-teamcity

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] avatar Jun 16 '25 19:06 blathers-crl[bot]

Thanks for the review!

bors r+

jeffswenson avatar Jun 20 '25 16:06 jeffswenson

Build failed:

craig[bot] avatar Jun 20 '25 16:06 craig[bot]

There in .Close() that was caught by a unit test. If Close was called while there were in-flight flushes, it would wait for the flushes to complete, but it leaked the flush span. This was fixed by calling syncFlush() in .Close() instead of waiting for the err group.

I also added a regression test (TestSSTBatcherCloseWithoutFlush) that tests what happens when .Close() is called with in flight flushes. When writing the test I discovered a subtle difference between test clusters and test servers. Test clusters include validation to ensure no spans are leaked. Test servers do not include this validation.

jeffswenson avatar Jun 20 '25 18:06 jeffswenson

bors r+

jeffswenson avatar Jun 20 '25 19:06 jeffswenson