sunlight icon indicating copy to clipboard operation
sunlight copied to clipboard

ctlog: test recover behavior with an asynchronous slow backend

Open jellevandenhooff opened this issue 1 year ago • 0 comments

Test recovery behavior when a backend sometimes freezes for a while before writing a file.

A sequencing failure with a delayed earlier write can cause log recovery to get stuck:

  • Sequencing starts.
  • The sequencer starts writing tile file "A" with contents "A1" (asynchronously).
  • The sequencer times out.
  • Sequencing restarts.
  • The sequencer tries to write tile file "A" with contents "A2" and succeeds.
  • The earlier write to "A" succeeds with contents "A1" replacing "A2".
  • The sequencer updates the lock to a tree with hash expecting "A" to contain "A2".
  • The log is in a bad state because "A" contains "A1" but should contain "A2".

In the log this looks as follows:

testlog_test.go:79: time=2024-03-16T18:20:29.746-07:00 level=ERROR source=ctlog.go:671 msg="pool sequencing failed" old_tree_size=43 entries=1 err="couldn't upload checkpoint to object storage: context deadline exceeded"
testlog_test.go:79: time=2024-03-16T18:20:29.746-07:00 level=WARN source=ctlog.go:213 msg="checkpoint in object storage is older than lock checkpoint" old_size=43 size=44
testlog_test.go:79: time=2024-03-16T18:20:29.892-07:00 level=ERROR source=ctlog.go:671 msg="pool sequencing failed" old_tree_size=102 entries=1 err="couldn't upload checkpoint to object storage: context deadline exceeded"
testlog_test.go:79: time=2024-03-16T18:20:29.892-07:00 level=WARN source=ctlog.go:213 msg="checkpoint in object storage is older than lock checkpoint" old_size=44 size=103
testlog_test.go:79: time=2024-03-16T18:20:30.031-07:00 level=ERROR source=ctlog.go:671 msg="pool sequencing failed" old_tree_size=142 entries=1 err="couldn't upload a tile: context deadline exceeded"
testlog_test.go:79: time=2024-03-16T18:20:30.031-07:00 level=WARN source=ctlog.go:213 msg="checkpoint in object storage is older than lock checkpoint" old_size=103 size=142
testlog_test.go:79: time=2024-03-16T18:20:30.134-07:00 level=ERROR source=ctlog.go:671 msg="pool sequencing failed" old_tree_size=143 entries=1 err="couldn't upload a tile: context deadline exceeded"
testlog_test.go:107: couldn't fetch right edge tiles: downloaded inconsistent tile

jellevandenhooff avatar Mar 17 '24 01:03 jellevandenhooff