raft: Lazy flush mode for slow drives
Redpanda takes data loss seriously, so it flushes data to disk per message, which is the right default for many use cases. The challenge is that when messages are small and customers are running on HDDs, flushing every message exhausts the drive's IOPS budget very quickly, resulting in very low throughput (MB/s).
In clustered deployments, Redpanda could introduce a mode in which it acks the client once a write lands in memory (or the OS page cache) on a quorum of replicas. This would alleviate the flush bottleneck while providing even lower latency and higher throughput. As long as Redpanda can reconcile the data afterwards and the whole quorum doesn't crash at the same time, customers should still have their data intact.
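Rough sketch of the idea as a toy Python model (hypothetical — class and function names here are made up for illustration, this is not Redpanda's actual raft code): the client is acked once a majority of replicas hold the entry in memory, and fsync happens later in the background.

```python
# Toy model: quorum-acked, lazily-flushed writes (hypothetical,
# not Redpanda's actual raft implementation).

class Replica:
    def __init__(self):
        self.in_memory = []      # buffered but not yet durable
        self.flushed_up_to = 0   # entries made durable by fsync

    def append(self, entry):
        self.in_memory.append(entry)  # lands in memory/page cache only

    def fsync(self):
        self.flushed_up_to = len(self.in_memory)  # simulated flush


def quorum_ack(replicas, entry):
    """Ack the client once a majority has the entry buffered in memory."""
    acks = 0
    for r in replicas:
        r.append(entry)
        acks += 1                 # replica acks without waiting for fsync
        if acks >= len(replicas) // 2 + 1:
            return True           # client ack before any fsync happened
    return False


replicas = [Replica() for _ in range(3)]
assert quorum_ack(replicas, b"msg-1")               # acked...
assert all(r.flushed_up_to == 0 for r in replicas)  # ...with nothing durable
for r in replicas:
    r.fsync()                                       # background flush catches up
```

The window of risk is exactly the gap between the two assertions: entries that were acked but not yet fsynced on a majority can be lost if the whole quorum loses power at once.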
@rystsov + @mmaslankaprv this should be helpful: raft quorum write ack in memory. This probably has a bunch of downstream effects, like snapshotting and probably keeping track of yet one more offset. haha.
Different but related: #272 may help folks affected by this issue by enabling larger write sizes than the default 16kb. I haven't tested performance on spinning disk, but one would expect that a larger IO size (perhaps as large as 1MB?) would be helpful.
Implementing a lazy-flush persistence mode is probably worthwhile both for slow HDDs and for people running on cheap virtualized block storage (e.g. cheaper EBS tiers) who are processing ephemeral data where occasional loss is acceptable.
Documentation will be really important to make sure that people do not enable it without understanding the data safety issues. We should make sure the setting has a name that makes it clear it's a lower grade of persistence, like "lazy_writes", and avoid a seemingly innocuous name like log.flush.ms.
The lazy flush mode should only apply to end user messages -- we should always use strict data safety for all configuration log entries and the controller log. When users ask for lazy flushes, they're expecting their own topics to occasionally time-travel backwards on crash, they're not expecting Redpanda's own configuration to potentially go backwards.
Are you thinking time-based flushing or some sort of best-effort / low-priority flushing? Currently we don't track any dirty state in our batch cache, and the equivalent of a block cache is the background futures in the segment appender. Working on something like this might be a convenient time to move to a batch cache v2 that unifies these caches and also provides coherent dirty reads.
@jcsp how does this differ from in-memory acks
Different names for the same logical result of acking before the data is persistent: we can implement it at different levels -- the "in memory" might be userspace memory, page cache memory, or drive buffer memory.
I'm thinking there may be some flow control benefit to still having requests wait for their disk write (but not the fsync), to avoid accepting writes faster than we can send them to disk, where the user would unexpectedly hit a latency "wall" once our in-memory buffer fills up.
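To illustrate the flow-control idea in a toy model (hypothetical names, nothing here is real Redpanda code): acks wait until the write is handed to the "disk", but never for fsync, so the amount of unflushed data stays bounded instead of piling up until the user hits the latency wall.

```python
# Hypothetical back-pressure sketch: bound bytes that are written
# but not yet fsynced, instead of buffering without limit.

class DirtyBuffer:
    def __init__(self, cap_bytes):
        self.cap = cap_bytes
        self.dirty = 0           # bytes written but not yet fsynced

    def try_write(self, nbytes):
        """Return False when the caller must stall (back-pressure)."""
        if self.dirty + nbytes > self.cap:
            return False
        self.dirty += nbytes     # write() completed, no fsync yet
        return True

    def flushed(self, nbytes):
        self.dirty -= nbytes     # a background fsync drained the buffer


buf = DirtyBuffer(cap_bytes=1024)
assert buf.try_write(800)        # accepted immediately
assert not buf.try_write(800)    # stalled: would exceed the dirty cap
buf.flushed(800)                 # fsync completes in the background
assert buf.try_write(800)        # back-pressure released
```

The point is that the producer sees gradually increasing push-back as flushes fall behind, rather than full speed followed by a sudden stall.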
What about flushing in batches (by size, time, or against an IOPS limit), and then acking the requests when the batch flush is done? It may not increase latency much (especially as the IOPS limit is approached), since the I/Os won't queue up in the kernel.
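A toy sketch of that batching approach (hypothetical model; a real implementation would also flush on a timer and against an IOPS budget, not just a size threshold): accumulate writes, issue one fsync when the threshold is hit, then ack every request in the batch.

```python
# Hypothetical batcher: one fsync amortized over many writes,
# requests acked only after their batch's flush completes.

class FlushBatcher:
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.pending_acks = []   # ack callbacks for requests in this batch
        self.bytes = 0
        self.fsyncs = 0          # count of (simulated) fsync calls

    def write(self, nbytes, ack):
        self.pending_acks.append(ack)
        self.bytes += nbytes
        if self.bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        self.fsyncs += 1         # one fsync for the whole batch
        for ack in self.pending_acks:
            ack()                # ack only after the batch is durable
        self.pending_acks.clear()
        self.bytes = 0


acked = []
b = FlushBatcher(max_bytes=1000)
for i in range(10):
    b.write(200, lambda i=i: acked.append(i))
assert b.fsyncs == 2             # 10 writes cost 2 fsyncs, not 10
assert acked == list(range(10))  # every request acked after its batch
```

Unlike the pure lazy mode, acked data here is always durable; the trade is a bounded amount of added ack latency while a batch fills.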
That's a really good point about hitting the latency wall. Let's make sure we add you to the reviewers of the PR!