wal icon indicating copy to clipboard operation
wal copied to clipboard

fix: support concurrent read/write from same active segment.

Open ankur-anand opened this issue 1 year ago • 6 comments

Currently, wal.Read supports concurrent reads without data races only when using the exact chunk position, as it assumes that WAL is actively being written. https://github.com/rosedblabs/wal/blob/9f1a618784955f220a1f62842d21bd3ecc72e51d/wal.go#L426

However, if parallel/concurrent log replay is needed, the active segment modifies certain fields in the segment struct during writes. These same fields are also accessed by reader instances, leading to potential data races:

currentBlockNumber uint32
currentBlockSize   uint32
closed             bool

To enable safe concurrent reads while the WAL is being written, this PR updates these fields to use atomic operations:

currentBlockNumber atomic.Uint32
currentBlockSize   atomic.Uint32
closed             atomic.Bool

By making these changes, the WAL will support parallel log replay from different readers without data races, ensuring consistency and thread safety.

Changes in this PR • Replaced uint32 and bool fields with their atomic counterparts. • Ensured all accesses to these fields use atomic operations.

ankur-anand avatar Feb 07 '25 13:02 ankur-anand

Some of the concurrent race conditions that were observed.

==================
WARNING: DATA RACE
Write at 0x00c00003e978 by goroutine 54:
  github.com/rosedblabs/wal.(*segment).Close()
      /Users/ankur/go/pkg/mod/github.com/rosedblabs/[email protected]/segment.go:172 +0x448
  github.com/rosedblabs/wal.(*WAL).Close()
      /Users/ankur/go/pkg/mod/github.com/rosedblabs/[email protected]/wal.go:442 +0x40c
      
      
Previous read at 0x00c00003e978 by goroutine 80:
  github.com/rosedblabs/wal.(*segmentReader).Next()
      /Users/ankur/go/pkg/mod/github.com/rosedblabs/[email protected]/segment.go:481 +0x48
  github.com/rosedblabs/wal.(*Reader).Next()
      /Users/ankur/go/pkg/mod/github.com/rosedblabs/[email protected]/wal.go:262 +0xb4
==================
WARNING: DATA RACE
Write at 0x00c00003e978 by goroutine 54:
github.com/rosedblabs/wal.(*segment).Close()
   /Users/ankur/go/pkg/mod/github.com/rosedblabs/[email protected]/segment.go:172 +0x448
github.com/rosedblabs/wal.(*WAL).Close()
   /Users/ankur/go/pkg/mod/github.com/rosedblabs/[email protected]/wal.go:442 +0x40c
  
Previous read at 0x00c00003e978 by goroutine 80:
github.com/rosedblabs/wal.(*segmentReader).Next()
   /Users/ankur/go/pkg/mod/github.com/rosedblabs/[email protected]/segment.go:481 +0x48
github.com/rosedblabs/wal.(*Reader).Next()

I agree that the closed variable data race,part will not happen until Wal.Close is called concurrently without Releasing the reader, but should be protected from a data-race condition.

ankur-anand avatar Feb 07 '25 13:02 ankur-anand

\cc @roseduan

ankur-anand avatar Feb 07 '25 16:02 ankur-anand

\cc @roseduan

Thanks, I will recheck this later.

roseduan avatar Feb 10 '25 13:02 roseduan

I understand the purpose of this PR:

  1. When WAL instance executes NewReader, it adds the active segment file to Readers without setting a new active segment file
  2. Since Next is not locked, it may concurrently execute readInternal method on the active segment file
  3. The readInternal method accesses fields like currentBlockNumber, currentBlockSize, and closed, while Write method may modify these fields simultaneously, potentially causing data races

XiXi-2024 avatar Feb 17 '25 07:02 XiXi-2024

I believe changing to atomic types won't solve the concurrency issues. For the closed field, it may still lead to reading from a closed file, and since currentBlockNumber and currentBlockSize fields need to be updated synchronously, concurrent access would still cause race conditions.

XiXi-2024 avatar Feb 17 '25 07:02 XiXi-2024

@XiXi-2024

I believe changing to atomic types won't solve the concurrency issues. For the closed field, it may still lead to reading from a closed file, and since currentBlockNumber and currentBlockSize fields need to be updated synchronously, concurrent access would still cause race conditions.

You're right. But all the descriptors were closed, only when Wal is closed. So most of reader should be closed by the application owner before calling the wal.Close.

But then again this leaves the library not dealing with it, which should be avoided. I've updated the PR to cover those cases.

ankur-anand avatar Feb 20 '25 14:02 ankur-anand