cw-orchestrator icon indicating copy to clipboard operation
cw-orchestrator copied to clipboard

Fix undefined behavior on simultaneous write in daemon state file

Open Buckram123 opened this issue 2 years ago • 14 comments

Simultaneous daemon state file writes currently are not handled. This PR aims to handle such cases. So there are 2 scenarios when simultaneous write can happen:

  1. Another program that uses daemon state in the process of writing in this file
  2. Another daemon of this program tries to write into a file

How this PR tries to fix it: For the 1.

  • We lock a file with file-lock, which ensures another program won't be able to write in this file (We don't lock read-only daemonstates)

For the 2.

  • We create global state of locked files by this program with once_cell::sync::Lazy.
  • We disallow other daemons from any thread to access this lock, unless it's cloned directly from the daemon, who originally locked it.
  • As soon as every daemon dropped state we unlock file

Beyond safety this PR adds few optimizations:

  • Write json to a file only when all of the state clones done with using this state file
  • Read from a shared state instead of file, when it's in a locked state

Buckram123 avatar Feb 12 '24 13:02 Buckram123

Deploying cw-orchestrator with  Cloudflare Pages  Cloudflare Pages

Latest commit: 7eae16a
Status:⚡️  Build in progress...

View logs

It also can corrupt state, in such cases:

{
  "juno": {
    "juno-1": {
      "code_ids": {},
      "test": {
        "test0": "test-0",      }  "te}
  :
"test-3"
      }
    }
  }
}

Buckram123 avatar Feb 12 '24 13:02 Buckram123

Codecov Report

Attention: Patch coverage is 80.15267% with 52 lines in your changes missing coverage. Please review.

Project coverage is 56.6%. Comparing base (9c43514) to head (a795bbe).

Additional details and impacted files
Files Coverage Δ
cw-orch-daemon/src/channel.rs 83.7% <100.0%> (+0.7%) :arrow_up:
cw-orch-daemon/src/error.rs 57.1% <ø> (ø)
cw-orch-daemon/src/queriers/cosmwasm.rs 51.5% <100.0%> (ø)
cw-orch-daemon/src/sync/core.rs 78.1% <100.0%> (+3.1%) :arrow_up:
cw-orch-daemon/src/tx_builder.rs 71.2% <100.0%> (ø)
packages/clone-testing/src/state.rs 100.0% <100.0%> (ø)
cw-orch-daemon/src/sync/builder.rs 88.6% <63.6%> (-5.6%) :arrow_down:
cw-orch-daemon/src/json_lock.rs 92.1% <92.1%> (ø)
cw-orch-daemon/src/builder.rs 69.4% <78.5%> (+1.6%) :arrow_up:
cw-orch-daemon/src/core.rs 58.6% <33.3%> (+0.9%) :arrow_up:
... and 3 more

... and 1 file with indirect coverage changes

codecov[bot] avatar Feb 12 '24 13:02 codecov[bot]

This is particularly crucial when using daemons that interact with different chains !

Kayanski avatar Feb 12 '24 13:02 Kayanski

This is particularly crucial when using daemons that interact with different chains !

This is also needed when interacting with 1 chain

Buckram123 avatar Feb 12 '24 13:02 Buckram123

It also can corrupt state, in such cases

Update: seems like it was because of mid-way aborted thread.

Buckram123 avatar Feb 12 '24 13:02 Buckram123

@Kayanski wrote a quick proof of concept implementation, that don't break any existing test as well as succeeds the new multi-threaded ones

Buckram123 avatar Feb 13 '24 12:02 Buckram123

I'm a bit confused by the global Mutex. It seems like you're using a locking library (lock-file) but then added your own locking solution on top. Why did you do that? Could you explain your implementation (preferably in the description of the PR)?

CyberHoward avatar Feb 13 '24 20:02 CyberHoward

I'm a bit confused by the global Mutex. It seems like you're using a locking library (lock-file) but then added your own locking solution on top. Why did you do that? Could you explain your implementation (preferably in the description of the PR)?

Updated PR description! This PR originally had only failing tests with missleading error "EOF while parsing a value", which you can find in cw-orch-daemon/tests/daemon_state.rs, to discuss how we want to handle it.

Buckram123 avatar Feb 14 '24 11:02 Buckram123

See clippy err

CyberHoward avatar Apr 09 '24 09:04 CyberHoward

So if I understand this PR correctly the JSON state file will be locked when the Daemon is constructed? Any other threads that attempt to construct a Daemon will be blocked until this original Daemon is dropped?

  • File locked after first daemon points to the file
  • When using state daemon locks json and unlocks after use
    • Other threads of daemons will be blocked for write/read until previous finished his action
  • As soon as all of the daemons dropped "pointer" to state file we write result to file
  • If daemon tries to use state file that's already locked by another process it will panic, to prevent hang (see tests)

Buckram123 avatar Apr 22 '24 08:04 Buckram123

So if I understand this PR correctly the JSON state file will be locked when the Daemon is constructed? Any other threads that attempt to construct a Daemon will be blocked until this original Daemon is dropped?

  • File locked after first daemon points to the file

  • When using state daemon locks json and unlocks after use

    • Other threads of daemons will be blocked for write/read until previous finished his action
  • As soon as all of the daemons dropped "pointer" to state file we write result to file

  • If daemon tries to use state file that's already locked by another process it will panic, to prevent hang (see tests)

Referring to our discussion on Discord. Do we all agree that the second daemon that attempts to access the file should throw an err instead of waiting for the lock to drop?

CyberHoward avatar Apr 23 '24 20:04 CyberHoward

So if I understand this PR correctly the JSON state file will be locked when the Daemon is constructed? Any other threads that attempt to construct a Daemon will be blocked until this original Daemon is dropped?

  • File locked after first daemon points to the file

  • When using state daemon locks json and unlocks after use

    • Other threads of daemons will be blocked for write/read until previous finished his action
  • As soon as all of the daemons dropped "pointer" to state file we write result to file

  • If daemon tries to use state file that's already locked by another process it will panic, to prevent hang (see tests)

Referring to our discussion on Discord. Do we all agree that the second daemon that attempts to access the file should throw an err instead of waiting for the lock to drop?

But other Daemons can still use this state file if only it's clone of original daemon, right?

Buckram123 avatar Apr 23 '24 21:04 Buckram123

So if I understand this PR correctly the JSON state file will be locked when the Daemon is constructed? Any other threads that attempt to construct a Daemon will be blocked until this original Daemon is dropped?

  • File locked after first daemon points to the file

  • When using state daemon locks json and unlocks after use

    • Other threads of daemons will be blocked for write/read until previous finished his action
  • As soon as all of the daemons dropped "pointer" to state file we write result to file

  • If daemon tries to use state file that's already locked by another process it will panic, to prevent hang (see tests)

Referring to our discussion on Discord. Do we all agree that the second daemon that attempts to access the file should throw an err instead of waiting for the lock to drop?

But other Daemons can still use this state file if only it's clone of original daemon, right?

Yes! So the ownership check should be performed on daemon(state) construction. Clones shouldn't check.

CyberHoward avatar Apr 24 '24 08:04 CyberHoward

@Kayanski @CyberHoward On poisoned Mutex state should we allow saving state in case user called manually force_write? Alternatively we can add parameter ignore_poison to give more control

Buckram123 avatar May 22 '24 13:05 Buckram123

@Kayanski @CyberHoward On poisoned Mutex state should we allow saving state in case user called manually force_write? Alternatively we can add parameter ignore_poison to give more control

Let's fail if there's a poison, we can develop the feature if someone asks later

Kayanski avatar Jun 04 '24 15:06 Kayanski