ozone icon indicating copy to clipboard operation
ozone copied to clipboard

HDDS-10108. [hsync] Adopt RATIS-1994 to reduce hsync ops latency

Open smengcl opened this issue 2 years ago • 19 comments

What changes were proposed in this pull request?

Adopt the new Ratis API added in RATIS-1994 to (hopefully) reduce hsync latency.

Basically the idea is that ratis client send() call alone would suffice. And we could ditch watchForCommit in hsync (writeChunk + putBlock) path.

(!) Uses Ratis 3.1.0 snapshot build. Can be merged to feature branch.

Note:

  1. In order for failedServers list to be properly appended when a datanode lags behind in Ratis server watch() (in which case server throws NotReplicatedException), as checked by TestFailureHandlingByClient.testDatanodeExclusionWithMajorityCommit, significant adjustments has been made around sendRequest() and sendCommand().
  2. Datanodes raft.server.watch.timeout config has been lowered from 180s to 10s (Ratis default), as it has to be lower than raft.client.rpc.request.timeout (currently 60s in Ozone) for NotReplicatedException to be propagated

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10108

How was this patch tested?

  • Existing tests must all pass. Manual latency test pending.

smengcl avatar Jan 17 '24 00:01 smengcl

@smengcl , thanks for working on this!

TestBlockOutputStreamCorrectness can pass if we override watchForCommit(..) in CommitWatcher; see https://issues.apache.org/jira/secure/attachment/13066441/6014_idea.patch (Need to update TestCommitWatcher.testReleaseBuffersOnException)

szetszwo avatar Feb 02 '24 17:02 szetszwo

@smengcl , thanks for working on this!

TestBlockOutputStreamCorrectness can pass if we override watchForCommit(..) in CommitWatcher; see issues.apache.org/jira/secure/attachment/13066441/6014_idea.patch (Need to update TestCommitWatcher.testReleaseBuffersOnException)

Thanks @szetszwo for the help. It works!

smengcl avatar Feb 07 '24 19:02 smengcl

I noticed some refactoring (HDDS-10173) done in TestBlockOutputStreamCorrectness in master but not in the feature branch so I didn't change it at the moment to reduce merge conflict (test would still pass, the key is CommitWatcher).

smengcl avatar Feb 07 '24 19:02 smengcl

3 test failures remaining at this point, not including flaky ones:

  1. TestRandomKeyGenerator.testKeyLargerThan2GB -- somehow key write always fails with this large 2.01 GB key in this test. It can be reproduced locally. If I change the key size to 0.01 GB it passes. I suspect this has something to do with client write buffer management as well. But haven't quite figured out why.
  2. TestFailureHandlingByClient.testDatanodeExclusionWithMajorityCommit -- it fails because the test is expecting the datanode exclusion list to be populated with the datanode the test just shutdown. But it is not doing so because I modified watchForCommit to NOT actually do the Ratis request (remember? one major goal of this PR is to eliminate the second roundtrip). Now I'm trying to put the exclude datanode logic somewhere else while maintaining the overall client behavior.
  3. TestOzoneClientRetriesOnExceptions.testMaxRetriesByOzoneClient -- AssertionFailedError: Expected java.io.IOException to be thrown, but nothing was thrown looks to be the same type of issue as TestCommitWatcher#testReleaseBuffersOnException that I fixed already in 5c68ead09ca3c6235c998c338e9a64db957479fc. Maybe not.

smengcl avatar May 09 '24 14:05 smengcl

For 2: XceiverClientRatis#watchForCommit had this logic where it initially attempts ALL_COMMITTED, but when it failed to do a 3-way commit, it falls back to MAJORITY_COMMITTED, which is where it adds the failed datanode to the list.

XceiverClientRatis#sendCommandAsync might have to do exactly the same to keep the same behavior now that watchForCommit Ratis request is omitted. (Or is it worth keeping such behavior?)

smengcl avatar May 09 '24 23:05 smengcl

TestRandomKeyGenerator.testKeyLargerThan2GB -- somehow key write always fails with this large 2.01 GB key ...

2GB seems like a 32-bit signed int problem. We should try 1.01 GB.

szetszwo avatar May 10 '24 21:05 szetszwo