ozone HDDS-9844. [hsync] De-synchronize write APIs (v3)

What changes were proposed in this pull request?

Same as https://github.com/apache/ozone/pull/6413 but a third impl based on master branch. FYI. NOT ready for review.

This is a much cleaner impl than v2 (#6413) after a bit more understanding of what is going on in the client with buffer/future stuff. But the tests won't pass at this point.

cc @duongkame

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9844

How was this patch tested?

See #6413

Jun 05 '24 21:06 smengcl

Just want to put down what was discussed offline.

The logic of hsync() consists of the following.

Update the current buffer and internal counters in the OutputStream states.

Use the current buffer and internal counters to compose a writeChunk request.

Send the writeChunk request.

Wait the writeChunk response and send a putBlock request.

Wait for putBlock reply and issue a watchForCommmit.

Wait for watchForCommit response and release retries buffers accordingly (in CommitWatcher).

Thanks @duongkame the comment. Overall I agree with your idea.

Commenting on some details:

In step 4, if I understand correctly with the current master branch impl, the writeChunk response is NOT actually waited on before sending out the putBlock request. BOS#writeChunkToContainer() returns a future but the caller BOS#writeChunk() doesn't even care.
In step 5, watchForCommit itself waits for all pending unacknowledged putBlock requests to finish, not before.

Effectively it is already sending writeChunk and putBlock in parallel, so we can further improve the parallelism.

Jun 10 '24 23:06 smengcl

Commenting on some details:

In step 4, if I understand correctly with the current master branch impl, the writeChunk response is NOT actually waited on before sending out the putBlock request. BOS#writeChunkToContainer() returns a future but the caller BOS#writeChunk() doesn't even care.

In step 5, watchForCommit itself waits for all pending unacknowledged putBlock requests to finish, not before.

Effectively it is already sending writeChunk and putBlock in parallel, so we can further improve the parallelism.

That's correct, @smengcl. I just found out how PutBlock data is done. The field BOS.containerBlockData accumulates all the writeChunk requests that have been sent by the BOS, and that data field is sent with putBlock to confirm with datanodes about the sent chunks. So, the putBlock can be sent without waiting for writeChunk, but it must be sent after (to ensure the order of sending requests to Ratis) and contain the previous writeChunk in its data. We don't event need to wait for writeChunk (as the current implementation) because the watchForCommit will wait for the PutBlock to get fully replicated (and putBlock is sent after writeChunk, so when putBlock is replicated, writeChunk sure is).

void hsync() {
	
	CompleteableFuture<PutBlockRespose> putBlockResponse;
	synchronize(this) {
		// 1. Update the current buffer and internal counters in the OutputStream states.
		// 2. Use the current buffer and internal counters to compose a `writeChunk` request.
		writeChunk = getWriteChunkRequest();
		// 3. Send the writeChunk request.
		sendWriteChunkAsync(writeChunk);
                // 4. Update putBlock data with the last writeChunk and send.
                this.containerBlockData.add(writeChunk);
                putBlockResp = executePutBlockAsync();
	}
        // 5. Wait for `putBlock` reply and issue a watchForCommmit.
	CompleteableFuture<WatchForCommitResponse> watchForCommitReps = putBlockResp.thenApply(x -> watchForCommit(x));
	// 6. Wait for `watchForCommit` response and release retries buffers accordingly (in CommitWatcher).
	WatchForCommitResponse watchForCommit = watchForCommitReps.get();
}

Lastly, by refactoring, I meant the required refactors to make the logic follow the order of logic (e.g. which blocks should be synchronized together and in what other, or how data is used, by passing from one step to another or getting from object context). We do not necessarily need to rebuild the current implementation.

Jun 11 '24 17:06 duongkame

Superceded by #6859

Jul 08 '24 21:07 smengcl