ozone icon indicating copy to clipboard operation
ozone copied to clipboard

HDDS-10788. Intermittent failure in testWatchForCommitForRetryfailure

Open raju-balpande opened this issue 1 year ago • 5 comments

What changes were proposed in this pull request?

The intermittent failure in TestWatchForCommit.testWatchForCommitForRetryfailure Fixed the changes and tested it for 1000 runs.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10788

How was this patch tested?

Reproduced and tested the scenario in local system. Tested it in CI build. Earlier there were 4.5% failure ratio for 1000 runs https://github.com/raju-balpande/apache_ozone/actions/runs/9398762434 Now it succeed completely here https://github.com/raju-balpande/apache_ozone/actions/runs/9427808268

raju-balpande avatar Jun 10 '24 06:06 raju-balpande

Minor: the PR title is wrong, it should be HDDS-10788 instead of HDDS-10778

Galsza avatar Jun 10 '24 09:06 Galsza

@ibrusentsev Could you help to review since it might be related to your patch in https://github.com/apache/ozone/pull/5727?

ivandika3 avatar Jun 10 '24 10:06 ivandika3

@raju-balpande Thank you for the patch.

I agree with @adoroszlai that shutting down all DNs are not right. This would defeat the purpose of the test.

A bit of background, in Ozone's XceiverClientRatis#watchForCommit, there are two watch commits request with different ReplicationLevel

1. Watch for ALL_COMMITTED

2. Watch for MAJORITY_COMMITTED (If the previous watch threw an exception)

Based on the second watch request, the client will remove some failed datanode UUID from the commitInfoMap.

From my understanding testWatchForCommitFoRetryFailure is there to test what happens if two DNs (i.e. majority DNs) are down. Which means that both watch requests (ALL_COMMITTED and MAJORITY_COMMITTED) will both fail.

From the ticket comment, I believe issue might be that #5727 was changing the wrong test. It should have updated testWatchForCommitForGroupMismatchException instead of testWatchForCommitForRetryfailure.

So I suggest these changes:

1. Revert the changes for [HDDS-9826. Fix handling of GroupMismatchException in XceiverClientRatis.watchForCommit #5727](https://github.com/apache/ozone/pull/5727) on `TestWatchForCommit#testWatchForCommitForRetryFailure`

2. Apply the changes for [HDDS-9826. Fix handling of GroupMismatchException in XceiverClientRatis.watchForCommit #5727](https://github.com/apache/ozone/pull/5727)  to `TestWatchForCommit#testWatchForCommitForGroupMismatchException`
        // client should not attempt to watch with
        // MAJORITY_COMMITTED replication level, except the grpc IO issue
        if (!logCapturer.getOutput().contains("Connection refused")) {
          Assert.assertFalse(
              e.getMessage().contains("Watch-MAJORITY_COMMITTED"));
        }

Please take a look at #5727 for context and let me know what you think.

Thanks for quick explanation, This is giving the picture on the logic behind the recent changes. I checked the changes been done under that branch and shifted those checks into method TestWatchForCommit#testWatchForCommitForGroupMismatchException from TestWatchForCommit#testWatchForCommitForRetryFailure. In my system it is working fine. Checking the same on CI.

raju-balpande avatar Jun 10 '24 12:06 raju-balpande

@ivandika3 , As you suggested the changes,

  • I checked the changes been done under https://github.com/apache/ozone/pull/5727 and shifted those changes into method TestWatchForCommit#testWatchForCommitForGroupMismatchException from TestWatchForCommit#testWatchForCommitForRetryFailure.
  • In my system it is working fine. CI bulid is passed.
  • I checked the flakiness for TestWatchForCommit#testWatchForCommitForRetryFailure and is 100% success for 1000 runs as in https://github.com/raju-balpande/apache_ozone/actions/runs/9477452421
  • I checked the flakiness for TestWatchForCommit#testWatchForCommitForGroupMismatchException and is 100% success for 1000 runs as in https://github.com/raju-balpande/apache_ozone/actions/runs/9465541138 Can we go ahead to merge this pull request?

raju-balpande avatar Jun 12 '24 10:06 raju-balpande

@raju-balpande single test case results look good, but there are 34 failures when running the whole class: https://github.com/raju-balpande/apache_ozone/actions/runs/9450084317/job/26038239702

Can you please check?

adoroszlai avatar Jun 12 '24 10:06 adoroszlai