HDDS-10788. Intermittent failure in testWatchForCommitForRetryfailure
What changes were proposed in this pull request?
The intermittent failure in TestWatchForCommit.testWatchForCommitForRetryfailure Fixed the changes and tested it for 1000 runs.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10788
How was this patch tested?
Reproduced and tested the scenario in local system. Tested it in CI build. Earlier there were 4.5% failure ratio for 1000 runs https://github.com/raju-balpande/apache_ozone/actions/runs/9398762434 Now it succeed completely here https://github.com/raju-balpande/apache_ozone/actions/runs/9427808268
Minor: the PR title is wrong, it should be HDDS-10788 instead of HDDS-10778
@ibrusentsev Could you help to review since it might be related to your patch in https://github.com/apache/ozone/pull/5727?
@raju-balpande Thank you for the patch.
I agree with @adoroszlai that shutting down all DNs are not right. This would defeat the purpose of the test.
A bit of background, in Ozone's XceiverClientRatis#watchForCommit, there are two watch commits request with different ReplicationLevel
1. Watch for ALL_COMMITTED 2. Watch for MAJORITY_COMMITTED (If the previous watch threw an exception)Based on the second watch request, the client will remove some failed datanode UUID from the commitInfoMap.
From my understanding
testWatchForCommitFoRetryFailureis there to test what happens if two DNs (i.e. majority DNs) are down. Which means that both watch requests (ALL_COMMITTED and MAJORITY_COMMITTED) will both fail.From the ticket comment, I believe issue might be that #5727 was changing the wrong test. It should have updated
testWatchForCommitForGroupMismatchExceptioninstead oftestWatchForCommitForRetryfailure.So I suggest these changes:
1. Revert the changes for [HDDS-9826. Fix handling of GroupMismatchException in XceiverClientRatis.watchForCommit #5727](https://github.com/apache/ozone/pull/5727) on `TestWatchForCommit#testWatchForCommitForRetryFailure` 2. Apply the changes for [HDDS-9826. Fix handling of GroupMismatchException in XceiverClientRatis.watchForCommit #5727](https://github.com/apache/ozone/pull/5727) to `TestWatchForCommit#testWatchForCommitForGroupMismatchException`// client should not attempt to watch with // MAJORITY_COMMITTED replication level, except the grpc IO issue if (!logCapturer.getOutput().contains("Connection refused")) { Assert.assertFalse( e.getMessage().contains("Watch-MAJORITY_COMMITTED")); }Please take a look at #5727 for context and let me know what you think.
Thanks for quick explanation, This is giving the picture on the logic behind the recent changes. I checked the changes been done under that branch and shifted those checks into method TestWatchForCommit#testWatchForCommitForGroupMismatchException from TestWatchForCommit#testWatchForCommitForRetryFailure. In my system it is working fine. Checking the same on CI.
@ivandika3 , As you suggested the changes,
- I checked the changes been done under https://github.com/apache/ozone/pull/5727 and shifted those changes into method TestWatchForCommit#testWatchForCommitForGroupMismatchException from TestWatchForCommit#testWatchForCommitForRetryFailure.
- In my system it is working fine. CI bulid is passed.
- I checked the flakiness for
TestWatchForCommit#testWatchForCommitForRetryFailureand is 100% success for 1000 runs as in https://github.com/raju-balpande/apache_ozone/actions/runs/9477452421 - I checked the flakiness for
TestWatchForCommit#testWatchForCommitForGroupMismatchExceptionand is 100% success for 1000 runs as in https://github.com/raju-balpande/apache_ozone/actions/runs/9465541138 Can we go ahead to merge this pull request?
@raju-balpande single test case results look good, but there are 34 failures when running the whole class: https://github.com/raju-balpande/apache_ozone/actions/runs/9450084317/job/26038239702
Can you please check?