TestRegionLabelDenyScheduler is flaky

Open okJiang opened this issue 1 year ago • 7 comments

Flaky Test

Which jobs are failing

TestRegionLabelDenyScheduler

CI link

https://do.pingcap.net/jenkins/blue/organizations/jenkins/tikv%2Fpd%2Fpull_integration_realcluster_test/detail/pull_integration_realcluster_test/130/pipeline

Reason for failure (if possible)

The grant_leader scheduler did not schedule every region other than the one denied region.

The following logs list the region IDs that each scheduler scheduled:

evict_leader.log

grant_leader.log

The grant_leader scheduler did not schedule regions 26 and 94:

comm -3 <(sort evict_leader.log) <(sort grant_leader.log)
26
94
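
`comm -3` prints the lines unique to either input but does not say which side they came from once the column indentation is lost. A minimal sketch that labels each side explicitly, assuming both files hold one region ID per line (as the command above implies); the function name `region_diff` is hypothetical:

```shell
# Sketch: label which list each unmatched region ID belongs to.
# Usage: region_diff FIRST_LIST SECOND_LIST   (one region ID per line)
region_diff() {
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
    # comm needs both inputs sorted in the same (byte-wise) order.
    LC_ALL=C sort -u "$1" > "$tmp_a"
    LC_ALL=C sort -u "$2" > "$tmp_b"
    # -23 keeps lines found only in the first input.
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b" | while IFS= read -r id; do
        printf 'only in %s: %s\n' "$1" "$id"
    done
    # -13 keeps lines found only in the second input.
    LC_ALL=C comm -13 "$tmp_a" "$tmp_b" | while IFS= read -r id; do
        printf 'only in %s: %s\n' "$2" "$id"
    done
    rm -f "$tmp_a" "$tmp_b"
}
```

It would be called as `region_diff evict_leader.log grant_leader.log`.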

Anything else

okJiang avatar Jun 27 '24 07:06 okJiang

https://do.pingcap.net/jenkins/blue/organizations/jenkins/tikv%2Fpd%2Fpull_integration_realcluster_test/detail/pull_integration_realcluster_test/144/pipeline/

test2.log

test.log

comm -3 <(grep "op finish duration less than 10s" test.log | grep -oP '\[region-id=\K\d+' | sort) <(grep "op finish duration less than 10s" test2.log | grep -oP '\[region-id=\K\d+' | sort)
114
28
38
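
The extraction step above relies on `grep -oP`, which needs a grep built with PCRE support. A rough portable equivalent using `sed` is sketched below; the function name `region_ids` is hypothetical, and it assumes each matching log line carries a single `[region-id=N]` field (if a line had several, only the last would be printed, unlike `grep -o`):

```shell
# Sketch: print the sorted, de-duplicated region IDs of the matching log lines.
# Usage: region_ids LOG_FILE
region_ids() {
    # Keep only the lines that report a fast-finishing operator.
    grep -F 'op finish duration less than 10s' "$1" |
        # Reduce each line to the digits that follow "[region-id=".
        sed -n 's/.*\[region-id=\([0-9][0-9]*\).*/\1/p' |
        # Byte-wise sort so the output can be fed straight to comm.
        LC_ALL=C sort -u
}
```

The two ID lists can then be written to files with `region_ids test.log > a.txt` and `region_ids test2.log > b.txt` and compared with `comm -3 a.txt b.txt`.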

okJiang avatar Jun 27 '24 10:06 okJiang

/assign

okJiang avatar Jun 28 '24 03:06 okJiang

Encountered again: https://do.pingcap.net/jenkins/blue/organizations/jenkins/tikv%2Fpd%2Fpull_integration_realcluster_test/detail/pull_integration_realcluster_test/250/pipeline

=== RUN   TestRegionLabelDenyScheduler
[2024/07/05 16:06:36.782 +08:00] [INFO] [pd_service_discovery.go:1018] ["[pd] switch leader"] [new-leader=http://127.0.0.1:2382] [old-leader=http://127.0.0.1:2384]
    testutil.go:56: 
        	Error Trace:	/home/jenkins/agent/workspace/tikv/pd/pull_integration_realcluster_test/pd/client/testutil/testutil.go:56
        	            				/home/jenkins/agent/workspace/tikv/pd/pull_integration_realcluster_test/pd/tests/integrations/realcluster/scheduler_test.go:105
        	Error:      	Condition never satisfied
        	Test:       	TestRegionLabelDenyScheduler
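
`Condition never satisfied` is the message a poll-until-timeout assertion reports when the checked condition stays false until the deadline, so the trace shows that the test gave up waiting rather than what the scheduler was doing. The same pattern can be sketched as a shell helper for reproducing the wait by hand; `wait_until` and its arguments are hypothetical, and the timeout and interval are whole seconds:

```shell
# Sketch: run a check command every INTERVAL seconds until it succeeds
# or TIMEOUT seconds have elapsed.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Stop as soon as the check command exits successfully.
        if "$@"; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "condition never satisfied after ${timeout}s" >&2
    return 1
}
```

Because the helper only reports the timeout, the scheduler logs still have to be inspected to see which regions were left unscheduled.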

HuSharp avatar Jul 05 '24 08:07 HuSharp

This failure was caused by the failure of the preceding test, which I have recorded in another issue: https://github.com/tikv/pd/issues/8348#issuecomment-2219696341. So we can still close this issue, and we will discuss the instability of TestTransferLeader in another issue.

okJiang avatar Jul 10 '24 06:07 okJiang

https://do.pingcap.net/jenkins/blue/organizations/jenkins/tikv%2Fpd%2Fpull_integration_realcluster_test/detail/pull_integration_realcluster_test/310/pipeline/

It seems that the 'stream not found' error interfered with the grant-leader process and caused the timeout.

The grant-leader operation was still in progress when the test timed out.

image

okJiang avatar Jul 15 '24 09:07 okJiang

Fixed by https://github.com/tikv/pd/pull/8394/commits/5941965e3ffcf694f395671217284e2f2a17730a

okJiang avatar Jul 16 '24 09:07 okJiang

Encountered again: https://do.pingcap.net/jenkins/blue/organizations/jenkins/tikv%2Fpd%2Fpull_integration_realcluster_test/detail/pull_integration_realcluster_test/467/

--- PASS: TestReloadLabel (63.86s)
=== RUN   TestTransferLeader
--- PASS: TestTransferLeader (3.07s)
=== RUN   TestRegionLabelDenyScheduler
    testutil.go:56: 
        	Error Trace:	/home/jenkins/agent/workspace/tikv/pd/pull_integration_realcluster_test/pd/client/testutil/testutil.go:56
        	            				/home/jenkins/agent/workspace/tikv/pd/pull_integration_realcluster_test/pd/tests/integrations/realcluster/scheduler_test.go:178
        	Error:      	Condition never satisfied
        	Test:       	TestRegionLabelDenyScheduler

HuSharp avatar Aug 01 '24 03:08 HuSharp

Haven't seen this issue in a while, so closing it.

okJiang avatar Nov 18 '24 07:11 okJiang