TimeoutError in `ScalingUpTest`.`test_adding_nodes_to_cluster`
Module: rptest.tests.scaling_up_test Class: ScalingUpTest Method: test_adding_nodes_to_cluster Arguments:
{
"partition_count": 1
}
in job https://buildkite.com/redpanda/redpanda/builds/18876#01849c1f-24b8-447e-956d-2b6f080f625b
test_id: rptest.tests.scaling_up_test.ScalingUpTest.test_adding_nodes_to_cluster.partition_count=1
status: FAIL
run time: 1 minute 2.276 seconds
TimeoutError('')
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/scaling_up_test.py", line 119, in test_adding_nodes_to_cluster
self.wait_for_partitions_rebalanced(total_replicas=total_replicas,
File "/root/tests/rptest/tests/scaling_up_test.py", line 70, in wait_for_partitions_rebalanced
wait_until(partitions_rebalanced,
File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
The test is trying to balance 20 partition replicas: 16 belonging to __consumer_offsets and 4 to a regular topic. The success criterion is number_of_replicas_per_node ∊ number_of_replicas / number_of_nodes ± 20%. This translates to the expected range [5.333333333333334, 8.0] (which is actually an exclusive range, despite the square brackets in the log).
The distribution of replicas per node the cluster settles with is
replicas per node: {1: 6, 2: 6, 3: 8}
This is totally normal since #5460, because there are two different topics in two distinct domains being balanced: domain 0 settles with the distribution [1, 1, 2], and domain -1 also allocates its remainder of replicas on node 3.
The test needs to be adjusted either to test balancing of partitions that belong to the same domain, or to allow for corner cases like this.
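To make the failure mode concrete, here is a minimal sketch of the success criterion described above (the predicate name matches the traceback, but the body is illustrative, not the actual test code):

```python
def partitions_rebalanced(replicas_per_node: dict[int, int],
                          total_replicas: int,
                          tolerance: float = 0.2) -> bool:
    """Every node's replica count must lie strictly within mean +/- 20%.

    Illustrative reimplementation of the check in scaling_up_test.py;
    the real test polls this via ducktape's wait_until().
    """
    expected = total_replicas / len(replicas_per_node)
    lower = expected * (1 - tolerance)
    upper = expected * (1 + tolerance)
    # The logged range [5.33, 8.0] is exclusive, hence strict comparisons.
    return all(lower < n < upper for n in replicas_per_node.values())

# The distribution the cluster settles with: node 3 holds exactly 8
# replicas, which equals the exclusive upper bound, so the check never
# passes and wait_until() raises TimeoutError.
print(partitions_rebalanced({1: 6, 2: 6, 3: 8}, total_replicas=20))  # False
print(partitions_rebalanced({1: 7, 2: 7, 3: 6}, total_replicas=20))  # True
```

Since the cluster reaches this stable distribution quickly and never moves off it, no timeout value would make the test pass; only the predicate (or the topic layout) can change the outcome.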
https://buildkite.com/redpanda/redpanda/builds/18880#01849bff-7a20-4482-96f3-57f561592be8
@mmaslankaprv I think https://github.com/redpanda-data/redpanda/commit/f0f683be923457f7553065946035e5cc2c256b5c is unrelated to this issue; this one is not about timeouts: the cluster reaches a stable replica balance in ~5s out of the 30s timeout, and the replica distribution never improves after that.
another one: https://buildkite.com/redpanda/redpanda/builds/18948#0184a180-4e1e-4831-b5c7-57a05b7cacc8
Two more today:
FAIL test: ScalingUpTest.test_adding_nodes_to_cluster.partition_count=1 (2/47 runs)
failure at 2022-11-22T12:51:49.347Z: TimeoutError('') on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/18914#01849ed1-f732-4647-9c49-d36c17a01e7a
failure at 2022-11-22T15:29:01.731Z: TimeoutError('') on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/18919#01849f4c-57cb-4516-b76b-37128ac44ad9
This is also still failing on 22.3.x https://buildkite.com/redpanda/redpanda/builds/19280#0184c8d9-35ac-4fd2-ad7a-d944b237ee90
another instance - https://buildkite.com/redpanda/redpanda/builds/19611#0184ff25-8370-463d-bac1-4ce05d050af4
This came back on nightly retest of dev https://ci-artifacts.dev.vectorized.cloud/redpanda/25291/0186ee17-bd4e-44c1-ba92-1cb8685ee2db/vbuild/ducktape/results/2023-03-17--001/ScalingUpTest/test_adding_nodes_to_cluster/partition_count=1/64/
https://buildkite.com/redpanda/redpanda/builds/25832#01871c7f-dd99-4fcd-814c-9301b5e20c01
https://buildkite.com/redpanda/redpanda/builds/25904#018726d9-2297-467c-b6c2-9b512ff330d1
https://buildkite.com/redpanda/redpanda/builds/25981#01872c34-5b91-42b8-9a86-5691604f122d/6-1895
https://buildkite.com/redpanda/redpanda/builds/26094#018732e3-c45c-4f1a-9be7-d4452f06974a
https://buildkite.com/redpanda/redpanda/builds/26146#01873657-4218-4a7b-b7cc-84f8c84e2c9f
https://buildkite.com/redpanda/redpanda/builds/26162#018737a2-3d1e-44b9-a979-7f117b1b243f https://buildkite.com/redpanda/redpanda/builds/26162#018737b3-d248-4be3-aac0-859f5bb01adf
https://buildkite.com/redpanda/redpanda/builds/26345#018747f8-1352-453a-831b-1093a3f9b029 https://buildkite.com/redpanda/redpanda/builds/26384#01874978-cf93-47c9-81de-c5d264ec2b22
https://buildkite.com/redpanda/redpanda/builds/26534#018753e0-35c1-4580-88a8-828b5bbbe6ee/6-1855
There are corner cases that still produce almost the same failure; these are handled by #10024.
https://buildkite.com/redpanda/redpanda/builds/29872#018851b3-5ba0-4acf-ac99-63298506a279
This is the case of [1,1,2] distribution
replicas per domain per node: {-1: {1: 5, 2: 5, 3: 6}, 0: {1: 1, 2: 1, 3: 2}}
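Summing the per-domain counts from the log line above shows how two individually balanced domains compound into the failing overall distribution (a small sketch; the data is taken from this report, the code is illustrative):

```python
from collections import Counter

# Per-domain replica counts from the log line above (domain -1 and domain 0).
# Each domain is balanced to within one replica on its own, but both
# remainders land on node 3, compounding into the overall skew.
per_domain = {
    -1: {1: 5, 2: 5, 3: 6},
    0:  {1: 1, 2: 1, 3: 2},
}

totals = Counter()
for counts in per_domain.values():
    totals.update(counts)

print(dict(totals))  # {1: 6, 2: 6, 3: 8} -> node 3 hits the exclusive upper bound
```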
moving over to #10024
Please do not reopen this issue; if needed, create a new issue and link this one as a reference.