TimeoutError in `ScalingUpTest`.`test_adding_nodes_to_cluster`
Module: rptest.tests.scaling_up_test Class: ScalingUpTest Method: test_adding_nodes_to_cluster Arguments:
{
"partition_count": 1
}
in job https://buildkite.com/redpanda/redpanda/builds/18876#01849c1f-24b8-447e-956d-2b6f080f625b
test_id: rptest.tests.scaling_up_test.ScalingUpTest.test_adding_nodes_to_cluster.partition_count=1
status: FAIL
run time: 1 minute 2.276 seconds
TimeoutError('')
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/scaling_up_test.py", line 119, in test_adding_nodes_to_cluster
self.wait_for_partitions_rebalanced(total_replicas=total_replicas,
File "/root/tests/rptest/tests/scaling_up_test.py", line 70, in wait_for_partitions_rebalanced
wait_until(partitions_rebalanced,
File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
The test is trying to balance 20 partition replicas: 16 belonging to __consumer_offsets and 4 to a regular topic. The success criterion is number_of_replicas_per_node ∊ number_of_replicas / number_of_nodes ± 20%. This translates to the expected range [5.333333333333334, 8.0] (which is actually an exclusive range, despite the square brackets in the log).
The distribution of replicas per node the cluster settles with is
replicas per node: {1: 6, 2: 6, 3: 8}
This is totally normal since #5460, because there are two different topics in two distinct domains being balanced: domain 0 settles with the distribution [1, 1, 2], and domain -1 also allocates its remainder of replicas on node 3.
The test needs to be adjusted either to test balancing of partitions that belong to the same domain, or to allow for corner cases like this.
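To make the failure mode concrete, here is a minimal sketch of the success criterion described above (the predicate name matches the traceback, but the body is illustrative, not the actual test code):

```python
def partitions_rebalanced(replicas_per_node: dict[int, int],
                          total_replicas: int,
                          tolerance: float = 0.2) -> bool:
    """Every node's replica count must lie strictly within mean +/- 20%.

    Illustrative reimplementation of the check in scaling_up_test.py;
    the real test polls this via ducktape's wait_until().
    """
    expected = total_replicas / len(replicas_per_node)
    lower = expected * (1 - tolerance)
    upper = expected * (1 + tolerance)
    # The logged range [5.33, 8.0] is exclusive, hence strict comparisons.
    return all(lower < n < upper for n in replicas_per_node.values())

# The distribution the cluster settles with: node 3 holds exactly 8
# replicas, which equals the exclusive upper bound, so the check never
# passes and wait_until() raises TimeoutError.
print(partitions_rebalanced({1: 6, 2: 6, 3: 8}, total_replicas=20))  # False
print(partitions_rebalanced({1: 7, 2: 7, 3: 6}, total_replicas=20))  # True
```

Since the cluster reaches this stable distribution quickly and never moves off it, no timeout value would make the test pass; only the predicate (or the topic layout) can change the outcome.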
https://buildkite.com/redpanda/redpanda/builds/18880#01849bff-7a20-4482-96f3-57f561592be8
@mmaslankaprv I think https://github.com/redpanda-data/redpanda/commit/f0f683be923457f7553065946035e5cc2c256b5c is unrelated to this issue; this one is not about timeouts: the cluster reaches a stable replica balance in ~5s out of the 30s timeout, and the replica distribution never improves after that.
another one: https://buildkite.com/redpanda/redpanda/builds/18948#0184a180-4e1e-4831-b5c7-57a05b7cacc8
Two more today:
FAIL test: ScalingUpTest.test_adding_nodes_to_cluster.partition_count=1 (2/47 runs)
failure at 2022-11-22T12:51:49.347Z: TimeoutError('') on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/18914#01849ed1-f732-4647-9c49-d36c17a01e7a
failure at 2022-11-22T15:29:01.731Z: TimeoutError('') on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/18919#01849f4c-57cb-4516-b76b-37128ac44ad9
This is also still failing on 22.3.x https://buildkite.com/redpanda/redpanda/builds/19280#0184c8d9-35ac-4fd2-ad7a-d944b237ee90
another instance - https://buildkite.com/redpanda/redpanda/builds/19611#0184ff25-8370-463d-bac1-4ce05d050af4
This came back on nightly retest of dev https://ci-artifacts.dev.vectorized.cloud/redpanda/25291/0186ee17-bd4e-44c1-ba92-1cb8685ee2db/vbuild/ducktape/results/2023-03-17--001/ScalingUpTest/test_adding_nodes_to_cluster/partition_count=1/64/
https://buildkite.com/redpanda/redpanda/builds/25832#01871c7f-dd99-4fcd-814c-9301b5e20c01
https://buildkite.com/redpanda/redpanda/builds/25904#018726d9-2297-467c-b6c2-9b512ff330d1
https://buildkite.com/redpanda/redpanda/builds/25981#01872c34-5b91-42b8-9a86-5691604f122d/6-1895
https://buildkite.com/redpanda/redpanda/builds/26094#018732e3-c45c-4f1a-9be7-d4452f06974a
https://buildkite.com/redpanda/redpanda/builds/26146#01873657-4218-4a7b-b7cc-84f8c84e2c9f
https://buildkite.com/redpanda/redpanda/builds/26162#018737a2-3d1e-44b9-a979-7f117b1b243f https://buildkite.com/redpanda/redpanda/builds/26162#018737b3-d248-4be3-aac0-859f5bb01adf
https://buildkite.com/redpanda/redpanda/builds/26345#018747f8-1352-453a-831b-1093a3f9b029 https://buildkite.com/redpanda/redpanda/builds/26384#01874978-cf93-47c9-81de-c5d264ec2b22
https://buildkite.com/redpanda/redpanda/builds/26534#018753e0-35c1-4580-88a8-828b5bbbe6ee/6-1855
There are corner cases that still produce almost the same failure; these are handled by #10024.
https://buildkite.com/redpanda/redpanda/builds/29872#018851b3-5ba0-4acf-ac99-63298506a279
This is the case of [1,1,2] distribution
replicas per domain per node: {-1: {1: 5, 2: 5, 3: 6}, 0: {1: 1, 2: 1, 3: 2}}
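Summing the per-domain counts from the log line above shows how two individually balanced domains compound into the failing overall distribution (a small sketch; the data is taken from this report, the code is illustrative):

```python
from collections import Counter

# Per-domain replica counts from the log line above (domain -1 and domain 0).
# Each domain is balanced to within one replica on its own, but both
# remainders land on node 3, compounding into the overall skew.
per_domain = {
    -1: {1: 5, 2: 5, 3: 6},
    0:  {1: 1, 2: 1, 3: 2},
}

totals = Counter()
for counts in per_domain.values():
    totals.update(counts)

print(dict(totals))  # {1: 6, 2: 6, 3: 8} -> node 3 hits the exclusive upper bound
```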
moving over to #10024
Please do not reopen this issue; if needed, create a new issue and link this one as a reference.