CI Failure in consumer_group_test.ConsumerGroupTest.test_basic_group_join
Version & Environment
Redpanda version: dev
https://buildkite.com/redpanda/vtools/builds/3139#01827b68-ebc9-4ff8-867f-d9461479bafe
What went wrong?
CI Failure
What should have happened instead?
CI Success
How to reproduce the issue?
???
Additional information
[INFO - 2022-08-08 05:44:09,937 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.consumer_group_test.ConsumerGroupTest.test_basic_group_join.static_members=False: FAIL: TimeoutError('')
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 476, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/home/ubuntu/redpanda/tests/rptest/tests/consumer_group_test.py", line 127, in test_basic_group_join
wait_until(lambda: ConsumerGroupTest.consumed_at_least(consumers, 50),
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 58, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
The test runs two consumers in the same group. The failing check is supposed to verify that each consumer consumes at least 50 of the 5000 messages in the topic. In this case, one consumer gets all 5000 messages and the other gets none.
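To make the failing condition concrete, here is a minimal sketch of what such a check could look like. It is not the code in consumer_group_test.py: `FakeConsumer` and its `message_count` field are hypothetical stand-ins, and the real `consumed_at_least` may be implemented differently.

```python
from dataclasses import dataclass


@dataclass
class FakeConsumer:
    # Hypothetical stand-in for a console consumer; only tracks a count.
    message_count: int


def consumed_at_least(consumers, n):
    """Return True only if every consumer has consumed at least n messages."""
    return all(c.message_count >= n for c in consumers)


# Both consumers share the 5000 messages: the check passes.
balanced = [FakeConsumer(2500), FakeConsumer(2500)]
# One consumer takes everything, as in this failure: the check never passes.
skewed = [FakeConsumer(5000), FakeConsumer(0)]

print(consumed_at_least(balanced, 50))
print(consumed_at_least(skewed, 50))
```

With the skewed split, the condition stays false no matter how long the test polls, so the surrounding wait_until ends in TimeoutError.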
Since the consumers are based on kafka-console-consumer.sh, the test has no control over when they begin to consume, so the first one can consume all the messages before the other has finished joining the group. To remove the race, the test needs to be switched to a more advanced consumer.
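One possible way to close the race, not necessarily the fix intended here, is to hold back the producer until the group reports the expected number of members. The sketch below only illustrates that ordering: `poll_until` is a local stand-in for a polling helper, and `FakeGroup` with its methods is a hypothetical in-memory model, not an rptest or Redpanda API.

```python
import time


def poll_until(condition, timeout_sec, backoff_sec=0.01):
    """Poll condition() until it is truthy; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(backoff_sec)
    raise TimeoutError("condition not met within timeout")


class FakeGroup:
    """Hypothetical model of a consumer group whose members join gradually."""

    def __init__(self, expected_members):
        self.expected_members = expected_members
        self.members = 0
        self.events = []

    def start_consumers(self):
        self.events.append("consumers started")

    def member_count(self):
        # Each query simulates one more consumer completing its join.
        if self.members < self.expected_members:
            self.members += 1
        return self.members

    def produce(self):
        self.events.append(f"produce with {self.members} members")


def run_group_join(group):
    group.start_consumers()
    # Gate the producer on full group membership so that no single
    # consumer can drain the topic before the others have joined.
    poll_until(lambda: group.member_count() >= group.expected_members,
               timeout_sec=5)
    group.produce()


group = FakeGroup(expected_members=2)
run_group_join(group)
print(group.events)
```

In a real test, the membership query would have to come from the broker or from a consumer client that exposes its join state, which is what the console consumer lacks.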
RP logs in the test are at the INFO level, so the above conclusion cannot be fully verified.
Triage bottom line: the race condition is in the test, not an RP bug; removing kind/bug.
https://buildkite.com/redpanda/vtools/builds/3220#01829002-7fe5-4775-aba3-8fa06d20b3d3
Module: rptest.tests.consumer_group_test
Class: ConsumerGroupTest
Method: test_basic_group_join
Arguments:
{
"static_members": false
}
Seen again in both the big and many-partitions cases.
FAIL test: ConsumerGroupTest.test_basic_group_join.static_members=False (1/24 runs) failure at 2022-08-17T07:38:31.224Z: TimeoutError('') in job https://buildkite.com/redpanda/vtools/builds/3271#0182a9c3-a7aa-4850-a6b7-65bee8152d80
stack trace:
====================================================================================================
test_id: rptest.tests.consumer_group_test.ConsumerGroupTest.test_basic_group_join.static_members=False
status: FAIL
run time: 46.391 seconds
TimeoutError('')
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 476, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/home/ubuntu/redpanda/tests/rptest/tests/consumer_group_test.py", line 127, in test_basic_group_join
wait_until(lambda: ConsumerGroupTest.consumed_at_least(consumers, 50),
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 58, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError