CI Failure in consumer_group_test.ConsumerGroupTest.test_basic_group_join

Open BenPope opened this issue 3 years ago • 1 comments

Version & Environment

Redpanda version: dev

https://buildkite.com/redpanda/vtools/builds/3139#01827b68-ebc9-4ff8-867f-d9461479bafe

What went wrong?

CI Failure

What should have happened instead?

CI success

How to reproduce the issue?

???

Additional information

[INFO  - 2022-08-08 05:44:09,937 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.consumer_group_test.ConsumerGroupTest.test_basic_group_join.static_members=False: FAIL: TimeoutError('')
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/consumer_group_test.py", line 127, in test_basic_group_join
    wait_until(lambda: ConsumerGroupTest.consumed_at_least(consumers, 50),
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

BenPope, Aug 08 '22 08:08

The test runs 2 consumers in the same group. The failing check is supposed to verify that each consumer consumes at least 50 of the 5000 messages in the topic. In this case, one consumer got all 5000 messages and the other got none.
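A minimal sketch of that kind of check, to make the failure mode concrete. This is not the rptest implementation: `FakeConsumer`, its `message_count` field, and `describe_counts` are hypothetical names introduced for illustration, and the body of `consumed_at_least` is an assumption based only on the description above.

```python
from dataclasses import dataclass


@dataclass
class FakeConsumer:
    """Hypothetical stand-in for a test consumer; it only tracks a message count."""
    name: str
    message_count: int


def consumed_at_least(consumers, min_messages):
    """Return True only if every consumer has read at least min_messages."""
    return all(c.message_count >= min_messages for c in consumers)


def describe_counts(consumers):
    """Build a per-consumer summary, usable as a lazily evaluated timeout message."""
    return ", ".join(f"{c.name}={c.message_count}" for c in consumers)


# The split described above: one consumer drained the topic, the other read nothing.
failing = [FakeConsumer("consumer-0", 5000), FakeConsumer("consumer-1", 0)]
# A split that would satisfy the check (made-up numbers).
passing = [FakeConsumer("consumer-0", 2600), FakeConsumer("consumer-1", 2400)]

failing_ok = consumed_at_least(failing, 50)
passing_ok = consumed_at_least(passing, 50)
failing_summary = describe_counts(failing)
```

The traceback shows that `wait_until` evaluates `err_msg` when it is callable, so passing something like `lambda: describe_counts(consumers)` as `err_msg` would probably replace the empty `TimeoutError('')` with the per-consumer counts.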

Since the consumers are kafka-console-consumer.sh-based, there is no control over when they begin to consume, so the first one can get all the messages before the second one has finished joining the group. To remove the race from the test, it needs to be switched to a more advanced consumer.
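One possible way to remove a race like this, sketched under assumptions and not necessarily what the test should adopt: hold back the step that makes messages available until both members are in the group. `FakeGroup` and `wait_for` are hypothetical in-memory stand-ins, not rptest or ducktape APIs. A real test would also need to wait for the rebalance to finish and partitions to be assigned, which this sketch does not model.

```python
import threading
import time


class FakeGroup:
    """Hypothetical in-memory stand-in for a consumer group's member list."""

    def __init__(self):
        self._lock = threading.Lock()
        self._members = []

    def join(self, member_id):
        with self._lock:
            self._members.append(member_id)

    def member_count(self):
        with self._lock:
            return len(self._members)


def wait_for(condition, timeout_sec, backoff_sec=0.01):
    """Poll condition() until it is true; raise TimeoutError once timeout_sec passes."""
    deadline = time.monotonic() + timeout_sec
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met in time")
        time.sleep(backoff_sec)


group = FakeGroup()
group.join("consumer-0")
# The second consumer joins a little later, as a slow-starting process might.
late_join = threading.Timer(0.05, group.join, args=("consumer-1",))
late_join.start()

# Gate: do not proceed to producing until both members are in the group.
wait_for(lambda: group.member_count() == 2, timeout_sec=5)
members_at_produce_time = group.member_count()
late_join.join()
```

With the gate in place, the first consumer cannot drain the topic alone simply because it started earlier.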

RP logs in the test are at the INFO level, so the race described above cannot be fully confirmed.

Triage bottom line: the race condition is in the test, not an RP bug. Removing kind/bug.

dlex, Aug 08 '22 18:08

https://buildkite.com/redpanda/vtools/builds/3220#01829002-7fe5-4775-aba3-8fa06d20b3d3

Module: rptest.tests.consumer_group_test
Class:  ConsumerGroupTest
Method: test_basic_group_join
Arguments:
{
  "static_members": false
}

BenPope, Aug 12 '22 10:08

Seen again in both the big and many partitions cases

FAIL test: ConsumerGroupTest.test_basic_group_join.static_members=False (1/24 runs) failure at 2022-08-17T07:38:31.224Z: TimeoutError('') in job https://buildkite.com/redpanda/vtools/builds/3271#0182a9c3-a7aa-4850-a6b7-65bee8152d80

stack trace:

====================================================================================================
test_id:    rptest.tests.consumer_group_test.ConsumerGroupTest.test_basic_group_join.static_members=False
status:     FAIL
run time:   46.391 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/tests/consumer_group_test.py", line 127, in test_basic_group_join
    wait_until(lambda: ConsumerGroupTest.consumed_at_least(consumers, 50),
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

andrwng, Aug 17 '22 18:08