
CI Failure (Attempted to perform operation: 'offset_data_stream()' on a closed segment) in `ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy`

Open oleiman opened this issue 2 years ago • 3 comments

https://buildkite.com/redpanda/redpanda/builds/44921#018d8b3e-c497-4199-963c-c72f2a2d127b

Module: rptest.tests.e2e_shadow_indexing_test
Class:  ShadowIndexingWhileBusyTest
Method: test_create_or_delete_topics_while_busy
Arguments:
{
  "cloud_storage_type": 1,
  "short_retention": false
}
====================================================================================================
test_id:    rptest.tests.e2e_shadow_indexing_test.ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=False.cloud_storage_type=CloudStorageType.S3
status:     FAIL
run time:   2 minutes 44.294 seconds


    <BadLogLines nodes=docker-rp-7(1) example="ERROR 2024-02-09 00:40:06,124 [shard 0:fetc] kafka - fetch.cc:1171 - unknown exception thrown: std::runtime_error (Attempted to perform operation: 'offset_data_stream()' on a closed segment: {offset_tracker:{term:1, base_offset:1919, committed_offset:2002, dirty_offset:2002}, compacted_segment=0, finished_self_compaction=0, finished_windowed_compaction=0, generation={59}, reader={/var/lib/redpanda/data/kafka/topic-hnegwjyeyf/8_52/1919-1-v1.log, (21242491 bytes)}, writer=nullptr, cache={cache_size=7}, compaction_index:nullopt, closed=1, tombstone=1, index={file:/var/lib/redpanda/data/kafka/topic-hnegwjyeyf/8_52/1919-1-v1.base_index, offsets:{1919}, index:{header_bitflags:0, base_offset:{1919}, max_offset:{2002}, base_timestamp:{timestamp: 1707439192434}, max_timestamp:{timestamp: 1707439193269}, batch_timestamps_are_monotonic:0, with_offset:true, non_data_timestamps:0, broker_timestamp:{{timestamp: 1707439194397}}, num_compactible_records_appended:{81}, index(27,27,27)}, step:32768, needs_persistence:0}})">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 173, in wrapped
    redpanda.raise_on_bad_logs(
  File "/root/tests/rptest/services/redpanda.py", line 1412, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-7(1) example="ERROR 2024-02-09 00:40:06,124 [shard 0:fetc] kafka - fetch.cc:1171 - unknown exception thrown: std::runtime_error (Attempted to perform operation: 'offset_data_stream()' on a closed segment: {offset_tracker:{term:1, base_offset:1919, committed_offset:2002, dirty_offset:2002}, compacted_segment=0, finished_self_compaction=0, finished_windowed_compaction=0, generation={59}, reader={/var/lib/redpanda/data/kafka/topic-hnegwjyeyf/8_52/1919-1-v1.log, (21242491 bytes)}, writer=nullptr, cache={cache_size=7}, compaction_index:nullopt, closed=1, tombstone=1, index={file:/var/lib/redpanda/data/kafka/topic-hnegwjyeyf/8_52/1919-1-v1.base_index, offsets:{1919}, index:{header_bitflags:0, base_offset:{1919}, max_offset:{2002}, base_timestamp:{timestamp: 1707439192434}, max_timestamp:{timestamp: 1707439193269}, batch_timestamps_are_monotonic:0, with_offset:true, non_data_timestamps:0, broker_timestamp:{{timestamp: 1707439194397}}, num_compactible_records_appended:{81}, index(27,27,27)}, step:32768, needs_persistence:0}})">

oleiman avatar Feb 09 '24 23:02 oleiman

https://buildkite.com/redpanda/vtools/builds/11755

michael-redpanda avatar Feb 13 '24 20:02 michael-redpanda

* https://buildkite.com/redpanda/vtools/builds/11875
* https://buildkite.com/redpanda/vtools/builds/11880

vbotbuildovich avatar Feb 17 '24 00:02 vbotbuildovich

* https://buildkite.com/redpanda/vtools/builds/11886
* https://buildkite.com/redpanda/vtools/builds/11891

vbotbuildovich avatar Feb 18 '24 00:02 vbotbuildovich

* https://buildkite.com/redpanda/vtools/builds/11901

vbotbuildovich avatar Feb 19 '24 00:02 vbotbuildovich

* https://buildkite.com/redpanda/vtools/builds/11905
* https://buildkite.com/redpanda/redpanda/builds/45108

vbotbuildovich avatar Feb 20 '24 00:02 vbotbuildovich

The `offset_data_stream()` is opened when we open a log reader. The error is thrown because the segment is already closed. That could be caused by anything from the topic/partition being deleted, to the partition being moved by the balancer, to the broker shutting down, to the segment being removed by retention...

Overall this looks like a race condition: creation of the log reader needs to synchronize with operations that close or remove the segment.
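The race described above can be sketched with a toy model (this is illustrative Python, not Redpanda's actual C++/Seastar code; `Segment`, `make_reader`, and the lock-based fix are assumptions for the sake of the sketch). The buggy path calls `offset_data_stream()` without coordinating with `close()`; the fixed path performs the closed-check and stream creation atomically under the same lock that `close()` takes:

```python
import threading

class ClosedSegmentError(RuntimeError):
    pass

class Segment:
    """Toy model of a log segment; not Redpanda's actual class."""
    def __init__(self):
        self.closed = False
        self._lock = threading.Lock()

    def close(self):
        # e.g. partition deleted, moved by the balancer, or trimmed by retention
        with self._lock:
            self.closed = True

    def offset_data_stream(self):
        # unsynchronized path: mirrors the error in the report above
        if self.closed:
            raise ClosedSegmentError(
                "Attempted to perform operation: 'offset_data_stream()' "
                "on a closed segment")
        return "stream"

    def make_reader(self):
        # sketched fix: hold the same lock as close(), so close() cannot
        # interleave between the closed-check and stream creation
        with self._lock:
            if self.closed:
                return None  # caller handles "segment gone" gracefully
            return self.offset_data_stream()

seg = Segment()
assert seg.make_reader() == "stream"  # open segment: reader created
seg.close()
assert seg.make_reader() is None      # synchronized path: no exception
try:
    seg.offset_data_stream()          # unsynchronized path reproduces the error
except ClosedSegmentError as e:
    print(e)
```

In Seastar-style code the same idea is usually expressed with a gate or an abort source rather than a mutex, but the invariant is the same: the closed-check and the stream open must not be separable by a concurrent close.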

dotnwat avatar Feb 22 '24 19:02 dotnwat

In the log we can see the segment being closed:

TRACE 2024-02-09 00:40:06,123 [shard 1:main] storage - segment.cc:90 - closing segment: {offset_tracker:{term:1, base_offset:1919, committed_offset:2002, dirty_offset:2002}, compacted_segment=0, finished_self_compaction=0, finished_windowed_compaction=0, generation={59}, reader={/var/lib/redpanda/data/kafka/topic-hnegwjyeyf/8_52/1919-1-v1.log, (21242491 bytes)}, writer=nullptr, cache={cache_size=7}, compaction_index:nullopt, closed=1, tombstone=1, index={file:/var/lib/redpanda/data/kafka/topic-hnegwjyeyf/8_52/1919-1-v1.base_index, offsets:{1919}, index:{header_bitflags:0, base_offset:{1919}, max_offset:{2002}, base_timestamp:{timestamp: 1707439192434}, max_timestamp:{timestamp: 1707439193269}, batch_timestamps_are_monotonic:0, with_offset:true, non_data_timestamps:0, broker_timestamp:{{timestamp: 1707439194397}}, num_compactible_records_appended:{81}, index(27,27,27)}, step:32768, needs_persistence:0}} 

But after this point we see readers for the same segment still being created, which leads to the problem.

For what it's worth, it also looks like this might be a remote partition.

dotnwat avatar Feb 22 '24 19:02 dotnwat

I have a hunch that this may also be related to https://github.com/redpanda-data/redpanda/issues/16582 which also shows an exception from storage leaking up to fetch.cc.

dotnwat avatar Feb 22 '24 19:02 dotnwat

*https://buildkite.com/redpanda/redpanda/builds/45279 *https://buildkite.com/redpanda/vtools/builds/11971

vbotbuildovich avatar Feb 24 '24 00:02 vbotbuildovich

*https://buildkite.com/redpanda/vtools/builds/11978

vbotbuildovich avatar Feb 27 '24 06:02 vbotbuildovich

Given that this stopped happening around the time #16582 was fixed, and given Noah's comment above, I'm going to close this as presumed fixed.

piyushredpanda avatar Mar 23 '24 21:03 piyushredpanda