Federated rooms from other homeservers regularly stop syncing, probably caused by enabled retention
Description
We run a federated homeserver with a retention policy enabled. Rooms federated from other homeservers frequently
stop syncing. When this happens, the log file shows KeyErrors like the one in the Relevant log output section below.
The problem can be temporarily resolved by deleting the event_ids causing KeyErrors from the event_forward_extremities table:
delete from event_forward_extremities where event_id = '...';
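Before deleting anything, it may help to list which forward extremities are affected. The following is only a sketch against an in-memory SQLite stand-in, not against a real Synapse database. Only `event_forward_extremities(event_id, room_id)` is shown in this report; the `event_to_state_groups` table and the join condition are assumptions about where the failing state group lookup gets its data, and the function names are hypothetical.

```python
import sqlite3

# Stand-in schema with only the columns needed for the check.
# ASSUMPTION: event_to_state_groups(event_id, state_group) is the table
# behind the failing lookup; this report does not confirm that.
SCHEMA = """
CREATE TABLE event_forward_extremities (event_id TEXT, room_id TEXT);
CREATE TABLE event_to_state_groups (event_id TEXT, state_group INTEGER);
"""

# Forward extremities that have no state group mapping at all.
DANGLING_QUERY = """
SELECT fe.event_id, fe.room_id
FROM event_forward_extremities AS fe
LEFT JOIN event_to_state_groups AS sg ON sg.event_id = fe.event_id
WHERE sg.event_id IS NULL
ORDER BY fe.event_id
"""


def find_dangling_extremities(conn):
    """Return (event_id, room_id) rows that lack a state group mapping."""
    return conn.execute(DANGLING_QUERY).fetchall()


def demo():
    """Build a tiny example database and run the check on it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO event_forward_extremities VALUES (?, ?)",
        [
            ("$kept", "!room_a:example.org"),
            ("$purged", "!room_b:example.org"),
        ],
    )
    # Only "$kept" still has a state group; "$purged" simulates a purged event.
    conn.execute("INSERT INTO event_to_state_groups VALUES (?, ?)", ("$kept", 1))
    return find_dangling_extremities(conn)
```

On PostgreSQL the same SELECT could be run in psql first, so that the delete only touches rows that actually show up in the result.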
Steps to reproduce
Assuming my guess about the root cause is correct:
- Set up a homeserver with retention enabled
- Have users join a room from another Matrix server that has no retention times configured
- Wait until the max lifetime of an event in event_forward_extremities is reached and the retention policy is applied
- The room stops syncing
Homeserver
matrix-homeserver.uni-marburg.de
Synapse Version
1.113.0
Installation Method
Debian packages from packages.matrix.org
Database
PostgreSQL 13.16 (Debian 13.16-0+deb11u1); single server; not ported from sqlite, not restored from backup
Workers
Multiple workers
Platform
Debian 11, VM
Configuration
- federation enabled
- retention enabled
retention:
  enabled: true
  default_policy:
    min_lifetime: 1d
    max_lifetime: 240d
  allowed_lifetime_min: 1d
  allowed_lifetime_max: 3654d
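To estimate which events fall outside the default policy at a given moment, max_lifetime can be turned into a cutoff timestamp. This is a minimal sketch of that arithmetic only: the helper names are hypothetical, the parser handles just the s/m/h/d suffixes, and it is not Synapse's own duration parser or purge logic.

```python
from datetime import datetime, timedelta, timezone

# Seconds per unit suffix. ASSUMPTION: only these four suffixes are handled.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_lifetime(value: str) -> timedelta:
    """Convert a string such as '240d' into a timedelta."""
    number, unit = value[:-1], value[-1]
    return timedelta(seconds=int(number) * UNIT_SECONDS[unit])


def purge_cutoff(now: datetime, max_lifetime: str) -> datetime:
    """Events older than the returned timestamp exceed max_lifetime."""
    return now - parse_lifetime(max_lifetime)


if __name__ == "__main__":
    # Date of the log entry in this report, at midnight UTC.
    now = datetime(2024, 8, 20, tzinfo=timezone.utc)
    print(purge_cutoff(now, "240d").isoformat())
```

If the root-cause guess is right, a forward extremity older than this cutoff would be a candidate for the failure described here.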
Relevant log output
2024-08-20 01:03:32,798 - synapse.http.server - 147 - ERROR - POST-2775242 - Failed handle request via 'ReplicationFederationSendEventsRestServlet': <SynapseRequest at 0x7fa10cfed070 method='POST' uri='/_synapse/replication/fed_send_events/jOVEPEWaEE' clientproto='HTTP/1.1' site='9008'>
Traceback (most recent call last):
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/twisted/internet/defer.py", line 2010, in _inlineCallbacks
    result = context.run(
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/twisted/python/failure.py", line 545, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/util/caches/response_cache.py", line 265, in cb
    return await callback(*args, **kwargs)
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/replication/http/federation.py", line 153, in _handle_request
    max_stream_id = await self.federation_event_handler.persist_events_and_notify(
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/handlers/federation_event.py", line 2271, in persist_events_and_notify
    ) = await self._storage_controllers.persistence.persist_events(
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/logging/opentracing.py", line 921, in _wrapper
    return await func(*args, **kwargs)
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/controllers/persist_events.py", line 427, in persist_events
    ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/util/async_helpers.py", line 305, in yieldable_gather_results
    raise dfe.subFailure.value from None
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/twisted/internet/defer.py", line 2010, in _inlineCallbacks
    result = context.run(
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/twisted/python/failure.py", line 545, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/controllers/persist_events.py", line 422, in enqueue
    return await self._event_persist_queue.add_to_queue(
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/controllers/persist_events.py", line 245, in add_to_queue
    res = await make_deferred_yieldable(end_item.deferred.observe())
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/controllers/persist_events.py", line 288, in handle_queue_loop
    ret = await self._per_item_callback(room_id, item.task)
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/controllers/persist_events.py", line 368, in _process_event_persist_queue_task
    return await self._persist_event_batch(room_id, task)
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/controllers/persist_events.py", line 616, in _persist_event_batch
    ) = await self._calculate_new_forward_extremities_and_state_delta(
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/controllers/persist_events.py", line 708, in _calculate_new_forward_extremities_and_state_delta
    res = await self._get_new_state_after_events(
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/controllers/persist_events.py", line 894, in _get_new_state_after_events
    old_state_groups = {
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/controllers/persist_events.py", line 895, in <setcomp>
    event_id_to_state_group[evid] for evid in old_latest_event_ids
KeyError: '$i01NHjjt69O3x6hw409W3zVOn9xUNlXlINiCmH1q5XA'
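The last frame of the traceback is a set comprehension that indexes a dict with every old forward extremity. The sketch below reproduces just that shape with made-up data, to show how a single extremity without a state group entry is enough to raise the KeyError. It is not Synapse's code; `old_state_groups` and `missing_extremities` are hypothetical helper names, and the variable names are borrowed from the traceback.

```python
def old_state_groups(event_id_to_state_group, old_latest_event_ids):
    """Same shape as the failing set comprehension in the traceback."""
    return {event_id_to_state_group[evid] for evid in old_latest_event_ids}


def missing_extremities(event_id_to_state_group, old_latest_event_ids):
    """List the extremities that would trigger the KeyError."""
    return sorted(
        evid for evid in old_latest_event_ids
        if evid not in event_id_to_state_group
    )


if __name__ == "__main__":
    # "$purged" stands for an extremity whose state group mapping is gone.
    mapping = {"$kept": 1}
    extremities = ["$kept", "$purged"]
    print(missing_extremities(mapping, extremities))
    try:
        old_state_groups(mapping, extremities)
    except KeyError as exc:
        print("KeyError:", exc)
```

This matches the observed behaviour that removing the offending event_id from event_forward_extremities makes the room sync again, although it does not prove that retention is what removed the mapping.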
Anything else that would be useful to know?
The event_id in the traceback above belongs to #python:matrix.org
select * from event_forward_extremities where event_id = '$i01NHjjt69O3x6hw409W3zVOn9xUNlXlINiCmH1q5XA';
                   event_id                   |            room_id
----------------------------------------------+--------------------------------
 $i01NHjjt69O3x6hw409W3zVOn9xUNlXlINiCmH1q5XA | !iuyQXswfjgxQMZGrfQ:matrix.org
(1 row)