Race condition during sync of large projects can block dbsync
Description
A race condition exists in dbsync that can block the synchronization process, especially with large projects that take a long time to download. When dbsync initiates a pull operation, and another client pushes a new version to the Mergin Maps server before the pull is complete, dbsync ends up with an outdated local version of the project.
This leads to a failure in the subsequent push operation because of a strict version check that ensures the local version matches the server version. The `push` function raises the error "There are pending changes on server - need to pull them first." This creates a loop where `dbsync` is stuck trying to pull, but each pull is slow and susceptible to the same race condition, requiring manual intervention such as `--force-init`, which can lead to data loss.
Why --force-init is not a solution
Using --force-init is a heavy-handed approach that wipes the local state and re-initializes the synchronization from scratch. This is not a viable solution in a production environment for several reasons:
- Data Loss: If there are changes in the PostgreSQL database that have not been pushed to the Mergin Maps server, a `--force-init` will wipe the `base` and `modified` schemas and re-create them from the GeoPackage file. This will cause any changes made in the database to be lost.
- Manual Intervention: The need for manual intervention defeats the purpose of an automated synchronization daemon.
- Downtime: The re-initialization process can be time-consuming for large projects, leading to extended downtime for the synchronization service.
The problematic version check is located in the `push` function in `dbsync.py`:

```python
# dbsync.py in push()
# ...
# check there are no pending changes on server
if server_version != local_version:
    raise DbSyncError("There are pending changes on server - need to pull them first.")
```
Real-world Scenario
- T0: `dbsync` starts a `pull` operation for a large project with many photos. The server is at version `v100`. The download is expected to take over a minute.
- T0 + 30s: A surveyor in the field finishes their work and syncs their mobile client. This creates version `v101` on the Mergin Maps server.
- T0 + 90s: `dbsync` completes its download of `v100` and applies the changes to the PostgreSQL database. The local project version for `dbsync` is now `v100`.
- T0 + 95s: The `dbsync` daemon proceeds to the `push` step to sync changes from the database back to Mergin Maps.
- Failure: The `push` operation detects that the server is at `v101` while the local version is `v100`. It aborts the push, and `dbsync` is effectively blocked.
Proposed Solution
To resolve this, the push function should be made more resilient. Instead of immediately failing upon a version mismatch, it should attempt to resolve the situation automatically by pulling the latest changes.
The proposed solution is to modify the `push` function in `dbsync.py`. When a version mismatch is detected, `dbsync` should:

- Automatically trigger the `pull` function. The existing `pull` function is capable of handling a rebase of local database changes on top of the incoming server changes.
- After the `pull` is complete, re-check the version.
- If the versions now match, proceed with the `push` operation.
- If the versions still do not match after the automatic pull, then raise an error, as this would indicate a more serious problem that requires manual intervention.
This "pull-and-retry" mechanism would make the synchronization process more robust for projects with long download times and active collaboration, avoiding the need for manual resets.
Hi @dracic, thanks for the detailed report. We are currently working on a new pull / push mechanism in both the server and the py-client to make it more resilient, including a retry mechanism where it makes sense. Once we have these, we will propagate the changes to db-sync. Another planned improvement for the db-sync pull would be to download only the files that are needed, to make it faster.
However, I am not sure we have the capacity to fix the current pull / push issues, as the focus is to get the new sync released later this year.
Ok, I will make a PR as a workaround for this until the new version with new py-client is ready.
That would be great, thanks @dracic
@dracic Thanks for the write-up. I am confused, though, about why db-sync gets blocked when that exception ("There are pending changes on server - need to pull them first.") happens - looking at the code, it should simply abort the upcoming push, recover in the db-sync daemon's main loop, sleep for a bit and then start another pull. Can you please clarify what happens on your end after that exception is raised - does the db-sync daemon just stop altogether?
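(For context, the recovery path described in the comment above would look roughly like the sketch below. This is illustrative only, not the actual daemon code, and `sleep_time` stands in for whatever interval the config provides.)

```python
import time

def daemon_loop(config, sleep_time=10):
    # Illustrative sketch of the expected recovery behaviour: a failed
    # push raises DbSyncError, the loop logs it, sleeps, and the next
    # iteration starts with a fresh pull.
    while True:
        try:
            pull(config)
            push(config)
        except DbSyncError as exc:
            print("Sync error:", exc)
        time.sleep(sleep_time)
```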
I really don't know. For now I'm stopping the dbsync container during work hours, but here it is again:
```
== starting mergin-db-sync daemon == version 2.1.1 ==
Using config file: /config/config-tc.yaml
Logging in to Mergin...
Going to log in user dbsync
User dbsync successfully logged in.
Processing Mergin Maps project 'mergin/test'
Connecting to the database...
Modified and base schemas already exist
Working directory /tmp/dbsync/test already exists, with project version v1531
There are pending changes on server, please run pull command after init
Checking GeoPackage content...
Checking 'base' schema content...
Local project version at v1531 and base schema at v1531
Base schema changes:
trees 0 1 0
trees_photo 0 2 0
The output GPKG file exists already but is not synchronized with db 'base' schema. Running `dbsync_deamon.py` with `--force-init` should fix the issue.
```
And it continues. Now I'm on v1559 on Mergin, and the Postgres database is on v1531. So it is obvious that `--force-init` is not the solution, because we did init from the db. Simple setup:
```yaml
# How to connect to Mergin Maps server
mergin:
  url: https://mergin.test.com
  username: dbsync
  password: Passw0rd

init_from: db

connections:
  - driver: postgres
    conn_info: "host=172.17.0.1 dbname=test user=dbsync password=PgPass"
    modified: public
    base: base
    mergin_project: mergin/test
    sync_file: data.gpkg
    skip_tables:
      - spatial_ref_sys
```