Race condition during sync of large projects can block dbsync

Open dracic opened this issue 4 months ago • 5 comments

Description

A race condition exists in dbsync that can block the synchronization process, especially with large projects that take a long time to download. When dbsync initiates a pull operation, and another client pushes a new version to the Mergin Maps server before the pull is complete, dbsync ends up with an outdated local version of the project.

This causes the subsequent push to fail, because a strict version check requires the local version to match the server version. The push function raises an error: "There are pending changes on server - need to pull them first." The result is a loop: dbsync keeps trying to pull, but each pull is slow and exposed to the same race condition, so recovery requires manual intervention such as --force-init, which can lead to data loss.

Why --force-init is not a solution

Using --force-init is a heavy-handed approach that wipes the local state and re-initializes the synchronization from scratch. This is not a viable solution in a production environment for several reasons:

  • Data Loss: If there are changes in the PostgreSQL database that have not been pushed to the Mergin Maps server, a --force-init will wipe the base and modified schemas and re-create them from the GeoPackage file. This will cause any changes made in the database to be lost.
  • Manual Intervention: The need for manual intervention defeats the purpose of an automated synchronization daemon.
  • Downtime: The re-initialization process can be time-consuming for large projects, leading to extended downtime for the synchronization service.

The problematic version check is located in the push function in dbsync.py:

# dbsync.py in push()
# ...
# check there are no pending changes on server
if server_version != local_version:
    raise DbSyncError("There are pending changes on server - need to pull them first.")

Real-world Scenario

  1. T0: dbsync starts a pull operation for a large project with many photos. The server is at version v100. The download is expected to take over a minute.
  2. T0 + 30s: A surveyor in the field finishes their work and syncs their mobile client. This creates version v101 on the Mergin Maps server.
  3. T0 + 90s: dbsync completes its download of v100 and applies the changes to the PostgreSQL database. The local project version for dbsync is now v100.
  4. T0 + 95s: The dbsync daemon proceeds to the push step to sync changes from the database back to Mergin Maps.
  5. Failure: The push operation detects that the server is at v101 while the local version is v100. It aborts the push, and dbsync is effectively blocked.
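The timeline above can be reproduced with a toy model. This is only an illustrative sketch: FakeServer, slow_pull, and run_scenario are hypothetical names and are not part of dbsync or the Mergin Maps client; only the version check and its error message are taken from the snippet quoted earlier.

```python
# Toy model of the race; all names except the error message are hypothetical.

class DbSyncError(Exception):
    pass


class FakeServer:
    def __init__(self, version):
        self.version = version


def slow_pull(server, on_midway=None):
    """Download the version seen at the start; another client may push meanwhile."""
    target = server.version      # T0: the pull targets v100
    if on_midway:
        on_midway()              # T0 + 30s: a surveyor's sync creates v101
    return target                # T0 + 90s: the local project is at v100


def push(server, local_version):
    # Same strict check as in the quoted dbsync.py snippet
    if server.version != local_version:
        raise DbSyncError("There are pending changes on server - need to pull them first.")
    server.version += 1
    return server.version


def run_scenario():
    server = FakeServer(100)

    def surveyor_sync():
        server.version += 1

    local = slow_pull(server, on_midway=surveyor_sync)
    try:
        push(server, local)
        return "pushed"
    except DbSyncError:
        return f"blocked: local v{local}, server v{server.version}"


print(run_scenario())
```

Any push that lands on the server while the download is in progress leaves the local version one step behind, so the strict check fails no matter how quickly the push step follows the pull.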

Proposed Solution

To resolve this, the push function should be made more resilient. Instead of immediately failing upon a version mismatch, it should attempt to resolve the situation automatically by pulling the latest changes.

The proposed solution is to modify the push function in dbsync.py. When a version mismatch is detected, dbsync should:

  1. Automatically trigger the pull function. The existing pull function is capable of handling a rebase of local database changes on top of the incoming server changes.
  2. After the pull is complete, re-check the version.
  3. If the versions now match, proceed with the push operation.
  4. If the versions still do not match after the automatic pull, then raise an error, as this would indicate a more serious problem that requires manual intervention.

This "pull-and-retry" mechanism would make the synchronization process more robust for projects with long download times and active collaboration, avoiding the need for manual resets.
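A minimal sketch of the four steps, assuming the version lookups, the pull, and the upload can be passed in as callables. The function name push_with_pull_retry and its four parameters are hypothetical stand-ins for illustration, not the actual dbsync internals.

```python
class DbSyncError(Exception):
    pass


def push_with_pull_retry(get_server_version, get_local_version, pull, do_push):
    """Push; on a version mismatch, pull once and re-check before giving up.

    All four callables are hypothetical stand-ins for dbsync internals.
    """
    if get_server_version() != get_local_version():
        # Step 1: pull, rebasing local database changes on top of server changes
        pull()
        # Step 2: re-check the versions after the pull
        if get_server_version() != get_local_version():
            # Step 4: still out of date, so give up and report it
            raise DbSyncError(
                "Versions still differ after automatic pull - manual intervention needed."
            )
    # Step 3: versions match, proceed with the push
    do_push()


# Usage with in-memory state standing in for the server and the local project
state = {"server": 101, "local": 100, "pushed": False}


def fake_pull():
    state["local"] = state["server"]


def fake_push():
    state["pushed"] = True


push_with_pull_retry(lambda: state["server"], lambda: state["local"], fake_pull, fake_push)
```

The sketch retries only once, so a further push from another client during the retry pull still surfaces as an error, as step 4 describes.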

dracic avatar Sep 05 '25 18:09 dracic

Hi @dracic, thanks for the detailed report. We are currently working on a new pull/push mechanism in both the server and py-client to make it more resilient, including a retry mechanism (where it makes sense). Once we have these, we will propagate the changes to db-sync. Another planned improvement for db-sync pull is to download only the files needed, which would make it faster.

However, I am not sure we have the capacity to fix the current pull/push issues, as the focus is to get the new sync released later this year.

varmar05 avatar Sep 10 '25 10:09 varmar05

Ok, I will make a PR as a workaround for this until the new version with new py-client is ready.

dracic avatar Sep 15 '25 10:09 dracic

> Ok, I will make a PR as a workaround for this until the new version with new py-client is ready.

That would be great, thanks @dracic

varmar05 avatar Sep 16 '25 06:09 varmar05

@dracic Thanks for the write-up. I am confused though why db-sync gets blocked when that exception ("There are pending changes on server - need to pull them first.") happens - looking at the code, it should simply abort the upcoming push, and recover in the db-sync daemon's main loop, sleep for a bit and then start with another pull. Can you please clarify what happens on your end after that exception is raised - does db-sync daemon just stop altogether?
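The recovery behaviour described here can be sketched roughly as follows. This is not the actual daemon code: run_daemon, its parameters, and the max_cycles bound are assumptions made so the example terminates and can be run on its own.

```python
import time


class DbSyncError(Exception):
    pass


def run_daemon(pull, push, sleep_time, max_cycles, sleep=time.sleep):
    """Run pull/push cycles; a DbSyncError aborts the current cycle, not the loop."""
    errors = []
    for _ in range(max_cycles):
        try:
            pull()
            push()  # may raise on a version mismatch
        except DbSyncError as err:
            errors.append(str(err))  # record the failure and carry on
        sleep(sleep_time)            # wait, then start the next cycle with a pull
    return errors
```

Under this model a failed push costs one cycle, and the next cycle begins with a fresh pull.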

wonder-sk avatar Oct 01 '25 13:10 wonder-sk

I really don't know. For now I'm stopping the dbsync container during work hours, but it happened again:

== starting mergin-db-sync daemon == version 2.1.1 ==
Using config file: /config/config-tc.yaml
Logging in to Mergin...
Going to log in user dbsync
User dbsync successfully logged in.
Processing Mergin Maps project 'mergin/test'
Connecting to the database...
Modified and base schemas already exist
Working directory /tmp/dbsync/test already exists, with project version v1531
There are pending changes on server, please run pull command after init
Checking GeoPackage content...
Checking 'base' schema content...
Local project version at v1531 and base schema at v1531
Base schema changes:
trees                  0    1    0
trees_photo            0    2    0
The output GPKG file exists already but is not synchronized with db 'base' schema.Running `dbsync_deamon.py` with `--force-init` should fix the issue.

And it continues. Mergin is now at v1559 while Pg is still at v1531. So it is obvious that --force-init is not the solution, because we did the init from db. Simple setup:

# How to connect to Mergin Maps server
mergin:
  url: https://mergin.test.com
  username: dbsync
  password: Passw0rd

init_from: db

connections:
   - driver: postgres
     conn_info: "host=172.17.0.1 dbname=test user=dbsync password=PgPass"
     modified: public
     base: base

     mergin_project: mergin/test
     sync_file: data.gpkg
     skip_tables:
      - spatial_ref_sys

dracic avatar Nov 01 '25 13:11 dracic