Sync To User perf improvements
This perf improvement exercise began with a buggy release that caused users without block-sync enabled to start resyncing (the issue was patched later). The impact on the DB and app servers was completely unsustainable, and it made us realize that the same would happen when we start rolling out block-sync for real.
We wrote a quick script that simulates what would happen if all users in India started block-level syncing, and ran it on the perf envs, which have real production data.
The following is the series of improvements we made by looking at traces and finding bottlenecks.
Query optimizations
We ran the script for 100 users first (100 coroutines across 6-8 threads), and the bottlenecks were completely DB-bound (datadog ref):

There were mainly two slow queries, which are very similar:
# BPs belonging to the current facility's registered patients
SELECT ? FROM blood_pressures
WHERE blood_pressures.patient_id IN (<patients subquery>)
  AND blood_pressures.patient_id IN (<current facility patients>)
  AND (blood_pressures.updated_at >= ?)
ORDER BY blood_pressures.updated_at ASC LIMIT ?

# BPs belonging to patients other than the current facility
SELECT ? FROM blood_pressures
WHERE blood_pressures.patient_id IN (<patients subquery>)
  AND blood_pressures.patient_id NOT IN (<current facility patients>)
  AND (blood_pressures.updated_at >= ?)
ORDER BY blood_pressures.updated_at ASC LIMIT ?
# The patients subquery is syncable_patients for a region
SELECT patients.id FROM (
  ((SELECT patients.* FROM patients WHERE patients.registration_facility_id IN ( ? ))
    UNION
   (SELECT patients.* FROM patients WHERE patients.assigned_facility_id IN ( ? )))
  UNION
  (SELECT patients.* FROM patients
    INNER JOIN appointments
      ON appointments.deleted_at IS ? AND appointments.patient_id = patients.id
    WHERE appointments.facility_id IN ( ? ))
) patients
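For reference, the Ruby behind these queries is roughly this shape (a sketch, not the exact code; appointed_patients is a stand-in name for the appointments-based scope):

class Region < ApplicationRecord
  # Patients registered at, assigned to, or with an appointment at a
  # facility in this region.
  def syncable_patients
    registered_patients.with_discarded
      .union(assigned_patients.with_discarded)
      .union(appointed_patients.with_discarded)
  end
end

class BloodPressure < ApplicationRecord
  # e.g. BloodPressure.syncable_to_region(region)
  scope :syncable_to_region, ->(region) {
    with_discarded.where(patient: region.syncable_patients)
  }
end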
Index on (patient_id, updated_at) instead of just updated_at
On further investigation, we found that the syncable_patients subquery on its own was decent, whereas even a simple BP query was far too slow:
SELECT "blood_pressures".* FROM "blood_pressures"
WHERE "blood_pressures"."patient_id" IN (<1k ids>)
AND (blood_pressures.updated_at >= '2018-12-31 18:30:00 BC')
ORDER BY "blood_pressures"."updated_at" ASC LIMIT 1000
This took ~2-3s for a regular facility group. We probed some initial suspects:
- Too many patient IDs in the IN clause - the query was slow for a handful of patients as well.
- Missing indexes on updated_at and patient_id - both indexes were present.
The query explain suggested that Postgres was not using the patient_id index at all and relied solely on the updated_at index. When we tried the query without the ORDER BY updated_at clause, it was drastically faster (down to ~100ms). This time it used the patient_id index, which was surprising, since that index had been there all along.
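The plan is easy to pull up from a Rails console with ActiveRecord's #explain; a sketch of the probe we were running (patient_ids and the "beginning of time" value are placeholders):

BloodPressure
  .where(patient_id: patient_ids)            # the <1k ids> from above
  .where("updated_at >= ?", Time.zone.at(0)) # a re-sync scans from the beginning of time
  .order(updated_at: :asc)
  .limit(1000)
  .explain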
Separately, removing the updated_at index (and keeping the patient_id index) also sped up the query; the query planner then used the patient_id index (see query explain). We don't have a definite answer for why the query planner chooses one index over the other, but we have a guess for why the updated_at index was slower:
- When we do a re-sync, we search through data from the beginning of time, which means filtering by updated_at doesn't narrow down any records. Narrowing records by patient_id first is more efficient.
- When we do a regular sync, we search through only the latest records (BPs added since the last sync, for example). An updated_at index does help in this case over having just a patient_id index.
To cover both scenarios, we've added a compound index on patient_id and updated_at (https://github.com/simpledotorg/simple-server/pull/1900). The standalone updated_at index is removed in https://github.com/simpledotorg/simple-server/pull/1901.
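A minimal sketch of what such a migration looks like (the actual migrations in the PRs may differ in details like the Rails version tag or index options):

class AddPatientIdUpdatedAtIndexToBloodPressures < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!

  def change
    # Compound index: narrow by patient_id first, then filter/sort by updated_at.
    add_index :blood_pressures, [:patient_id, :updated_at], algorithm: :concurrently
  end
end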
This change had the most impact on sync health overall. P99 latencies in the stress test went from a consistent 6s+ down to ~3s. It also brought marked (read: insane) improvements to DB CPU usage.
Splitting current_facility_records and other_facility_records
After fixing the index we found that the other_facility_records query was still slow (datadog ref).

This was easy to pinpoint to the NOT query.
def other_facility_records
  other_facilities_limit = limit - current_facility_records.count
  region_records
    .where.not(patient: current_facility_patients)
    .updated_on_server_since(other_facilities_processed_since, other_facilities_limit)
end

# region_records
def region_records
  model.syncable_to_region(current_sync_region)
end
One would assume this query to:
- Fetch region_records
- Filter them by the not condition
- Filter them by updated_at and limit
However, Postgres has no way of figuring out that the not query operates on a subset of region_records. The not is actually applied to the whole table, and Postgres combines the results at the end (EXPLAIN). This can be fixed with some query refactoring: we can work out which patients to fetch the records for in Ruby, like so:
model_sync_scope
  .where(id: current_sync_region.syncable_patients - current_facility.syncable_patients)
  .updated_on_server_since(other_facilities_processed_since, other_facilities_limit)

# model_sync_scope
def model_sync_scope
  model.with_discarded
end
Similarly, since current_facility_records also filters by registered_patients, it can be made more performant:
# Before
def current_facility_records
  region_records
    .where(patient: current_facility.registered_patients.with_discarded)
    .updated_on_server_since(current_facility_processed_since, limit)
end

# After
def current_facility_records
  model_sync_scope
    .where(id: current_facility.syncable_patients)
    .updated_on_server_since(current_facility_processed_since, limit)
end
Fixed in https://github.com/simpledotorg/simple-server/pull/1897
OR instead of UNION
Doing .or(assigned_patients) instead of .union(assigned_patients) in syncable_patients makes the query slightly faster as expected.
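Roughly, the change looks like this (a sketch; the real syncable_patients also handles patients with appointments in the region, and union is our relation helper rather than stock ActiveRecord):

# Before: two subqueries combined with a UNION
def syncable_patients
  registered_patients.with_discarded
    .union(assigned_patients.with_discarded)
end

# After: a single query with OR-ed conditions
def syncable_patients
  registered_patients.with_discarded
    .or(assigned_patients.with_discarded)
end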
Controller Optimizations
Activerecord instantiation

We noticed a lot of time being taken by Patient activerecord.instantiation. This was happening in the freshly added where clauses: subtracting one relation from another instantiates every Patient, when all we need are the IDs. An easy fix that shaved off a handsome 15% in response time:
# Before
model_sync_scope
  .where(id: current_sync_region.syncable_patients - current_facility.syncable_patients)

# After
model_sync_scope
  .where(id: current_sync_region.syncable_patients.pluck(:id) - current_facility.syncable_patients.pluck(:id))
Oj for JSON dumping
Since our JSON payloads are quite large (~100kB), we expected a faster JSON dumping replacement to save us some time. We found Oj to be a good fit: it made JSON dumping 2.5x faster and cut roughly 10% of CPU load during stress tests.
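Wiring it up is essentially a one-liner in an initializer (a minimal sketch using Oj's Rails optimizations; our actual configuration may differ):

# config/initializers/oj.rb
require "oj"

# Have Oj take over Rails' default JSON encoding/decoding paths.
Oj.optimize_rails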
Facilities controller
The facilities controller was firing N+1 queries to get each facility's block region id (8000 queries in total 😬). This warranted adding a with_block_region_id scope to fetch the block id in an includes: https://github.com/simpledotorg/simple-server/pull/1896
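Purely for illustration, here is the shape of such a scope, joining each facility to its block region in a single query (this assumes regions form an ltree hierarchy where a facility's own region has source_id = facilities.id and its block is an ancestor with region_type 'block'; the actual scope in the PR may be wired differently):

class Facility < ApplicationRecord
  # Usage: Facility.with_block_region_id.each { |f| f.block_region_id }
  scope :with_block_region_id, -> {
    joins(<<~SQL).select("facilities.*, block_regions.id AS block_region_id")
      INNER JOIN regions facility_regions
              ON facility_regions.source_id = facilities.id
      INNER JOIN regions block_regions
              ON block_regions.path @> facility_regions.path
             AND block_regions.region_type = 'block'
    SQL
  }
end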
Note on datadog
Datadog.tracer.trace is your friend. This helped a ton in breaking apart where the controller's time was going. Example usage:
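A minimal sketch (the span/tag names and other_facility_patient_ids are illustrative, not the ones we actually used):

def other_facility_records
  # Wrap the suspect chunk in a custom span so it shows up as its own
  # block in the request's flame graph.
  Datadog.tracer.trace("sync_to_user.other_facility_records") do |span|
    span.set_tag("facility", current_facility.id)
    model_sync_scope
      .where(id: other_facility_patient_ids)
      .updated_on_server_since(other_facilities_processed_since, other_facilities_limit)
  end
end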
