Sync To User perf improvements
This perf improvement exercise began with a buggy release that caused users without block-sync enabled to start resyncing (the issue was patched later). The impact on the DB and app servers was completely unsustainable, and it made us realize that the same would happen when we start rolling out block-sync for real.
We wrote a quick script that simulates what would happen if all users in India started block-level syncing, and ran it on the perf envs, which have real production data.
The following is the series of improvements we made by looking at traces and finding bottlenecks.
Query optimizations
We ran the script for 100 users first (100 coroutines across 6-8 threads), and the bottlenecks were completely DB-bound (datadog ref):

There were mainly two slow queries, which are very similar:
# BPs belonging to the current facility's registered patients
SELECT ? FROM blood_pressures
WHERE blood_pressures.patient_id IN (<patients subquery>)
  AND blood_pressures.patient_id IN (<current facility patients>)
  AND (blood_pressures.updated_at >= ?)
ORDER BY blood_pressures.updated_at ASC LIMIT ?

# BPs belonging to patients other than the current facility
SELECT ? FROM blood_pressures
WHERE blood_pressures.patient_id IN (<patients subquery>)
  AND blood_pressures.patient_id NOT IN (<current facility patients>)
  AND (blood_pressures.updated_at >= ?)
ORDER BY blood_pressures.updated_at ASC LIMIT ?
# The patients subquery is syncable_patients for a region
SELECT patients.id FROM (
  ((SELECT patients.* FROM patients WHERE patients.registration_facility_id IN ( ? ))
    UNION
   (SELECT patients.* FROM patients WHERE patients.assigned_facility_id IN ( ? )))
  UNION
  (SELECT patients.* FROM patients
    INNER JOIN appointments
      ON appointments.deleted_at IS ? AND appointments.patient_id = patients.id
    WHERE appointments.facility_id IN ( ? ))
) patients
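For reference, the Ruby behind these queries is roughly this shape (a sketch, not the exact code; appointed_patients is a stand-in name for the appointments-based scope):

class Region < ApplicationRecord
  # Patients registered at, assigned to, or with an appointment at a
  # facility in this region.
  def syncable_patients
    registered_patients.with_discarded
      .union(assigned_patients.with_discarded)
      .union(appointed_patients.with_discarded)
  end
end

class BloodPressure < ApplicationRecord
  # e.g. BloodPressure.syncable_to_region(region)
  scope :syncable_to_region, ->(region) {
    with_discarded.where(patient: region.syncable_patients)
  }
end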
Index on (patient_id, updated_at) instead of just updated_at
On further investigation, we found that the syncable_patients subquery on its own was decent, whereas even a simple BP query was far too slow:
SELECT "blood_pressures".* FROM "blood_pressures"
WHERE "blood_pressures"."patient_id" IN (<1k ids>)
AND (blood_pressures.updated_at >= '2018-12-31 18:30:00 BC')
ORDER BY "blood_pressures"."updated_at" ASC LIMIT 1000
This took ~2-3s for a regular facility group. We probed some initial suspects:
- Too many patient IDs in the IN clause - the query was slow for a handful of patients as well.
- Missing indexes on updated_at and patient_id - both indexes were present.
The query explain suggested that Postgres was not using the patient_id index at all and relied solely on the updated_at index. When we tried the query without the ORDER BY updated_at clause, it was drastically faster (down to ~100ms). This time it used the patient_id index, which was surprising, since that index had been there all along.
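The plan is easy to pull up from a Rails console with ActiveRecord's #explain; a sketch of the probe we were running (patient_ids and the "beginning of time" value are placeholders):

BloodPressure
  .where(patient_id: patient_ids)            # the <1k ids> from above
  .where("updated_at >= ?", Time.zone.at(0)) # a re-sync scans from the beginning of time
  .order(updated_at: :asc)
  .limit(1000)
  .explain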
Separately, removing the updated_at index (and keeping the patient_id index) also sped up the query; the query planner then used the patient_id index (see query explain). We don't have a definite answer for why the query planner chooses one index over the other, but we have a guess for why the updated_at index was slower:
- When we do a re-sync, we search through data from the beginning of time, which means filtering by updated_at doesn't narrow down any records. Narrowing records by patient_id first is more efficient.
- When we do a regular sync, we search through only the latest records (BPs added since the last sync, for example). An updated_at index does help in this case over having just a patient_id index.
To cover both scenarios, we've added a compound index on patient_id and updated_at (https://github.com/simpledotorg/simple-server/pull/1900). The standalone updated_at index is removed in https://github.com/simpledotorg/simple-server/pull/1901.
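A minimal sketch of what such a migration looks like (the actual migrations in the PRs may differ in details like the Rails version tag or index options):

class AddPatientIdUpdatedAtIndexToBloodPressures < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!

  def change
    # Compound index: narrow by patient_id first, then filter/sort by updated_at.
    add_index :blood_pressures, [:patient_id, :updated_at], algorithm: :concurrently
  end
end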
This change had the most impact on sync health overall. P99 latencies in the stress test went from a consistent 6s+ down to ~3s. It also brought marked (read: insane) improvements to DB CPU usage.
Splitting current_facility_records and other_facility_records
After fixing the index we found that the other_facility_records query was still slow (datadog ref).

This was easy to pinpoint to the NOT query.
def other_facility_records
  other_facilities_limit = limit - current_facility_records.count
  region_records
    .where.not(patient: current_facility_patients)
    .updated_on_server_since(other_facilities_processed_since, other_facilities_limit)
end

# region_records
def region_records
  model.syncable_to_region(current_sync_region)
end
One would assume this query to:
- Fetch region_records
- Filter them by the not condition
- Filter them by updated_at and limit
However, Postgres has no way of figuring out that the not query operates on a subset of region_records. The not is actually applied to the whole table, and Postgres combines the results at the end (EXPLAIN). This can be fixed with some query refactoring: we can work out which patients to fetch the records for in Ruby, like so:
model_sync_scope
  .where(id: current_sync_region.syncable_patients - current_facility.syncable_patients)
  .updated_on_server_since(other_facilities_processed_since, other_facilities_limit)

# model_sync_scope
def model_sync_scope
  model.with_discarded
end
Similarly, since current_facility_records also filters by registered_patients, it can be made more performant:
# Before
def current_facility_records
  region_records
    .where(patient: current_facility.registered_patients.with_discarded)
    .updated_on_server_since(current_facility_processed_since, limit)
end

# After
def current_facility_records
  model_sync_scope
    .where(id: current_facility.syncable_patients)
    .updated_on_server_since(current_facility_processed_since, limit)
end
Fixed in https://github.com/simpledotorg/simple-server/pull/1897
OR instead of UNION
Doing .or(assigned_patients) instead of .union(assigned_patients) in syncable_patients makes the query slightly faster as expected.
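Roughly, the change looks like this (a sketch; the real syncable_patients also handles patients with appointments in the region, and union is our relation helper rather than stock ActiveRecord):

# Before: two subqueries combined with a UNION
def syncable_patients
  registered_patients.with_discarded
    .union(assigned_patients.with_discarded)
end

# After: a single query with OR-ed conditions
def syncable_patients
  registered_patients.with_discarded
    .or(assigned_patients.with_discarded)
end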
Controller Optimizations
Activerecord instantiation

We noticed a lot of time being taken by Patient activerecord.instantiation. This was happening in the freshly added where clauses: subtracting one relation from another instantiates every Patient, when all we need are the IDs. An easy fix that shaved off a handsome 15% in response time:
# Before
model_sync_scope
  .where(id: current_sync_region.syncable_patients - current_facility.syncable_patients)

# After
model_sync_scope
  .where(id: current_sync_region.syncable_patients.pluck(:id) - current_facility.syncable_patients.pluck(:id))
Oj for JSON dumping
Since our JSON payloads are quite large (~100kB), we expected a faster JSON dumping replacement to save us some time. We found Oj to be a good fit: it made JSON dumping 2.5x faster and cut roughly 10% of CPU load during stress tests.
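Wiring it up is essentially a one-liner in an initializer (a minimal sketch using Oj's Rails optimizations; our actual configuration may differ):

# config/initializers/oj.rb
require "oj"

# Have Oj take over Rails' default JSON encoding/decoding paths.
Oj.optimize_rails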
Facilities controller
The facilities controller was firing N+1 queries to get each facility's block region id (8000 queries in total 😬). This warranted adding a with_block_region_id scope to fetch the block id in an includes: https://github.com/simpledotorg/simple-server/pull/1896
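Purely for illustration, here is the shape of such a scope, joining each facility to its block region in a single query (this assumes regions form an ltree hierarchy where a facility's own region has source_id = facilities.id and its block is an ancestor with region_type 'block'; the actual scope in the PR may be wired differently):

class Facility < ApplicationRecord
  # Usage: Facility.with_block_region_id.each { |f| f.block_region_id }
  scope :with_block_region_id, -> {
    joins(<<~SQL).select("facilities.*, block_regions.id AS block_region_id")
      INNER JOIN regions facility_regions
              ON facility_regions.source_id = facilities.id
      INNER JOIN regions block_regions
              ON block_regions.path @> facility_regions.path
             AND block_regions.region_type = 'block'
    SQL
  }
end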
Note on datadog
Datadog.tracer.trace is your friend. This helped a ton in breaking apart where the controller's time was going. Example usage:
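A minimal sketch (the span/tag names and other_facility_patient_ids are illustrative, not the ones we actually used):

def other_facility_records
  # Wrap the suspect chunk in a custom span so it shows up as its own
  # block in the request's flame graph.
  Datadog.tracer.trace("sync_to_user.other_facility_records") do |span|
    span.set_tag("facility", current_facility.id)
    model_sync_scope
      .where(id: other_facility_patient_ids)
      .updated_on_server_since(other_facilities_processed_since, other_facilities_limit)
  end
end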
