SORMAS-Project Improve performance of getAllAfter queries into DTOs [8]

Problem Description

As shown by the following analysis, many getAllAfter methods show an inperformant pattern:

The "Entity"Service.getAllAfter method takes some seconds. As shown in https://github.com/hzi-braunschweig/SORMAS-Project/issues/8946#issuecomment-1129937176, this can be improved by initially fetching only the ids (reduced distinct effort) and using a dedicated index with appropriate sorting.
"Entity"Service.inJurisdictionOrOwned per each entity used for pseudonymization seems to be inperformant. For Cases it took ~330ms per entity, for Persons ~0,3ms per entity with in clause (PersonService.getInJurisdictionIDs).

Analysis

Dataset:

1225343 persons
85677 cases
114782 contacts
236583 tasks
2600890 immunizations
1991 vaccinations
15693 samples
11846 events
534 eventparticipants

The following measurements were taken from backend logs (EJB methods) and the Postgres logs. Observations:

the actual SQL queries are executed in few seconds while the EJB methods take minutes to complete
a relevant contribution to runtime comes from inJurisdictionOrOwned methods, here mainly the number of calls
batch size is relevant. Some requests did not complete within reasonable time for a batch size of 10000, these were measured for batch size 1000.

Requests:

http://localhost:6080/sormas-rest/persons/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 3 min 5 sec server: 2 min 58 sec SQL queries: duration: 948.716 ms duration: 2297.275 ms duration: 4012.220 ms duration: 4393.538 ms duration: 696.277 ms ~ 12.5 sec Note: SQL query times are inaccurate, either due to an error in the analysis or changes introduced later, see later analysis of this method.
http://localhost:6080/sormas-rest/cases/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 8 min 21 sec server: 8 min 18 sec SQL queries: duration: 787.011 ms duration: 449.616 ms duration: 146.224 ms duration: 178.515 ms duration: 154.552 ms duration: 136.836 ms duration: 142.597 ms duration: 143.688 ms duration: 150.584 ms duration: 152.298 ms duration: 133.814 ms duration: 147.687 ms ~ 3 sec
http://localhost:6080/sormas-rest/contacts/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 35 sec server: 35 sec SQL queries: duration: 151.413 ms duration: 143.237 ms duration: 152.661 ms duration: 144.054 ms duration: 145.374 ms duration: 155.861 ms duration: 177.212 ms duration: 153.647 ms duration: 201.516 ms duration: 187.382 ms ~ 1.6 sec
http://localhost:6080/sormas-rest/tasks/all/1637090372005/1000/NO_LAST_SYNCED_UUID: 1 min 27 sec server: 1 min 24 sec SQL queries: duration: 466.097 ms duration: 154.093 ms duration: 132.150 ms duration: 137.751 ms duration: 140.757 ms duration: 135.526 ms duration: 136.612 ms duration: 126.399 ms duration: 124.695 ms duration: 127.888 ms duration: 133.940 ms ~1.9 sec
http://localhost:6080/sormas-rest/samples/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 27 sec server: 27 sec SQL queries: duration: 129.587 ms
http://localhost:6080/sormas-rest/immunizations/all/1637090372005/1000/NO_LAST_SYNCED_UUID (1min 50 sec) server: 1 min 48 sec SQL queries: duration: 6985.128 ms duration: 106.608 ms ~ 7 sec

Proposed Change

[x] 1. Rewrite AdoServiceWithUserFilter.getAllAfter to first fetch the needed ids (see pattern in PersonService.getAllAfter, then fetch the entities by id with IN-clause (use BaseAdoService.getByIds).
[x] 2. Add indices for sorting.
[x] 3. Use the pattern of PersonService.getInJurisdictionIDs to query by ids with IN clause also for other entities where "Entity"Service.inJurisdictionOrOwned is currently running one query per Entity.

Acceptance Criteria

[x] An analysis before/after shows the performance improvement (also mentioning the amount of existing and queried entities). -> comment

Implementation Details

All getAllAfter and getInJurisdictionIDs methods avoid parameter limit exception with IterableHelper.executeBatched batching.
Remove or adapt not with superclass aligned getAllAfter implementations in:
- CampaignService.getAllAfter
- CampaignFormMetaService.getAllAfter
- ContactService.getAllAfter (active contacts filter)
- EventParticipantService.getAllAfter (not archived filter)
- EventService.getAllAfter (configured EventUserFilterCriteria)
- ImmunizationService.getAllAfter
- BaseTravelEntryService.getAllAfter
Remove or adapt not with superclass aligned implementations parallel to getAllAfter:
- CampaignFormDataService.getAllActiveAfter
- TreatmentService.getAllActiveTreatmentsAfter
- PathogenTestService.getAllActivePathogenTestsAfter
- PrescriptionService.getAllActivePrescriptionsAfter
- SampleService.getAllActiveSamplesAfter
- ActionService.getAllActionsAfter
- AdditionalTestService.getAllActiveAdditionalTestsAfter
- ClinicalVisitService.getAllActiveClinicalVisitsAfter
- TaskService.getAllActiveTasksAfter
- CaseService.getAllActiveCasesAfter with boolean includeExtendedChangeDateFilters that is never used outside of tests (introduced with #2059, partly removed by #2674).

Additional Information

Sibling to #8946
Batch jurisdiction processing for Persons was done in #2130 (commit 305b169).
I tried to use CREATE INDEX CONCURRENTLY but that does not work within a transaction as it is used in the automatic db migration. If deployments get stuck because of INDEX creation (not sure what the likelyhood is), they can be created with CREATE INDEX CONCURRENTLY IF NOT EXISTS ... before on running instance.
This ticket improves the performance of the following functions:
1. Sync for android-app (getAllAfter methods) for most to all entities.
2. List (getIndexList) for AdditionalTest, PathogenTest
3. List (getByUuids, getBy"Reference"/"criteria") for Case, Sample, Task, AdditionalTest, PathogenTest, Prescription, Treatment, Vaccination, Visit.

Performance Analysis

Preliminary analysis of performance before and after the changes introduced in PR #10282

(tested on dafd47df2f756277323e15d28a3c6f678b1864e6)

The analysis was conducted by measuring the runtime of REST calls to the respective endpoints with batch size 1000. A substantial improvement can be observed for all entities with long response times:

cases (1000) .................00:05:28 -> 00:01:31
contacts (1000) ..............00:01:07 -> 00:00:30
immunizations (1000) .........00:01:16 -> 00:00:12
samples (1000) ...............00:01:41 -> 00:00:20
tasks (1000) .................00:01:06 -> 00:00:18
visits (1000) ................00:00:27 -> 00:00:04

An exception is pathogen test with no improvement / slightly worse performance:

pathogentests (1000) .........00:01:15 -> 00:01:26

The results based on the dataset of the initial analysis (see issue description). Further measurements will be performed as needed (e.g., retesting of individual endpoints, larger batch sizes etc.).

Analysis data: performance_9320.zip

Detailed analysis

Full list of REST calls Performance of REST calls (.../<entities>/all/) before the refacoring:

actions (1000) ...............00:00:00
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:00
cases (1000) .................00:05:28
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:01:07
continents (1000) ............00:00:00
countries (1000) .............00:00:01
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:07
events (1000) ................00:00:04
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:01:16
pathogentests (1000) .........00:01:15
persons (1000) ...............00:00:07
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:00
regions (1000) ...............00:00:00
samples (1000) ...............00:01:41
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:01:06
treatments (1000) ............00:00:00
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:27
weeklyreports (1000) .........00:00:01

... and after:

actions (1000) ...............00:00:01
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:01
cases (1000) .................00:01:31
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:00:30
continents (1000) ............00:00:00
countries (1000) .............00:00:00
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:05
events (1000) ................00:00:02
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:00:12
pathogentests (1000) .........00:01:26
persons (1000) ...............00:00:08
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:01
regions (1000) ...............00:00:00
samples (1000) ...............00:00:20
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:00:18
treatments (1000) ............00:00:01
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:04
weeklyreports (1000) .........00:00:00

Execution details for selected entities

Cases before:

after:

Here the calls to CaseService.inJurisdictionOrOwned per selected case have been dropped, roughly accounting for the measurde difference.

Contacts before:

after:

As for cases, a heavy reduction of calls to ContactService.inJurisdictionOrOwned.

Immunizations before:

after:

Here the main culprits were calls to ImmunizationService.inJurisdictionOrOwned and FeatureConfigurationFacadeEjb.isPropertyValueTrue

Tasks before:

after:

Here, calls to TaskService.inJurisdictionOrOwned on a per object basis accounted for most of the time spent.

Analysis for different batch sizes

A analysis with different batch sizes after completion of this task shows:

request durations are substantially improved (see times above)
there is little overhead (long running SQL queries etc.) for small batch sizes: for batch size 1, no entity exceeds a response time of 1second
for larger batch sizes, response time is approximately linear to the batch size
this is consistent across all entities (response time for 1000 pathogentests is down to 15 sec now)

Response times for selected entities:

cases
(1) ....................00:00:01
(10) ...................00:00:03
(100) ..................00:00:08
(500) ..................00:00:36
(1000) .................00:01:11
(5000) .................00:05:52

persons
(1) ..................00:00:01
(10) .................00:00:00
(100) ................00:00:01
(500) ................00:00:04
(1000) ...............00:00:07
(5000) ...............00:00:33

samples
(1) ..................00:00:01
(10) .................00:00:01
(100) ................00:00:02
(500) ................00:00:10
(1000) ...............00:00:17
(5000) ...............00:01:10

Analysis data: differentBatchSizes.zip

May 24 '22 12:05 StefanKock

Hi there, I am not sure if this is related but since recently we are having issues to get all tasks created within the last 24 hours via /tasks/all/{since} No matter the size of the response e.g. even if only 1 task will be returned matching the time criteria. No response even for 5 minutes. The highest possible timespan that we got working on some instances is last 4 hours. Even that one takes a couple of minutes to respond. Let us know if you need more info around the issue and keep us posted about the outcome here.

May 31 '22 10:05 ClimedoHealthCodingTests

Hi there, I am not sure if this is related but since recently we are having issues to get all tasks created within the last 24 hours via /tasks/all/{since}

Identifying the changed entities is what we want to improve. That includes also tasks.

May 31 '22 13:05 StefanKock

Perfect! Are there already some concrete plans when can this one be expected to make its way to production?

Jun 01 '22 07:06 ClimedoHealthCodingTests

Discussed current status and next steps with @HolgerReiseVSys and @syntakker. Results:

Update feature branch to current development branch
@syntakker will start with performance tests on cases, contacts, immunizations, tasks, samples as prechecks
Integrate CaseService.getAllActiveCasesAfter as all the other sibling methods.
Refactor BaseAdoService.getAll to getList to be reusable with other filters, make it protected or leave public?
Indizes for all entities that implement AdoServiceWithUserFilter

Sep 06 '22 08:09 StefanKock

Helper used to create sort indices: IndexBuilderHelper.xlsx

Sep 07 '22 15:09 StefanKock

Preliminary analysis of performance before and after the changes introduced in PR #10282

(tested on dafd47df2f756277323e15d28a3c6f678b1864e6)

The analysis was conducted by measuring the runtime of REST calls to the respective endpoints with batch size 1000. A substantial improvement can be observed for all entities with long response times:

cases (1000) .................00:05:28 -> 00:01:31
contacts (1000) ..............00:01:07 -> 00:00:30
immunizations (1000) .........00:01:16 -> 00:00:12
samples (1000) ...............00:01:41 -> 00:00:20
tasks (1000) .................00:01:06 -> 00:00:18
visits (1000) ................00:00:27 -> 00:00:04

An exception is pathogen test with no improvement / slightly worse performance:

pathogentests (1000) .........00:01:15 -> 00:01:26

The results based on the dataset of the initial analysis (see issue description). Further measurements will be performed as needed (e.g., retesting of individual endpoints, larger batch sizes etc.).

Analysis data: performance_9320.zip

Detailed analysis

Full list of REST calls Performance of REST calls (.../<entities>/all/) before the refacoring:

actions (1000) ...............00:00:00
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:00
cases (1000) .................00:05:28
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:01:07
continents (1000) ............00:00:00
countries (1000) .............00:00:01
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:07
events (1000) ................00:00:04
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:01:16
pathogentests (1000) .........00:01:15
persons (1000) ...............00:00:07
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:00
regions (1000) ...............00:00:00
samples (1000) ...............00:01:41
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:01:06
treatments (1000) ............00:00:00
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:27
weeklyreports (1000) .........00:00:01

... and after:

actions (1000) ...............00:00:01
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:01
cases (1000) .................00:01:31
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:00:30
continents (1000) ............00:00:00
countries (1000) .............00:00:00
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:05
events (1000) ................00:00:02
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:00:12
pathogentests (1000) .........00:01:26
persons (1000) ...............00:00:08
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:01
regions (1000) ...............00:00:00
samples (1000) ...............00:00:20
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:00:18
treatments (1000) ............00:00:01
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:04
weeklyreports (1000) .........00:00:00

Execution details for selected entities

Cases before:

after:

Here the calls to CaseService.inJurisdictionOrOwned per selected case have been dropped, roughly accounting for the measurde difference.

Contacts before:

after:

As for cases, a heavy reduction of calls to ContactService.inJurisdictionOrOwned.

Immunizations before:

after:

Here the main culprits were calls to ImmunizationService.inJurisdictionOrOwned and FeatureConfigurationFacadeEjb.isPropertyValueTrue

Tasks before:

after:

Here, calls to TaskService.inJurisdictionOrOwned on a per object basis accounted for most of the time spent.

Sep 14 '22 04:09 syntakker

Changes on performance tooling still open

Sep 21 '22 08:09 StefanKock