Improve performance of getAllAfter queries into DTOs [8]
Problem Description
As shown by the following analysis, many getAllAfter methods show an inperformant pattern:
- The
"Entity"Service.getAllAftermethod takes some seconds. As shown in https://github.com/hzi-braunschweig/SORMAS-Project/issues/8946#issuecomment-1129937176, this can be improved by initially fetching only the ids (reduced distinct effort) and using a dedicated index with appropriate sorting. -
"Entity"Service.inJurisdictionOrOwnedper each entity used for pseudonymization seems to be inperformant. For Cases it took ~330ms per entity, for Persons ~0,3ms per entity with in clause (PersonService.getInJurisdictionIDs).
Analysis
Dataset:
- 1225343 persons
- 85677 cases
- 114782 contacts
- 236583 tasks
- 2600890 immunizations
- 1991 vaccinations
- 15693 samples
- 11846 events
- 534 eventparticipants
The following measurements were taken from backend logs (EJB methods) and the Postgres logs. Observations:
- the actual SQL queries are executed in few seconds while the EJB methods take minutes to complete
- a relevant contribution to runtime comes from
inJurisdictionOrOwnedmethods, here mainly the number of calls - batch size is relevant. Some requests did not complete within reasonable time for a batch size of 10000, these were measured for batch size 1000.
Requests:
-
http://localhost:6080/sormas-rest/persons/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 3 min 5 sec server: 2 min 58 sec
SQL queries:
duration: 948.716 ms
duration: 2297.275 ms
duration: 4012.220 ms
duration: 4393.538 ms
duration: 696.277 ms
~ 12.5 sec
Note: SQL query times are inaccurate, either due to an error in the analysis or changes introduced later, see later analysis of this method. -
http://localhost:6080/sormas-rest/cases/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 8 min 21 sec server: 8 min 18 sec
SQL queries:
duration: 787.011 ms
duration: 449.616 ms
duration: 146.224 ms
duration: 178.515 ms
duration: 154.552 ms
duration: 136.836 ms
duration: 142.597 ms
duration: 143.688 ms
duration: 150.584 ms
duration: 152.298 ms
duration: 133.814 ms
duration: 147.687 ms
~ 3 sec -
http://localhost:6080/sormas-rest/contacts/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 35 sec server: 35 sec
SQL queries:
duration: 151.413 ms
duration: 143.237 ms
duration: 152.661 ms
duration: 144.054 ms
duration: 145.374 ms
duration: 155.861 ms
duration: 177.212 ms
duration: 153.647 ms
duration: 201.516 ms
duration: 187.382 ms
~ 1.6 sec -
http://localhost:6080/sormas-rest/tasks/all/1637090372005/1000/NO_LAST_SYNCED_UUID: 1 min 27 sec server: 1 min 24 sec
SQL queries:
duration: 466.097 ms
duration: 154.093 ms
duration: 132.150 ms
duration: 137.751 ms
duration: 140.757 ms
duration: 135.526 ms
duration: 136.612 ms
duration: 126.399 ms
duration: 124.695 ms
duration: 127.888 ms
duration: 133.940 ms
~1.9 sec -
http://localhost:6080/sormas-rest/samples/all/1637090372005/10000/NO_LAST_SYNCED_UUID: 27 sec server: 27 sec
SQL queries:
duration: 129.587 ms -
http://localhost:6080/sormas-rest/immunizations/all/1637090372005/1000/NO_LAST_SYNCED_UUID(1min 50 sec) server: 1 min 48 sec
SQL queries:
duration: 6985.128 ms
duration: 106.608 ms
~ 7 sec
Proposed Change
- [x] 1. Rewrite
AdoServiceWithUserFilter.getAllAfterto first fetch the needed ids (see pattern inPersonService.getAllAfter, then fetch the entities by id with IN-clause (useBaseAdoService.getByIds). - [x] 2. Add indices for sorting.
- [x] 3. Use the pattern of
PersonService.getInJurisdictionIDsto query by ids with IN clause also for other entities where"Entity"Service.inJurisdictionOrOwnedis currently running one query per Entity.
Acceptance Criteria
- [x] An analysis before/after shows the performance improvement (also mentioning the amount of existing and queried entities). -> comment
Implementation Details
-
All
getAllAfterandgetInJurisdictionIDsmethods avoid parameter limit exception withIterableHelper.executeBatchedbatching. -
Remove or adapt not with superclass aligned
getAllAfterimplementations in:- CampaignService.getAllAfter
- CampaignFormMetaService.getAllAfter
- ContactService.getAllAfter (active contacts filter)
- EventParticipantService.getAllAfter (not archived filter)
- EventService.getAllAfter (configured
EventUserFilterCriteria) - ImmunizationService.getAllAfter
- BaseTravelEntryService.getAllAfter
-
Remove or adapt not with superclass aligned implementations parallel to
getAllAfter:- CampaignFormDataService.getAllActiveAfter
- TreatmentService.getAllActiveTreatmentsAfter
- PathogenTestService.getAllActivePathogenTestsAfter
- PrescriptionService.getAllActivePrescriptionsAfter
- SampleService.getAllActiveSamplesAfter
- ActionService.getAllActionsAfter
- AdditionalTestService.getAllActiveAdditionalTestsAfter
- ClinicalVisitService.getAllActiveClinicalVisitsAfter
- TaskService.getAllActiveTasksAfter
- CaseService.getAllActiveCasesAfter with
boolean includeExtendedChangeDateFiltersthat is never used outside of tests (introduced with #2059, partly removed by #2674).
Additional Information
- Sibling to #8946
- Batch jurisdiction processing for Persons was done in #2130 (commit 305b169).
- I tried to use CREATE INDEX CONCURRENTLY but that does not work within a transaction as it is used in the automatic db migration. If deployments get stuck because of INDEX creation (not sure what the likelyhood is), they can be created with
CREATE INDEX CONCURRENTLY IF NOT EXISTS ...before on running instance. - This ticket improves the performance of the following functions:
- Sync for android-app (
getAllAftermethods) for most to all entities. - List (
getIndexList) for AdditionalTest, PathogenTest - List (
getByUuids,getBy"Reference"/"criteria") for Case, Sample, Task, AdditionalTest, PathogenTest, Prescription, Treatment, Vaccination, Visit.
- Sync for android-app (
Performance Analysis
Preliminary analysis of performance before and after the changes introduced in PR #10282
(tested on dafd47df2f756277323e15d28a3c6f678b1864e6)
The analysis was conducted by measuring the runtime of REST calls to the respective endpoints with batch size 1000. A substantial improvement can be observed for all entities with long response times:
cases (1000) .................00:05:28 -> 00:01:31
contacts (1000) ..............00:01:07 -> 00:00:30
immunizations (1000) .........00:01:16 -> 00:00:12
samples (1000) ...............00:01:41 -> 00:00:20
tasks (1000) .................00:01:06 -> 00:00:18
visits (1000) ................00:00:27 -> 00:00:04
An exception is pathogen test with no improvement / slightly worse performance:
pathogentests (1000) .........00:01:15 -> 00:01:26
The results based on the dataset of the initial analysis (see issue description). Further measurements will be performed as needed (e.g., retesting of individual endpoints, larger batch sizes etc.).
Analysis data: performance_9320.zip
Detailed analysis
Full list of REST calls
Performance of REST calls (.../<entities>/all/) before the refacoring:
actions (1000) ...............00:00:00
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:00
cases (1000) .................00:05:28
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:01:07
continents (1000) ............00:00:00
countries (1000) .............00:00:01
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:07
events (1000) ................00:00:04
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:01:16
pathogentests (1000) .........00:01:15
persons (1000) ...............00:00:07
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:00
regions (1000) ...............00:00:00
samples (1000) ...............00:01:41
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:01:06
treatments (1000) ............00:00:00
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:27
weeklyreports (1000) .........00:00:01
... and after:
actions (1000) ...............00:00:01
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:01
cases (1000) .................00:01:31
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:00:30
continents (1000) ............00:00:00
countries (1000) .............00:00:00
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:05
events (1000) ................00:00:02
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:00:12
pathogentests (1000) .........00:01:26
persons (1000) ...............00:00:08
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:01
regions (1000) ...............00:00:00
samples (1000) ...............00:00:20
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:00:18
treatments (1000) ............00:00:01
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:04
weeklyreports (1000) .........00:00:00
Execution details for selected entities
Cases
before:

after:

Here the calls to CaseService.inJurisdictionOrOwned per selected case have been dropped, roughly accounting for the measurde difference.
Contacts
before:

after:

As for cases, a heavy reduction of calls to ContactService.inJurisdictionOrOwned.
Immunizations
before:

after:

Here the main culprits were calls to ImmunizationService.inJurisdictionOrOwned and FeatureConfigurationFacadeEjb.isPropertyValueTrue
Tasks
before:

after:

Here, calls to TaskService.inJurisdictionOrOwned on a per object basis accounted for most of the time spent.
Analysis for different batch sizes
A analysis with different batch sizes after completion of this task shows:
- request durations are substantially improved (see times above)
- there is little overhead (long running SQL queries etc.) for small batch sizes: for batch size 1, no entity exceeds a response time of 1second
- for larger batch sizes, response time is approximately linear to the batch size
- this is consistent across all entities (response time for 1000
pathogentestsis down to 15 sec now)
Response times for selected entities:
cases
(1) ....................00:00:01
(10) ...................00:00:03
(100) ..................00:00:08
(500) ..................00:00:36
(1000) .................00:01:11
(5000) .................00:05:52
persons
(1) ..................00:00:01
(10) .................00:00:00
(100) ................00:00:01
(500) ................00:00:04
(1000) ...............00:00:07
(5000) ...............00:00:33
samples
(1) ..................00:00:01
(10) .................00:00:01
(100) ................00:00:02
(500) ................00:00:10
(1000) ...............00:00:17
(5000) ...............00:01:10
Analysis data: differentBatchSizes.zip
Hi there, I am not sure if this is related but since recently we are having issues to get all tasks created within the last 24 hours via /tasks/all/{since} No matter the size of the response e.g. even if only 1 task will be returned matching the time criteria. No response even for 5 minutes. The highest possible timespan that we got working on some instances is last 4 hours. Even that one takes a couple of minutes to respond. Let us know if you need more info around the issue and keep us posted about the outcome here.
Hi there, I am not sure if this is related but since recently we are having issues to get all tasks created within the last 24 hours via /tasks/all/{since}
Identifying the changed entities is what we want to improve. That includes also tasks.
Perfect! Are there already some concrete plans when can this one be expected to make its way to production?
Discussed current status and next steps with @HolgerReiseVSys and @syntakker. Results:
- Update feature branch to current development branch
- @syntakker will start with performance tests on cases, contacts, immunizations, tasks, samples as prechecks
- Integrate CaseService.getAllActiveCasesAfter as all the other sibling methods.
- Refactor
BaseAdoService.getAlltogetListto be reusable with other filters, make it protected or leave public? - Indizes for all entities that implement AdoServiceWithUserFilter
Helper used to create sort indices: IndexBuilderHelper.xlsx
Preliminary analysis of performance before and after the changes introduced in PR #10282
(tested on dafd47df2f756277323e15d28a3c6f678b1864e6)
The analysis was conducted by measuring the runtime of REST calls to the respective endpoints with batch size 1000. A substantial improvement can be observed for all entities with long response times:
cases (1000) .................00:05:28 -> 00:01:31
contacts (1000) ..............00:01:07 -> 00:00:30
immunizations (1000) .........00:01:16 -> 00:00:12
samples (1000) ...............00:01:41 -> 00:00:20
tasks (1000) .................00:01:06 -> 00:00:18
visits (1000) ................00:00:27 -> 00:00:04
An exception is pathogen test with no improvement / slightly worse performance:
pathogentests (1000) .........00:01:15 -> 00:01:26
The results based on the dataset of the initial analysis (see issue description). Further measurements will be performed as needed (e.g., retesting of individual endpoints, larger batch sizes etc.).
Analysis data: performance_9320.zip
Detailed analysis
Full list of REST calls
Performance of REST calls (.../<entities>/all/) before the refacoring:
actions (1000) ...............00:00:00
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:00
cases (1000) .................00:05:28
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:01:07
continents (1000) ............00:00:00
countries (1000) .............00:00:01
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:07
events (1000) ................00:00:04
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:01:16
pathogentests (1000) .........00:01:15
persons (1000) ...............00:00:07
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:00
regions (1000) ...............00:00:00
samples (1000) ...............00:01:41
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:01:06
treatments (1000) ............00:00:00
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:27
weeklyreports (1000) .........00:00:01
... and after:
actions (1000) ...............00:00:01
additionaltests (1000) .......00:00:01
aggregatereports (1000) ......00:00:00
areas (1000) .................00:00:00
campaigns (1000) .............00:00:01
cases (1000) .................00:01:31
clinicalvisits (1000) ........00:00:00
communities (1000) ...........00:00:00
contacts (1000) ..............00:00:30
continents (1000) ............00:00:00
countries (1000) .............00:00:00
customizableenumvalues (1000) 00:00:00
diseaseconfigurations (1000) .00:00:00
districts (1000) .............00:00:00
eventparticipants (1000) .....00:00:05
events (1000) ................00:00:02
featureconfigurations (1000) .00:00:00
immunizations (1000) .........00:00:12
pathogentests (1000) .........00:01:26
persons (1000) ...............00:00:08
pointsofentry (1000) .........00:00:00
prescriptions (1000) .........00:00:01
regions (1000) ...............00:00:00
samples (1000) ...............00:00:20
sharerequests (1000) .........00:00:00
subcontinents (1000) .........00:00:00
tasks (1000) .................00:00:18
treatments (1000) ............00:00:01
users (1000) .................00:00:00
userroles (1000) .............00:00:00
visits (1000) ................00:00:04
weeklyreports (1000) .........00:00:00
Execution details for selected entities
Cases
before:

after:

Here the calls to CaseService.inJurisdictionOrOwned per selected case have been dropped, roughly accounting for the measurde difference.
Contacts
before:

after:

As for cases, a heavy reduction of calls to ContactService.inJurisdictionOrOwned.
Immunizations
before:

after:

Here the main culprits were calls to ImmunizationService.inJurisdictionOrOwned and FeatureConfigurationFacadeEjb.isPropertyValueTrue
Tasks
before:

after:

Here, calls to TaskService.inJurisdictionOrOwned on a per object basis accounted for most of the time spent.
Changes on performance tooling still open