Aggregated endpoint is slower than expected
When the CoV-Spectrum landing page is opened, the website sends 19 requests to the /sample/aggregated endpoint. In every case, the client has to wait more than 100ms for the server response, sometimes even over 700ms. According to @Taepper, SILO should only require a few ms.
What is causing the delay? Is it the proxying through the two servers? Is it LAPIS?
(My ping to lapis.cov-spectrum.org is 33ms.)
Dev console screenshots
Request IDs:
- ff985605-1a26-4670-8aa7-2f04ca868d17 (790ms)
- 55a8cc02-fafe-4541-8e1b-609622f41402 (600ms)
- 5a5b4c0f-9f15-4cab-b921-be5c68b7d812 (125ms)
What we found so far:
- We need to save the server logs to a file (we could not find your request IDs in the docker logs), see #692
- SILO takes only a few ms to answer the call from LAPIS (taken from the LAPIS logs)
- LAPIS itself also takes only a few ms internally (taken from the LAPIS logs)
- That measurement only covers the time between the request reaching our code and the response leaving it again. We don't know how long Spring takes.
Next possible steps:
- check request on test server
- look into nginx logs
- check whether we find the same timing of the requests
- reintroduce dateToSortBy
- already started preprocessing on the test server
- if it works, add this to the production config
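For the "look into nginx logs" step, here is a minimal sketch of how the timings for one request could be pulled out of an access log. It assumes a hypothetical `log_format` whose lines end in `rid=... rt=$request_time uct=$upstream_connect_time uht=$upstream_header_time urt=$upstream_response_time`. The nginx variables exist, but the field names (`rid`, `rt`, `uct`, `uht`, `urt`) and the idea that the request ID is logged at all are assumptions, not our current config. It also assumes a single upstream per request, so each field holds one value.

```python
import re

# Matches the hypothetical key=value fields at the end of each log line.
# \b prevents "rt=" from matching inside "urt=" or e.g. "port=".
FIELD = re.compile(r"\b(rid|rt|uct|uht|urt)=(\S+)")


def parse_timings(line):
    """Return (request_id, timings) for one access-log line.

    nginx logs these times in seconds, so they are converted to whole ms.
    Fields logged as "-" (no upstream was contacted) are skipped.
    """
    fields = dict(FIELD.findall(line))
    timings_ms = {
        key: round(float(value) * 1000)
        for key, value in fields.items()
        if key != "rid" and value != "-"
    }
    return fields.get("rid"), timings_ms


def find_request(lines, request_id):
    """Return the timings of the first line with the given request ID, or None."""
    for line in lines:
        rid, timings_ms = parse_timings(line)
        if rid == request_id:
            return timings_ms
    return None
```

Calling `find_request(open(path), request_id)` with one of the request IDs from the browser would then show whether nginx and the client see the same timing for that request.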
Reintroducing dateToSortBy would only lower the load on the server that SILO is running on.
If this contention is causing Spring to be slower, this might be viable. But given that both LAPIS and SILO only take a few ms, I wonder whether it would actually resolve the problem.
I did a bit of timing: see https://github.com/GenSpectrum/LAPIS/issues/683#issuecomment-1978809282
Here is what I get for a quite simple aggregated request. Note that the times for the server are actually in seconds and not milliseconds.
It shows that in this example, Nginx had to wait 210ms "between establishing a connection to an upstream server [LAPIS] and receiving the first byte of the response header".
(The difference between "waiting for server response" and the values from the server timing API is, I believe, well explained by the RTT between my laptop and the server.)
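To make that comparison less error-prone, the server timing values can be converted and compared in code. This is only a sketch: it parses a `Server-Timing` header value of the usual `name;dur=value` form, optionally treats `dur` as seconds (as our server currently reports it), and subtracts RTT and server time from the client-side wait. The metric names in the usage note are made up, and a `desc` containing commas or semicolons is not handled.

```python
def parse_server_timing(header, dur_in_seconds=False):
    """Parse a Server-Timing header value into {metric_name: duration_ms}.

    Entries look like `name;dur=12.3;desc="..."` and are comma-separated.
    Set dur_in_seconds=True if the server writes `dur` in seconds.
    Metrics without a `dur` parameter are ignored.
    """
    result = {}
    for entry in header.split(","):
        parts = [part.strip() for part in entry.split(";")]
        name = parts[0]
        if not name:
            continue
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "dur":
                dur = float(value.strip().strip('"'))
                result[name] = round(dur * 1000 if dur_in_seconds else dur, 3)
    return result


def unexplained_ms(client_wait_ms, rtt_ms, server_ms):
    """Client-side wait that neither the RTT nor the server-reported time explains."""
    return round(client_wait_ms - rtt_ms - server_ms, 3)
```

For example, `parse_server_timing("upstream;dur=0.210", dur_in_seconds=True)` yields the upstream time in ms, which can be passed to `unexplained_ms` together with the "waiting for server response" value from the dev console and the measured ping.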
A request to retrieve cached (amino acid mutations) results is much faster:
After activating the cache for aggregated queries in #695, the response time, when I directly connect to the s0 server, dropped down to basically the RTT:
If we connect to the web server/lapis.cov-spectrum.org, it needs 70-80ms longer:
For this ticket, the remaining task is to find out why the aggregated endpoint, when uncached, needs a few hundred ms if SILO+"our LAPIS code" only need a few ms.