Apollo docker restarts on heavy load
Description
Running multiple concurrent flows, each with many tasks, appears to cause a memory leak in the Apollo Docker container: the Node.js process runs out of heap memory and the container restarts itself.
Example error inside the apollo docker:
<--- Last few GCs --->
[23:0x54a57c0] 80569 ms: Scavenge (reduce) 2043.9 (2050.3) -> 2043.2 (2051.3) MB, 2.0 / 0.0 ms (average mu = 0.138, current mu = 0.003) allocation failure
[23:0x54a57c0] 80574 ms: Scavenge (reduce) 2044.1 (2050.3) -> 2043.4 (2051.3) MB, 2.0 / 0.0 ms (average mu = 0.138, current mu = 0.003) allocation failure
[23:0x54a57c0] 80578 ms: Scavenge (reduce) 2044.3 (2050.3) -> 2043.6 (2051.6) MB, 2.0 / 0.0 ms (average mu = 0.138, current mu = 0.003) allocation failure
<--- JS stacktrace --->
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0xa03530 node::Abort() [node]
 2: 0x94e471 node::FatalError(char const*, char const*) [node]
 3: 0xb7773e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xb77ab7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xd32345 [node]
 6: 0xd32ecf [node]
 7: 0xd40f5b v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 8: 0xd44b1c v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
 9: 0xd131fb v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
10: 0x105919f v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [node]
11: 0x13ff179 [node]
Aborted (core dumped)
npm ERR! code ELIFECYCLE
npm ERR! errno 134
npm ERR! @ serve: node dist/index.js
npm ERR! Exit status 134
npm ERR!
npm ERR! Failed at the @ serve script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2021-06-22T15_25_42_163Z-debug.log
Expected Behavior
The Apollo container should not fail when running a reasonable number of concurrent flows.
Reproduction
Run a stress test on the prefect server with 100 flows, each flow with 100 tasks.
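The stress test could look roughly like the sketch below. It is an illustration, not the script that was actually run: the project name, the flow naming scheme, and the no-op task are all hypothetical, and it assumes a Prefect 0.14.x Server backend plus a running agent to pick up the flow runs. It has not been verified against a live server, and create_project may fail if the project already exists.

```python
def flow_names(n_flows, prefix="stress-flow"):
    """Return hypothetical, zero-padded names for the generated flows."""
    return [f"{prefix}-{i:03d}" for i in range(n_flows)]


def run_stress_test(n_flows=100, n_tasks=100, project="stress-test"):
    """Register n_flows flows of n_tasks tasks each and start one run per flow."""
    # Imported lazily so flow_names() can be used without Prefect installed.
    from prefect import Client, Flow, task

    @task
    def noop(i):
        # Trivial task: the goal is to generate state updates, not to do work.
        return i

    client = Client()
    # Assumption: the project does not exist yet on the server.
    client.create_project(project_name=project)

    run_ids = []
    for name in flow_names(n_flows):
        with Flow(name) as flow:
            for i in range(n_tasks):
                noop(i)
        flow_id = flow.register(project_name=project)
        run_ids.append(client.create_flow_run(flow_id=flow_id))
    return run_ids


if __name__ == "__main__":
    started = run_stress_test()
    print(f"started {len(started)} flow runs")
```

With the defaults this schedules 100 flow runs of 100 tasks each, which matches the numbers above.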
Environment
Prefect version 0.14.22
Hi @roey-navina -- could you clarify what you're talking about with telemetry? Where are you running your flow runs? This looks like your concurrent flow runs were sending enough data that they consumed all the memory available to your container. Prefect Server is not intended to scale to hundreds of concurrent flow runs, but if you want it to you'll likely need to allocate additional memory for the stack.
Increasing the memory for the Apollo server (using the Node.js --max-old-space-size option) stopped it from restarting, but it now throws this error instead:
PayloadTooLargeError: request entity too large
    at readStream (/apollo/node_modules/raw-body/index.js:155:17)
    at getRawBody (/apollo/node_modules/raw-body/index.js:108:12)
    at read (/apollo/node_modules/body-parser/lib/read.js:77:3)
    at jsonParser (/apollo/node_modules/body-parser/lib/types/json.js:135:5)
    at Layer.handle [as handle_request] (/apollo/node_modules/express/lib/router/layer.js:95:5)
    at trim_prefix (/apollo/node_modules/express/lib/router/index.js:317:13)
    at /apollo/node_modules/express/lib/router/index.js:284:7
    at Function.process_params (/apollo/node_modules/express/lib/router/index.js:335:12)
    at next (/apollo/node_modules/express/lib/router/index.js:275:10)
    at cors (/apollo/node_modules/cors/lib/index.js:188:7)
Besides this limitation:
Prefect Server is not intended to scale to hundreds of concurrent flow runs
What other limits does Prefect Server have? Is there any documentation on them? Is there a more accurate upper limit that we can rely on than "hundreds"? And which payload is too large for the Apollo server to handle?
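On the payload question: the stack trace shows the error is raised by body-parser's JSON parser, whose documented default body limit is 100kb unless the application overrides it. Whether the Apollo service sets a different limit is not established in this thread, so the limit in the sketch below is a parameter and the default is only an assumption. The helper names are hypothetical, and the size is an estimate, since the real client may serialize the request slightly differently.

```python
import json

# body-parser's documented default limit ("100kb" = 100 * 1024 bytes).
# Assumption: the Apollo service does not override it.
DEFAULT_LIMIT_BYTES = 100 * 1024


def payload_size_bytes(query, variables=None):
    """Estimate the size of a GraphQL request body encoded as UTF-8 JSON."""
    body = {"query": query, "variables": variables or {}}
    return len(json.dumps(body).encode("utf-8"))


def exceeds_limit(query, variables=None, limit_bytes=DEFAULT_LIMIT_BYTES):
    """Return True if the estimated body is strictly larger than the limit."""
    return payload_size_bytes(query, variables) > limit_bytes


if __name__ == "__main__":
    # Hypothetical mutation carrying a large serialized state as a variable.
    mutation = "mutation($state: JSON!) { set_task_run_states(input: $state) { id } }"
    big_state = {"state": {"result": "x" * 200_000}}
    print(payload_size_bytes(mutation, big_state), exceeds_limit(mutation, big_state))
```

Logging this estimate on the client side for the largest requests would show which payloads cross the limit and by how much.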