Apollo docker restarts on heavy load

Open roey-navina opened this issue 4 years ago • 2 comments

Description

Running multiple concurrent flows, each with many tasks, appears to cause a memory leak in the Apollo Docker container, which then restarts itself.

Example error inside the Apollo container:

<--- Last few GCs --->

[23:0x54a57c0]    80569 ms: Scavenge (reduce) 2043.9 (2050.3) -> 2043.2 (2051.3) MB, 2.0 / 0.0 ms  (average mu = 0.138, current mu = 0.003) allocation failure
[23:0x54a57c0]    80574 ms: Scavenge (reduce) 2044.1 (2050.3) -> 2043.4 (2051.3) MB, 2.0 / 0.0 ms  (average mu = 0.138, current mu = 0.003) allocation failure
[23:0x54a57c0]    80578 ms: Scavenge (reduce) 2044.3 (2050.3) -> 2043.6 (2051.6) MB, 2.0 / 0.0 ms  (average mu = 0.138, current mu = 0.003) allocation failure

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0xa03530 node::Abort() [node]
 2: 0x94e471 node::FatalError(char const*, char const*) [node]
 3: 0xb7773e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xb77ab7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xd32345 [node]
 6: 0xd32ecf [node]
 7: 0xd40f5b v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 8: 0xd44b1c v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
 9: 0xd131fb v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
10: 0x105919f v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [node]
11: 0x13ff179 [node]
Aborted (core dumped)
npm ERR! code ELIFECYCLE
npm ERR! errno 134
npm ERR! @ serve: node dist/index.js
npm ERR! Exit status 134
npm ERR!
npm ERR! Failed at the @ serve script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2021-06-22T15_25_42_163Z-debug.log
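As a reading aid for the trace above: each Scavenge entry reports the heap in use before and after the collection, with the total committed heap in parentheses. The following is a minimal Python sketch, assuming the log keeps this V8 trace format, that computes how much each collection freed. The names `GC_LINE` and `freed_per_gc` are illustrative and not part of any Prefect or Node.js tooling.

```python
import re
from typing import List

# Matches V8 GC trace entries such as
#   "Scavenge (reduce) 2043.9 (2050.3) -> 2043.2 (2051.3) MB"
# Assumption: only the "Scavenge" and "Mark-sweep" entry kinds are handled;
# other V8 versions may label collections differently.
GC_LINE = re.compile(
    r"(?:Scavenge|Mark-sweep)(?: \(reduce\))? "
    r"(?P<before>\d+(?:\.\d+)?) \(\d+(?:\.\d+)?\) -> "
    r"(?P<after>\d+(?:\.\d+)?) \(\d+(?:\.\d+)?\) MB"
)


def freed_per_gc(log: str) -> List[float]:
    """Return the MB of heap freed by each GC entry found in the log text."""
    return [
        round(float(m["before"]) - float(m["after"]), 1)
        for m in GC_LINE.finditer(log)
    ]


# The three GC entries from the report, shortened to the fields the regex reads.
SAMPLE = (
    "80569 ms: Scavenge (reduce) 2043.9 (2050.3) -> 2043.2 (2051.3) MB, 2.0 / 0.0 ms "
    "80574 ms: Scavenge (reduce) 2044.1 (2050.3) -> 2043.4 (2051.3) MB, 2.0 / 0.0 ms "
    "80578 ms: Scavenge (reduce) 2044.3 (2050.3) -> 2043.6 (2051.6) MB, 2.0 / 0.0 ms"
)

print(freed_per_gc(SAMPLE))  # [0.7, 0.7, 0.7]
```

Each collection frees about 0.7 MB of a roughly 2 GB heap, which is consistent with the "Ineffective mark-compacts near heap limit" message that follows in the trace.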

Expected Behavior

The Apollo container should not fail when running a reasonable number of concurrent flows.

Reproduction

Run a stress test against Prefect Server with 100 flows, each with 100 tasks.

Environment

Prefect version 0.14.22

roey-navina, Jun 28 '21 12:06

Hi @roey-navina -- could you clarify what you're talking about with telemetry? Where are you running your flow runs? This looks like your concurrent flow runs were sending enough data that they consumed all the memory available to your container. Prefect Server is not intended to scale to hundreds of concurrent flow runs, but if you want it to, you'll likely need to allocate additional memory for the stack.

zanieb, Jun 28 '21 14:06

Increasing the memory for the Apollo server (using the Node.js --max-old-space-size option) stopped the restarts, but the server now throws this error:

PayloadTooLargeError: request entity too large
    at readStream (/apollo/node_modules/raw-body/index.js:155:17)
    at getRawBody (/apollo/node_modules/raw-body/index.js:108:12)
    at read (/apollo/node_modules/body-parser/lib/read.js:77:3)
    at jsonParser (/apollo/node_modules/body-parser/lib/types/json.js:135:5)
    at Layer.handle [as handle_request] (/apollo/node_modules/express/lib/router/layer.js:95:5)
    at trim_prefix (/apollo/node_modules/express/lib/router/index.js:317:13)
    at /apollo/node_modules/express/lib/router/index.js:284:7
    at Function.process_params (/apollo/node_modules/express/lib/router/index.js:335:12)
    at next (/apollo/node_modules/express/lib/router/index.js:275:10)
    at cors (/apollo/node_modules/cors/lib/index.js:188:7)

Besides this limitation:

"Prefect Server is not intended to scale to hundreds of concurrent flow runs"

What other limits does Prefect Server have? Is there any documentation on them? Is there a more precise upper limit that we can rely on than "hundreds"? And what payload is too large for the Apollo server to handle?

roey-navina, Jun 28 '21 19:06
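On the payload question: the stack trace shows the error coming from raw-body via body-parser's JSON parser, which rejects any request body larger than its configured limit. body-parser's documented default is 100kb; the limit that Prefect's Apollo container actually applies is not stated in this thread, so the default below is an assumption. The following Python sketch estimates the size of a GraphQL-over-HTTP request body so it can be compared against whatever limit is in effect. The function names and the example query are hypothetical.

```python
import json

# body-parser's documented default limit ("100kb" = 100 * 1024 bytes).
# Assumption: the real limit in the Apollo container may differ; pass it
# explicitly via limit_bytes if you know it.
DEFAULT_LIMIT_BYTES = 100 * 1024


def body_size_bytes(query: str, variables: dict) -> int:
    """Return the UTF-8 size of the JSON body for a GraphQL HTTP request."""
    body = json.dumps({"query": query, "variables": variables})
    return len(body.encode("utf-8"))


def exceeds_limit(query: str, variables: dict,
                  limit_bytes: int = DEFAULT_LIMIT_BYTES) -> bool:
    """True if the request body would be larger than the given limit."""
    return body_size_bytes(query, variables) > limit_bytes


if __name__ == "__main__":
    # Hypothetical mutation and payload, for illustration only.
    query = "mutation($input: JSON!) { example_mutation(input: $input) }"
    small = {"input": {"message": "ok"}}
    large = {"input": {"message": "x" * 200_000}}
    print(exceeds_limit(query, small))
    print(exceeds_limit(query, large))
```

Logging `body_size_bytes` on the client side for the largest requests sent during the stress test would show which calls come close to the limit.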