nats-queue-worker

Queue Worker does not gracefully shut down

Open kevin-lindsay-1 opened this issue 4 years ago • 3 comments

In a previous conversation, @alexellis and I discussed some items related to the queue worker. One of them was to verify whether the queue worker shuts down gracefully or just abandons its work.

Expected Behaviour

The behavior we discussed and wanted is for the queue worker to attempt a graceful shutdown by:

  • stopping its subscription to new messages from nats
  • finishing all invocations it has already started, as normal
  • exiting with a status code of 0 once it is no longer working on any invocations

An example of this timing for a sleep function with the following config:

  • sleep duration of 30s
  • [x]_timeouts of 1m
  • queue worker with an ack_wait of 1m5s

We assume a kubernetes environment, or an environment with a similar orchestration layer and pattern, and we assume the event that triggers the pod's termination is a graceful shutdown command, such as a Node being drained for maintenance with its workloads rescheduled onto a different Node.

Expected events with rough timing. The values in the format [duration] are approximate offsets from the start of this example timeline:

  • async function is invoked via gateway and sent to nats [0s]
  • queue worker receives a message from nats subscription [0s]
  • queue worker invokes sleep function, which is configured to sleep for 30 seconds [0s]
  • queue worker receives SIGTERM (via drain); a new queue worker is scheduled to replace it [5s]
  • queue worker stops subscribing to new messages [5s]
  • new queue worker comes online, subscribes [7s]
  • function invocation completes [30s]
  • queue worker receives response and handles as normal [30s]
  • queue worker notices it has no more invocations to wait for, and exits with a status code of 0 [30s]
  • queue worker pod is removed [30s]

Current Behaviour

Currently the queue worker exits immediately; I don't even see a log such as "received SIGTERM" or the like. Once a queue worker comes back online, nats eventually sends the message again.

An example of this timing with the same settings and format as above, functional (non-timing) differences in bold italics:

  • async function is invoked via gateway and sent to nats [0s]
  • queue worker receives a message from nats subscription [0s]
  • queue worker invokes sleep function, which is configured to sleep for 30 seconds [0s]
  • queue worker receives SIGTERM (via drain); a new queue worker is scheduled to replace it [5s]
  • queue worker immediately exits [5s]
  • new queue worker comes online, subscribes [7s]
  • function invocation completes, but is not handled by anything [30s]
  • original invocation is considered a "miss" and resent by nats [1m5s]
  • new queue worker receives a message from nats [1m5s]
  • new queue worker invokes sleep function again, which is configured to sleep for 30 seconds [1m5s]
  • second function invocation completes [1m35s]
  • new queue worker receives response and handles as normal [1m35s]

The two major differences from the above:

  • 2 function invocations occurred
  • the overall time was extended by the ack_wait duration, meaning a process that should take 30s instead takes 1m35s (function duration + ack_wait duration)

Possible Solution

Steps to Reproduce (for bugs)

Context

We are interested in the timing of jobs, as well as in not duplicating function invocations. If graceful shutdown were implemented, we could expect certain invocations to no longer wait for the full ack_wait duration before the function is attempted again.

Your Environment

  • FaaS-CLI version ( Full output from: faas-cli version ): 0.13.13

  • Docker version docker version (e.g. Docker 17.0.05 ): 20.10.8

  • What version and distribution of Kubernetes are you using? kubectl version server v1.21.3 client v1.22.2

  • Operating System and version (e.g. Linux, Windows, MacOS): MacOS

  • Link to your project or a code example to reproduce issue:

  • What network driver are you using and what CIDR? i.e. Weave net / Flannel

kevin-lindsay-1 avatar Nov 02 '21 20:11 kevin-lindsay-1

Hi @kevin-lindsay-1 - we reviewed this on the office hours call, do you have steps for a repro please?

Steps to Reproduce (for bugs)

Alex

alexellis avatar Feb 16 '22 16:02 alexellis

I made this a while ago. What is there to repro? I don't think it gracefully shuts down as described at all, nor is it supposed to right now to my knowledge. So, unless the queue worker shouldn't exit when it's invoking a function, there is nothing to repro.

As far as I know, it doesn't gracefully shut down as described, full stop. I've been watching these queue workers for over a year now and I don't think it has ever once shown any kind of behavior that pointed towards graceful shutdown being implemented.

kevin-lindsay-1 avatar Feb 17 '22 19:02 kevin-lindsay-1

> What is there to repro? I don't think it gracefully shuts down as described at all, nor is it supposed to right now to my knowledge. So, unless the queue worker shouldn't exit when it's invoking a function, there is nothing to repro.

I'd say: something that proves that this behaviour is the case, and what impact a non-graceful shutdown might have?

What would the minimum useful setup be to demo its impact and suggest what benefits a graceful shutdown could bring?

alexellis avatar Jun 17 '22 20:06 alexellis