
ovs-workers and ovs-webapp-api not restarted after being killed via ovs-watcher-framework going down

Open dejonghb opened this issue 8 years ago • 11 comments

On several occasions, for some (yet unknown) reason, ovs-watcher-framework goes down (memcache?), which takes ovs-workers and ovs-webapp-api down with it.

After some retries, ovs-watcher-framework gets restarted (ovs-watcher-framework.service: Service hold-off time over, scheduling restart), but ovs-workers and ovs-webapp-api remain inactive.

Filtered syslog attached (hope there's still enough info in there): syslog-filtered.gz

dejonghb · Jul 14 '17 12:07

Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: State 'stop-sigterm' timed out. Killing.
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: Main process exited, code=killed, status=9/KILL
Jul 14 10:25:53 ftcmp02 systemd[1]: Stopped Open vStorage workers.
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: Unit entered failed state.
Jul 14 10:25:53 ftcmp02 systemd[1]: ovs-workers.service: Failed with result 'signal'.
Jul 14 10:25:53 ftcmp02 systemd[1]: Stopped Watcher for Open vStorage master services.
Jul 14 10:25:53 ftcmp02 systemd[1]: Starting Watcher for Open vStorage master services...
Jul 14 10:25:54 ftcmp02 systemd[1]: Started Watcher for Open vStorage master services.
Jul 14 10:25:54 ftcmp02 systemd[1]: Started Open vStorage volumerouter consumer.
Jul 14 10:28:49 ftcmp02 systemd[1]: Starting Cleanup of Temporary Directories...

dejonghb · Jul 14 '17 12:07

Maybe the watcher goes down because one service is missing (as designed) and takes all framework services down with it (as designed). However, based on the logging, it looks like the workers don't stop fast enough. Maybe that's why they don't start again once the watcher is restarted.

khenderick · Jul 14 '17 12:07

Jul 14 11:45:07 ftcmp02 systemd[1]: Stopped User Manager for UID 1001.
Jul 14 11:45:07 ftcmp02 systemd[1]: Removed slice User Slice of ovs.
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Main process exited, code=exited, status=1/FAILURE
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Unit entered failed state.
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Failed with result 'exit-code'.
Jul 14 11:56:52 ftcmp02 systemd[1]: Stopping Open vStorage volumerouter consumer...
Jul 14 11:56:52 ftcmp02 systemd[1]: Stopped Open vStorage volumerouter consumer.
Jul 14 11:56:52 ftcmp02 systemd[1]: ovs-watcher-framework.service: Service hold-off time over, scheduling restart.
Jul 14 11:56:52 ftcmp02 systemd[1]: Stopped Watcher for Open vStorage master services.
Jul 14 11:56:52 ftcmp02 systemd[1]: Starting Watcher for Open vStorage master services...
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Watcher for Open vStorage master services.
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Open vStorage scheduled tasks.
Jul 14 11:56:52 ftcmp02 systemd[1]: Starting Open vStorage workers...
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Open vStorage webapp API.
Jul 14 11:56:52 ftcmp02 systemd[1]: Started Open vStorage volumerouter consumer.
Jul 14 11:56:53 ftcmp02 systemd[1]: Started Open vStorage workers.

--> trying to restart workers

dejonghb · Jul 14 '17 12:07

After discussions with @dejonghb we came up with the following possible solution:

  • Before self-exiting with e.g. sys.exit(1), we touch a temp file, e.g. /tmp/watcher.
  • On pre-start:
    • If the file exists, we first wait (max 15 minutes) for all dependent services to be in a halted state. Otherwise, we just continue.
    • We wait for the master services to be up again.
    • Then the usual flow continues.

This should make sure that all dependent services are down before the watcher gets started again; services not being fully down at restart is the likely cause of the issue we see here.
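
A minimal sketch of how that could look in the watcher, assuming hypothetical names (MARKER, get_dependent_services, is_halted) and the 15-minute timeout proposed above; the real watcher code may differ:

import os
import sys
import time

MARKER = '/tmp/watcher'      # marker file touched right before the watcher exits itself
WAIT_TIMEOUT = 15 * 60       # wait at most 15 minutes for dependent services to halt


def mark_and_exit(exit_code=1):
    # Touch the marker file, then exit (instead of a bare sys.exit(1))
    open(MARKER, 'a').close()
    sys.exit(exit_code)


def pre_start(get_dependent_services, is_halted):
    # Pre-start hook: if the marker exists, first wait for all dependent
    # services (e.g. ovs-workers, ovs-webapp-api) to be halted
    if not os.path.exists(MARKER):
        return  # normal start, nothing to wait for
    deadline = time.time() + WAIT_TIMEOUT
    while time.time() < deadline:
        if all(is_halted(service) for service in get_dependent_services()):
            break
        time.sleep(1)
    # after this, the usual flow continues: wait for the master services, then start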

khenderick · Jul 14 '17 13:07

Also remove the temp file when startup succeeds ;-) Maybe also check timestamps on the tempfile so leftovers can be ignored?

dejonghb · Jul 14 '17 13:07

Makes sense, an automatic restart shouldn't take more than a few seconds.
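
That timestamp check could look something like this (a sketch; the 60-second threshold is an assumption derived from "a few seconds"):

import os
import time

MARKER = '/tmp/watcher'
MAX_MARKER_AGE = 60  # seconds; an automatic restart should only take a few


def marker_is_relevant():
    # Return True if the marker exists and is recent enough to act on;
    # stale leftovers are removed and ignored
    try:
        age = time.time() - os.path.getmtime(MARKER)
    except OSError:
        return False  # no marker at all
    if age > MAX_MARKER_AGE:
        os.remove(MARKER)
        return False
    return True

# As suggested above, the marker should also be removed once startup succeeds.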

khenderick · Jul 14 '17 13:07

@dejonghb Doesn't this duplicate https://github.com/openvstorage/framework/issues/1568?

wimpers · Aug 17 '17 15:08

#1568 is indeed related, but this issue has more logging and info on what the problem could be, plus a workaround, so I guess #1568 can be closed.

dejonghb · Aug 18 '17 07:08

It could be that the exit code is actually 0, a clean shutdown. We should set Restart=always instead of Restart=on-failure so the services restart in all cases.
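
If we go that route, the change could be a systemd drop-in along these lines (a sketch; the file path is illustrative and the same override would apply to each dependent service):

# /etc/systemd/system/ovs-workers.service.d/override.conf
[Service]
Restart=always

After adding the drop-in, systemctl daemon-reload picks it up.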

JeffreyDevloo · May 14 '18 10:05

While testing I've noticed some behavior that might indicate this issue:

  • Bring watcher down
    • Brings down web-api & workers
    • web-api in deactivating state & workers in inactive state
  • Make sure watcher recovers
    • Workers were started
    • web-api is now inactive

Looking further into a fix.
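
For reference, those states can be checked with standard systemd tooling, e.g.:

systemctl is-active ovs-watcher-framework ovs-workers ovs-webapp-api
# or with full detail:
systemctl status ovs-watcher-framework ovs-workers ovs-webapp-api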

JeffreyDevloo · Jul 31 '18 16:07

Test report

Setup

The testing setup is as follows:

test_main.service: a simple overseeing service which others can bind to (mimics the watcher)

[Unit]
Description=Application Control

[Service]
# Execute a dummy program that keeps running (stands in for the watcher's main process)
ExecStart=/bin/sleep infinity
Restart=on-failure

[Install]
# Components of this application should be started at boot time
WantedBy=multi-user.target

test_component_1.service: A simple component which binds to test_main (simulates either webapp-api or the workers). This service is controllable through lock files.

[Unit]
Description=Application Component 1
# When systemd stops or restarts the test_main.service, the action is propagated to this unit
BindsTo=test_main.service
# Start this unit after test_main.service starts
After=test_main.service

[Service]
# Controllable start: pretend the component is running until the start lock file appears, then exit with code 1
ExecStart=/bin/bash -c 'while [ ! -f /tmp/test_component_1.start.lock ]; do sleep 2; done; exit 1;'
# Controllable deactivating state: ExecStopPost blocks (keeping the unit deactivating) until the stop lock file appears
ExecStopPost=/bin/bash -c 'while [ ! -f /tmp/test_component_1.lock ]; do sleep 2; done'
# Do it with Python logic instead
# ExecStart=/usr/bin/python /tmp/test_component_1.py
# Restart the service on a non-zero exit code or when terminated by a signal other than SIGHUP, SIGINT, SIGTERM or SIGPIPE
Restart=on-failure


[Install]
# This unit should start when test_main.service is starting
WantedBy=test_main.service

test_component_2.service: A simple component which binds to test_main (simulates either webapp-api or the workers). This component will always follow the test_main state.

[Unit]
Description=Application Component 2
# When systemd stops or restarts the test_main.service, the action is propagated to this unit
BindsTo=test_main.service
# Start this unit after test_main.service starts
After=test_main.service

[Service]
# Pretend that the component is running
ExecStart=/bin/sleep infinity
# Restart the service on a non-zero exit code or when terminated by a signal other than SIGHUP, SIGINT, SIGTERM or SIGPIPE
Restart=on-failure

[Install]
# This unit should start when test_main.service is starting
WantedBy=test_main.service
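
The lock files make component 1 controllable from a shell; a sketch of how the setup could be driven (assuming the three unit files are installed under /etc/systemd/system):

# Load and start the test setup
sudo systemctl daemon-reload
sudo systemctl start test_main.service test_component_1.service test_component_2.service

# Component 1 keeps "running" as long as /tmp/test_component_1.start.lock is absent.
# Create it to make ExecStart exit with code 1 (a failing start):
touch /tmp/test_component_1.start.lock

# Component 1 stays in the deactivating state until the stop lock exists.
# Create it to let ExecStopPost finish:
touch /tmp/test_component_1.lock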

Sending signal 9

When signal 9 is sent to the watcher, both workers and webapp-api go down. When one of these is in the deactivating state and the watcher gets restarted manually, that service won't start anymore.

This is just an observation and has nothing to do with this ticket (in the ticket the watcher exited with code 1 by itself instead of being killed).
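
For reference, signal 9 can be sent to a unit's processes like this (shown against the test setup above):

sudo systemctl kill --signal=SIGKILL test_main.service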

Watcher failure

Just like in the ticket: the watcher service should exit with code 1. The Restart= setting should then schedule the service to restart.

All components start without failure

When all components start without failing, everything seems to be back in order.

Single component going into the deactivating state

The test_main service waits until the deactivating state is resolved before starting itself.

Component fails to start

If a component fails to start during the restart, that component will be stuck in a failed state while the test_main service is still running. This seems to be the state described in the ticket.

Hypothesis

  • Watcher died because a main service was not responding (exit 1)
    • Workers and webapp-api went down due to the BindsTo dependency
  • Watcher restarts because of the failure state
    • Workers try to restart and fail 5 times (the default retry limit is only 5)
    • Workers end up in a failed state
  • Watcher is restarted while the workers are still retrying, and eventually the workers go into a failed state

If this is the case, not much can be done in systemd. I'd suggest fixing the reason why the workers cannot start while the watcher can; the watcher checks whether all services are available for use.
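
For reference, the 5x retry behaviour mentioned above is systemd's start rate limiting; a sketch of the relevant settings with their defaults (recent systemd versions put them in the [Unit] section, older ones use StartLimitInterval/StartLimitBurst under [Service]):

[Unit]
# systemd defaults: at most 5 start attempts within 10 seconds,
# after which the unit stays in the failed state until reset
StartLimitIntervalSec=10
StartLimitBurst=5

[Service]
# Restart=always (suggested earlier) would also retry after a clean exit,
# but it is still subject to the start limit above
Restart=always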

JeffreyDevloo · Aug 01 '18 10:08