
[BUG] Publishing large event data from scheduled highstate to master via ipc socket hangs and breaks minion in 3005.1 (and 3007.0)

Open rjel2159 opened this issue 1 year ago • 5 comments

Description

When sending large event data from a scheduled highstate back to the master over the IPC socket, the salt-minion child process becomes defunct and we see over 1600 connections to the master (port 4506) in TIME_WAIT. Running salt-minion with trace logging, we see the following before the hang:

[DEBUG   ] SaltEvent PUB socket URI: /local/opt/configmgt/salt/var/run/salt/minion/minion_event_e4a410a83f_pub.ipc
[DEBUG   ] SaltEvent PULL socket URI: /local/opt/configmgt/salt/var/run/salt/minion/minion_event_e4a410a83f_pull.ipc
[TRACE   ] IPCClient: Connecting to socket: /local/opt/configmgt/salt/var/run/salt/minion/minion_event_e4a410a83f_pull.ipc
[DEBUG   ] Sending event: tag = __master_req_channel_payload; data = {'cmd': '_return', 'id': 'minion1234', 'fun': 'highstate.run', 'fun_args': ['minion_start'], 'schedule': 'highstate__minion_start', 'jid': '20240508161342306002', 'pid': 10847, 'return': {'highstate_type': 'minion_start', 'salt_boot_done': 'Tuesday February 01 2022 14:53', 'start_time': '2024-05-08T16:13:42.411586', 'file_|-bb.patch ...'retcode': 0, 'success': True, '_stamp': '2024-05-08T16:14:12.159598', 'out': 'highstate'}
<hang>
^C
[TRACE   ] Waiting to kill process manager children
[DEBUG   ] Closing IPCMessageClient instance
[DEBUG   ] Closing IPCMessageSubscriber instance
[WARNING ] Minion received a SIGINT. Exiting.
[INFO    ] Shutting down the Salt Minion
[TRACE   ] Processing <function DaemonMixIn._mixin_before_exit at 0x7f3998635280>
[TRACE   ] Processing <function LogLevelMixIn.__shutdown_logging at 0x7f3998630ca0>
The Salt Minion is shutdown. Minion received a SIGINT. Exited.
The minion failed to return the job information for job 20240520183520712612. This is often due to the master being shut down or overloaded. If the master is running, consider increasing the worker_threads value.
Future <salt.ext.tornado.concurrent.Future object at 0x7f3988a36760> exception was never retrieved: Traceback (most recent call last):
  File "/local/opt/saltcrystal/lib/python3.9/site-packages/salt/ext/tornado/gen.py", line 309, in wrapper
    yielded = next(result)
  File "/local/opt/saltcrystal/lib/python3.9/site-packages/salt/minion.py", line 2921, in handle_event
    self._return_pub(data, ret_cmd="_return", sync=False)
  File "/local/opt/saltcrystal/lib/python3.9/site-packages/salt/minion.py", line 2263, in _return_pub
    log.trace("ret_val = %s", ret_val)  # pylint: disable=no-member
UnboundLocalError: local variable 'ret_val' referenced before assignment
^CMinion received a SIGINT. Exiting.
The Salt Minion is shutdown. Minion received a SIGINT. Exited.
^CMinion received a SIGINT. Exiting.
The Salt Minion is shutdown. Minion received a SIGINT. Exited.
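
The traceback pins the crash on a classic Python pitfall: log.trace("ret_val = %s", ret_val) runs on a code path where ret_val was never assigned. A minimal illustrative sketch (the function and branch names below are ours, not Salt's actual _return_pub control flow):

```python
# Illustrative sketch only -- names are hypothetical, not Salt's code.
def return_pub_sketch(sync: bool) -> str:
    if sync:
        ret_val = "ack"  # only this branch binds ret_val
    # When sync is False, nothing has bound ret_val, so the reference
    # below raises UnboundLocalError -- the same failure as the traceback.
    return "ret_val = %s" % ret_val
```

Note that the UnboundLocalError is a secondary symptom: the interesting part is whichever earlier step left the function without a bound ret_val.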

And the process's Thread 1 is stuck on a lock in do_futex_wait.constprop.1():

[root@d400241-080 gs.d]# pstack 4034|grep -A10 ^"Thread 1"
Thread 1 (Thread 0x7fba8285d740 (LWP 4034)):
#0  0x00007fba82220b3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007fba82220bcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fba82220c6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00000000005413a9 in PyThread_acquire_lock_timed (lock=lock@entry=0x7fba50000dc0, microseconds=microseconds@entry=-1000000, intr_flag=intr_flag@entry=1) at Python/thread_pthread.h:483
#4  0x000000000059c614 in acquire_timed (timeout=-1000000000, lock=0x7fba50000dc0) at ./Modules/_threadmodule.c:63
#5  lock_PyThread_acquire_lock (self=0x7fba60ec0ea0, args=<optimized out>, kwds=<optimized out>) at ./Modules/_threadmodule.c:146
#6  0x00000000005dd63b in method_vectorcall_VARARGS_KEYWORDS (func=0x7fba82806540, args=0x7fba61f331d0, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/descrobject.c:348
#7  0x0000000000425303 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=<optimized out>, tstate=<optimized out>) at ./Include/cpython/abstract.h:118
#8  PyObject_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=<optimized out>) at ./Include/cpython/abstract.h:127
#9  trace_call_function (kwnames=<optimized out>, nargs=<optimized out>, args=<optimized out>, func=<optimized out>, tstate=<optimized out>) at Python/ceval.c:5058
[root@d400241-080 gs.d]# pstack 4034|grep -A3 ^"Thread"
Thread 7 (Thread 0x7fba70cbe700 (LWP 4036)):
#0  0x00007fba81831b43 in select () from /lib64/libc.so.6
#1  0x0000000000599e02 in pysleep (secs=<optimized out>) at ./Modules/timemodule.c:2036
#2  time_sleep (self=<optimized out>, obj=<optimized out>) at ./Modules/timemodule.c:365
--
Thread 6 (Thread 0x7fba61723700 (LWP 9589)):
#0  0x00007fba8183b0e3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fba72f4125f in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
#2  0x00007fba72f608a9 in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
--
Thread 5 (Thread 0x7fba61f24700 (LWP 9590)):
#0  0x00007fba8183b0e3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fba72f4125f in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
#2  0x00007fba72f608a9 in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
--
Thread 4 (Thread 0x7fba60d22700 (LWP 9602)):
#0  0x00007fba8183b0e3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fba72f4125f in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
#2  0x00007fba72f608a9 in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
--
Thread 3 (Thread 0x7fba5bfff700 (LWP 9603)):
#0  0x00007fba8183b0e3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fba72f4125f in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
#2  0x00007fba72f608a9 in ?? () from /local/opt/saltcrystal/lib/python3.9/site-packages/zmq/backend/cython/../../../pyzmq.libs/libzmq-f3e05bef.so.5.2.3
--
Thread 2 (Thread 0x7fba5affd700 (LWP 16840)):
#0  0x00007fba8183b0e3 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fba7a14d1ea in select_epoll_poll_impl (self=0x7fba687eca10, maxevents=1023, timeout_obj=<optimized out>) at /builds/cme/saltcrystal/build/downloads/Python-3.9.16/Modules/selectmodule.c:1613
#2  select_epoll_poll (self=0x7fba687eca10, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at /builds/cme/saltcrystal/build/downloads/Python-3.9.16/Modules/clinic/selectmodule.c.h:871
--
Thread 1 (Thread 0x7fba8285d740 (LWP 4034)):
#0  0x00007fba82220b3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007fba82220bcf in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fba82220c6b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
[root@d400241-080 gs.d]#
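
The sem_wait frames above mean the main thread is parked on a lock acquisition with no timeout. A tiny sketch of that shape (purely illustrative; a timeout is used here so the example terminates, whereas the hung minion waits forever):

```python
import threading

# Purely illustrative: the main thread blocks acquiring a lock that a
# stuck worker never releases, just like the sem_wait in Thread 1 above.
lock = threading.Lock()
lock.acquire()                        # the "worker" holds the lock forever
acquired = lock.acquire(timeout=0.1)  # main thread gives up after 100 ms
```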

Strace output before the hang:

21397 write(2, "[TRACE   ] IPCClient: Connecting"..., 123) = 123
21397 connect(55, {sa_family=AF_UNIX, sun_path="/local/opt/configmgt/salt/var/run/salt/minion/minion_event_e4a410a83f_pull.ipc"}, 81) = 0
21397 epoll_ctl(51, EPOLL_CTL_ADD, 55, {EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=55, u64=19772122465304631}}) = 0
21397 epoll_wait(51, [{EPOLLIN, {u32=53, u64=19772122465304629}}, {EPOLLOUT, {u32=55, u64=19772122465304631}}], 1023, 3600000) = 2
21397 getsockopt(55, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
21397 epoll_ctl(51, EPOLL_CTL_MOD, 55, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=55, u64=532575944759}}) = 0
21397 lseek(53, 0, SEEK_CUR)            = -1 ESPIPE (Illegal seek)
21397 fstat(53, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
21397 read(53, "x", 8192)               = 1
21397 read(53, 0x7f3970001041, 8191)    = -1 EAGAIN (Resource temporarily unavailable)
21397 lseek(53, 0, SEEK_CUR)            = -1 ESPIPE (Illegal seek)
21397 fstat(53, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
21397 read(53, 0x7f3970001070, 8192)    = -1 EAGAIN (Resource temporarily unavailable)
21397 epoll_wait(51, [], 1023, 0)       = 0
21397 write(54, "x", 1)                 = 1
21397 futex(0x7f3970000dc0, FUTEX_WAKE_PRIVATE, 1) = 1
10157 <... futex resumed>)              = 0
21397 madvise(0x7f397affe000, 8368128, MADV_DONTNEED) = 0
21397 exit(0)                           = ?
10157 brk(NULL <unfinished ...>
21397 +++ exited with 0 +++
10157 <... brk resumed>)                = 0x24ad000
10157 brk(0x25ae000)                    = 0x25ae000
10157 brk(NULL)                         = 0x25ae000
10157 brk(0x2637000)                    = 0x2637000
10157 brk(NULL)                         = 0x2637000
10157 brk(0x26f1000)                    = 0x26f1000
10157 brk(NULL)                         = 0x26f1000
10157 brk(0x2786000)                    = 0x2786000
10157 brk(NULL)                         = 0x2786000
10157 brk(0x27f5000)                    = 0x27f5000
10157 brk(NULL)                         = 0x27f5000
10157 brk(0x288a000)                    = 0x288a000
10157 write(2, "[DEBUG   ] Sending event: tag = "..., 609070) = 609070
10157 clone( <unfinished ...>
21412 set_robust_list(0x7f397b7fe9e0, 24 <unfinished ...>
10157 <... clone resumed>child_stack=0x7f397b7fdfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f397b7fe9d0, tls=0x7f397b7fe700, child_tidptr=0x7f397b7fe9d0) = 21412
21412 <... set_robust_list resumed>)    = 0
21412 futex(0x999d24, FUTEX_WAIT_BITSET_PRIVATE, 123, {tv_sec=6453115, tv_nsec=359445641}, 0xffffffff <unfinished ...>
10157 futex(0x999d24, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999d20, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1 <unfinished ...>
21412 <... futex resumed>)              = -1 EAGAIN (Resource temporarily unavailable)
10157 <... futex resumed>)              = 0
21412 futex(0x999d50, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
10157 futex(0x999d50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
21412 <... futex resumed>)              = -1 EAGAIN (Resource temporarily unavailable)
10157 <... futex resumed>)              = 0
21412 futex(0x999d50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
10157 futex(0x1fe5eb0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
21412 <... futex resumed>)              = 0
21412 gettid()                          = 21412
21412 futex(0x1fe5eb0, FUTEX_WAKE_PRIVATE, 1) = 1
10157 <... futex resumed>)              = 0
10157 futex(0x999d24, FUTEX_WAIT_BITSET_PRIVATE, 125, {tv_sec=6453115, tv_nsec=359673957}, 0xffffffff <unfinished ...>
21412 mprotect(0x7f3970101000, 4096, PROT_READ|PROT_WRITE) = 0
21412 mprotect(0x7f3970102000, 561152, PROT_READ|PROT_WRITE) = 0
21412 futex(0x999d24, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999d20, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
10157 <... futex resumed>)              = 0
21412 sendto(55, "\202\244head\200\244body\306\0\10\227E__master_req_ch"..., 563030, 0, NULL, 0 <unfinished ...>
10157 futex(0x999d50, FUTEX_WAKE_PRIVATE, 1) = 0
10157 futex(0x7f3970000dc0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
21412 <... sendto resumed>)             = 219264
21412 sendto(55, "mqsrejaegerqa,mqsreproberqa,mqsr"..., 343766, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
21412 epoll_ctl(51, EPOLL_CTL_MOD, 55, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=55, u64=532575944759}}) = 0
21412 epoll_wait(51, [{EPOLLIN, {u32=53, u64=19772122465304629}}], 1023, 3600000) = 1
21412 lseek(53, 0, SEEK_CUR)            = -1 ESPIPE (Illegal seek)
21412 fstat(53, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
21412 read(53, "x", 8192)               = 1
21412 read(53, 0x7f397008a9f1, 8191)    = -1 EAGAIN (Resource temporarily unavailable)
21412 lseek(53, 0, SEEK_CUR)            = -1 ESPIPE (Illegal seek)
21412 fstat(53, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
21412 read(53, 0x7f397008aa20, 8192)    = -1 EAGAIN (Resource temporarily unavailable)
21412 epoll_wait(51,  <unfinished ...>
10158 <... select resumed>)             = 0 (Timeout)
10158 getuid()                          = 0
10158 kill(10150, SIG_0)                = 0
10158 select(0, NULL, NULL, NULL, {tv_sec=5, tv_usec=0} <unfinished ...>
10477 <... epoll_wait resumed>[{EPOLLIN, {u32=2080378432, u64=139884870241856}}], 256, -1) = 1
10477 recvfrom(18, "\2\0\0\0\0\0\0\1\370\203\243enc\243aes\244load\332\0\340\307\217\226n\215\311"..., 8192, 0, NULL, NULL) = 513
10477 sendto(25, "\0", 1, 0, NULL, 0)   = 1
10477 epoll_wait(24,  <unfinished ...>
10158 <... select resumed>)             = 0 (Timeout)
10158 getuid()                          = 0
10158 kill(10150, SIG_0)                = 0
10158 select(0, NULL, NULL, NULL, {tv_sec=5, tv_usec=0} <unfinished ...>
10477 <... epoll_wait resumed>[{EPOLLIN, {u32=2080378432, u64=139884870241856}}], 256, -1) = 1
10477 recvfrom(18, "\2\0\0\0\0\0\0\1\370\203\243enc\243aes\244load\332\0\340\307\217\226n\215\311"..., 8192, 0, NULL, NULL) = 513
10477 sendto(25, "\0", 1, 0, NULL, 0)   = 1
10477 epoll_wait(24,  <unfinished ...>
10158 <... select resumed>)             = 0 (Timeout)
10158 getuid()                          = 0
10158 kill(10150, SIG_0)                = 0
10158 select(0, NULL, NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
10158 getuid()                          = 0
10158 kill(10150, SIG_0)                = 0
10158 select(0, NULL, NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
10158 getuid()                          = 0
10158 kill(10150, SIG_0)                = 0
10158 select(0, NULL, NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
10158 getuid()                          = 0
10158 kill(10150, SIG_0)                = 0
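
The key lines in the strace are the sendto() calls: the first write is accepted only partially (219264 of 563030 bytes), the retry fails with EAGAIN, and the thread then parks in epoll_wait waiting for the socket to drain. A self-contained sketch of that behavior (assuming default socket buffer sizes; this is not Salt code):

```python
import errno
import socket

# Non-blocking AF_UNIX stream sockets accept only what fits in the send
# buffer; the remainder fails with EAGAIN, as in the strace above.
# The peer (b) never reads, so the buffer fills up and stays full.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
a.setblocking(False)

payload = b"x" * 8_000_000  # far larger than any default send buffer
sent = 0
try:
    while sent < len(payload):
        sent += a.send(payload[sent:])  # partial writes until the buffer fills
    raised = False
except BlockingIOError as exc:
    raised = exc.errno == errno.EAGAIN  # kernel refuses further data
```

Once in this state, the writer can only make progress when the reading side drains the buffer; if the reader is gone or itself blocked, the epoll_wait never returns and the minion hangs.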

When the highstate event doesn't hang (i.e. doesn't contain a large amount of data), we see the following instead:

[DEBUG   ] SaltEvent PUB socket URI: /local/opt/configmgt/salt/var/run/salt/minion/minion_event_4ad9b01fe6_pub.ipc
[DEBUG   ] SaltEvent PULL socket URI: /local/opt/configmgt/salt/var/run/salt/minion/minion_event_4ad9b01fe6_pull.ipc
[TRACE   ] IPCClient: Connecting to socket: /local/opt/configmgt/salt/var/run/salt/minion/minion_event_4ad9b01fe6_pull.ipc
[DEBUG   ] Sending event: tag = __master_req_channel_payload; data = {'cmd': '_return', 'id': 'minion1234', 'fun': 'highstate.run', 'fun_args': ['minion_start'], 'schedule': 'highstate__minion_start', 'jid': '20240508160241269142', 'pid': 6287, 'return': {'highstate_type': 'minion_start', 'salt_boot_done': 'Tuesday December 26 2023 12:26', 'start_time': '2024-05-08T16:02:41.359254', 'file_|-bb.patch /local/opt/configmgt/scripts/md5checker.sh_|-/local/opt/configmgt/scripts/md5checker.sh_|-managed': {'changes': {}, 'comment': 'File /local/opt/configmgt/scripts/md5checker.sh is in the correct state', 'name': '/local/opt/configmgt/scripts/md5checker.sh', 'result': True, '__sls__': 'bb.patch.patch', ..., 'retcode': 0, 'success': True, '_stamp': '2024-05-08T16:03:01.830912', 'out': 'highstate'}
[TRACE   ] ret_val = <salt.ext.tornado.concurrent.Future object at 0x7f1128c10070>
[TRACE   ] IPCServer: Handling connection to address: <socket.socket fd=56, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0, laddr=/local/opt/configmgt/salt/var/run/salt/minion/minion_event_4ad9b01fe6_pull.ipc>
[DEBUG   ] Closing IPCMessageClient instance
[TRACE   ] Client disconnected from IPC /local/opt/configmgt/salt/var/run/salt/minion/minion_event_4ad9b01fe6_pull.ipc
[DEBUG   ] Minion of 'master1234' is handling event tag '__master_req_channel_payload'
[DEBUG   ] Minion return retry timer set to 60 seconds
[TRACE   ] ReqChannel send crypt load={'cmd': '_return', 'id': 'minion123', 'fun': 'highstate.run', 'fun_args': ['minion_start'], 'schedule': 'highstate__minion_start', 'jid': '20240508160241269142', 'pid': 6287, 'return': {'highstate_type': 'minion_start', 'salt_boot_done': 'Tuesday December 26 2023 12:26', 'start_time': '2024-05-08T16:02:41.359254', 'file_|-bb.patch ... 'retcode': 0, 'success': True, '_stamp': '2024-05-08T16:03:01.830912', 'out': 'highstate'}
[TRACE   ] ==== evaluating schedule now None =====
[DEBUG   ] Subprocess Schedule(name=highstate__minion_start, jid=20240508160241269142) cleaned up
[TRACE   ] ==== evaluating schedule now None =====

More debugging info from a minion running 3005.1 and 3007.0 is attached. Both of these versions hang the same way, but 3002.2 works just fine.

We see no issues when switching ipc_mode to 'tcp' or setting ipc_write_buffer to something below 1000 in the minion configuration.
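
For reference, the two workarounds would look like this in the minion config (values taken from this thread, not an official fix; ipc_mode: tcp is normally only the default on Windows, and ipc_write_buffer is disabled by default):

```yaml
# Choose one of the following in /etc/salt/minion (or minion.d/):

# Workaround A: use TCP instead of unix sockets for minion-local IPC
ipc_mode: tcp

# Workaround B: cap the IPC write buffer (any value below 1000
# reportedly avoids the hang)
ipc_write_buffer: 500
```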

Additional context

Might be related to #65940 (https://github.com/saltstack/salt/issues/65940).

rjel2159 avatar May 20 '24 17:05 rjel2159

I'd like to bump this as we are running into this issue as well.

tazaki avatar Jul 31 '24 16:07 tazaki

Same in 3006.4

gregorg avatar Aug 23 '24 08:08 gregorg

No issue with ipc_mode set to tcp. However, this setting is reserved for Windows machines, and this issue affects Linux machines.

gregorg avatar Aug 26 '24 07:08 gregorg

Same in 3006.9.

msciciel avatar Sep 10 '24 12:09 msciciel

@dwoz Not sure if your fix is also believed to resolve the issues pointed out in #65265, but after applying it to a 3006.9 minion in a multi-master environment, I am getting the following minion log entries when attempting to run a command from a master that is not receiving minion responses:

2025-02-26 17:01:18,800 [salt.minion      :2836][ERROR   ][5843] Timeout encountered while sending {'cmd': '_return', 'id': 'minion', 'success': True, 'return': True, 'retcode': 0, 'jid': '20250226170051099197', 'fun': 'test.ping', 'fun_args': [], 'user': 'myself', '_stamp': '2025-02-26T17:00:51.790631', 'nonce': '217c6adf337846388d6dffaa0730620d'} request. id=ec64997a-4025-45fc-9329-e64894bddf00

 2025-02-26 17:01:51,803 [salt.utils.process:999 ][ERROR   ][37937] An un-handled exception from the multiprocessing process 'ProcessPayload(jid=20250226170051099197)' was caught:
    Traceback (most recent call last):
      File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 994, in wrapped_run_func
        return run_func()
      File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1925, in _target
        run_func(minion_instance, opts, data)
      File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1919, in run_func
        return Minion._thread_return(minion_instance, opts, data)
      File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2155, in _thread_return
        minion_instance._return_pub(ret)
      File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2383, in _return_pub
        ret_val = self._send_req_sync(load, timeout=timeout)
      File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1648, in _send_req_sync
        return ret["ret"]
    TypeError: 'NoneType' object is not subscriptable
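
The TypeError has the same shape as this minimal sketch (hypothetical names, not Salt's actual _send_req_sync): the transport hands back None on timeout, and indexing None fails.

```python
# Hypothetical sketch, not Salt's code.
def send_req_sync_sketch(reply):
    # reply is expected to be a dict like {"ret": ...}; on a timeout the
    # channel returns None instead, and indexing it raises TypeError.
    return reply["ret"]

def send_req_sync_guarded(reply):
    # One defensive alternative: fail loudly with a clear timeout error.
    if reply is None:
        raise TimeoutError("no response from master")
    return reply["ret"]
```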

Also, before applying the patch it was possible to send a "service.restart salt-minion" command and have the minion reconnect to the master. This no longer seems to be the case.

Hope this helps...

vkotarov avatar Feb 26 '25 17:02 vkotarov