Signal handler can interrupt IO system call and cause errors
Affiliation SPC-EPFL
Version(s) Affected The MDSplus Version(s) affected, if any. e.g. Client Version: Alpha 7.148.0, Server Version: Stable 7.142.81
Platform(s) Ubuntu 24.04
Installation Method(s) Official MDSplus DEB repository
Describe the bug
When starting a new connection just after disconnecting one using one of the "Tunnel" protocols, the login function fails due to one system call being interrupted by the ChildSignalHandler from IoRoutinesTunnel.c.
To Reproduce
- Here is a minimal C program:
#include <stdio.h>
#include <unistd.h>
#include "ipdesc.h"
int main(void)
{
char *TCPuri = "tcvdata.epfl.ch";
char *SSHuri = "ssh://tcvdata.epfl.ch";
int TCPnsock;
int SSHnsock;
/* Connect to SSH server */
printf("Opening connection to SSH MDS server\n");
SSHnsock = ConnectToMds(SSHuri);
if (SSHnsock == -1)
printf("Could not open connection to SSH MDS server\n");
/* Disconnect SSH server */
printf("Closing connection to SSH MDS server\n");
DisconnectFromMds(SSHnsock);
/*
usleep(10000);
usleep(10000);
*/
/* Connect to TCP server */
printf("Opening connection to TCP MDS server\n");
TCPnsock = ConnectToMds(TCPuri);
if (TCPnsock == -1)
printf("Could not open connection to TCP MDS server\n");
}
- Compile using
gcc test.c -o test -I $MDSPLUS_DIR/include -L $MDSPLUS_DIR/lib -l MdsIpShr
- Run it
./test
- See error below:
Opening connection to SSH MDS server
Closing connection to SSH MDS server
Opening connection to TCP MDS server
E, 2759129:2759129, 1731410141.350875194: mdstcpip/mdsipshr/GetMdsMsg.c:63 get_bytes_to() Connection(id=-1, state=0x00, protocol='tcp', info_name='tcp', version=0, user='(null)') error 0/48: Interrupted system call
Error during login: recv: Interrupted system call
Could not open connection to TCP MDS server
Expected behavior The new connection should complete successfully.
Screenshots
Additional context
Uncommenting the usleep commands after the disconnect step in the program above allows the connection to complete.
Just as a comment: if i remember correctly. the Signal is issued in order to terminate the old connection. In order to keep DisconnectFromMds as fast call it is not waiting for the old connection to terminate but rather issues its termination. I think best to fix this would be to harden recv against interrupts by handling the interrupt error code. It should check the active state of teh connection and retry as long as the connection is claimed to be active. DisconnectFromMds would reset the active state of a connection before issuing the signal.
I think in the case of a standard disconnect ChildSignalHandler actually doesn't do much since the destruction of the connection has already started in CloseConnection which is called by DisconnectFromMds. It is truly useful when the tunnel is lost due to an unforeseen event either on the client or server side.
I agree that the best fix is to make the system calls more robust against signal interrupts. I noticed that the SA_RESTART flag was already used when registering ChildSignalHandler but apparently some systems call can't be restarted. This Unix StackExchange answer offers some explanation and possible handling scheme.
The restart just re-installs the handler once it was triggered, no? .. i think the handler is meant to catch SIGPIPE after connection close closes the fid used for the connection.
i think a good way to check if connection is still in use would be to use Connection::state with CON_INLIST as it is cleared in PopConnection
c->state &= CON_ACTIVITY; // clears INLIST
https://github.com/MDSplus/mdsplus/blob/80eced46cb1f180eaf870e7b2e35859db85a7f34/mdstcpip/mdsipshr/Connections.c#L70
that said .. maybe this is why it does not retry.. during teh connection task the state is not yet INLIST .. so one would have to add a dedicated flag like. CON_INIT that would tell recv to retry when connecting
SA_RESTART is meant to restart the system call, not reinstall the handler.
The issue is not when trying to close a connection that is inactive or dead, it's an interrupt issue on a system call for any other active connection.
The following patch solves the issue for the program above (but only addresses the one syscall that was affected, it should be generalized)
diff --git a/mdstcpip/io_routines/ioroutinesx.h b/mdstcpip/io_routines/ioroutinesx.h
index a01403f32..e04520860 100644
--- a/mdstcpip/io_routines/ioroutinesx.h
+++ b/mdstcpip/io_routines/ioroutinesx.h
@@ -419,7 +419,8 @@ static ssize_t io_recv_to(Connection *c, void *bptr, size_t num, int to_msec)
{
to = timeout;
rf = readfds;
- recved = select(sock + 1, &rf, NULL, NULL, &to);
+ while ((recved = select(sock + 1, &rf, NULL, NULL, &to)) == -1 && errno == EINTR)
+ continue;
#else
struct pollfd fd;
fd.fd = sock;
.. not sure how this affects SIGINT through user keyboard interrupt. .. What I take from your link
Because you may want to do something upon receiving a signal, and you cannot do much from a signal handler; anything using malloc() or stdio (even printf()) is out of the question. So you have to handle the interrupts in the main loop of the program, and to be able to do that, you should break somehow from a blocking read() (a read() can block even after a poll() has returned a fd as ready for reading).
The handler should mark the connection as closed which it does kind of in CloseConncetion (removing it from the list, but we need an extra flag for connect or set the INLIST before connect) recv etc should the only retry if the specific connection was not affected. if we need to distinguish between SIGINT and SIGPIPE/SIGCHLD here i think shoudl at least be tested . e.g. in tditest. connecting to a dummy server and Ctrl+C to see it the connect is still properly interrupted.
The issue is not in what the handler does. Again the connection is properly closed and this is all fine. The problem is because a handler is registered, when the signal is caught it interrupts the current system call (for another connection) which will fail and not be retried.
Actually it seems that some protocol do correctly handle interrupts. EINTR is correctly caught and retried in io_recv_to from ioroutines_pipes.h but not in io_recv_to from ioroutinestcp.h.