'device to tcp state' stuck on 'closing'
Sometimes our connections get stuck on closing state, as seen in showport on admin console. Seems that outside windows clients encountered this problem more often (they were also connecting via raw tcp socket and they were too lazy to fix), so we moved to localhost and remote machines connect using HTTP API -> python telnet . On localhost ubuntu machine the problem seems to be a lot more rare, but still appears.
When I connect to a port that is stuck in this fashion, I may get some initial output, but it is most likely from the buffer, as I get it every time I try to connect. Attempting to send input or generate more output from serial side does not result in output being sent to connections.
Ser2net is set up as a user daemon:
init.d config
$ cat /etc/init.d/ser2net
#! /bin/bash
#
# ser2net init script for ser2net
#
# Written by Miquel van Smoorenburg <[email protected]>.
# Modified for Debian GNU/Linux
# by Ian Murdock <[email protected]>.
# Modified for ser2net by Marc Haber <[email protected]>
### BEGIN INIT INFO
# Provides: ser2net
# Required-Start: $network $remote_fs
# Required-Stop: $network $remote_fs
# Should-Start:
# Should-Stop:
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: Allows network connections to serial ports
# Description: This daemon allows telnet and tcp sessions to be established with a unit's serial ports.
### END INIT INFO
set -e
if [ -r "/lib/lsb/init-functions" ]; then
. /lib/lsb/init-functions
else
echo "E: /lib/lsb/init-functions not found, lsb-base (>= 3.0-6) needed"
exit 1
fi
if [ -n "$SER2NETDEBUG" ]; then
echo "now debugging $0 $@"
set -x
fi
LANG=C
export LANG
PATH=/sbin:/bin:/usr/sbin:/usr/bin
DAEMON=/usr/sbin/ser2net
NAME=ser2net
DESC="Serial port to network proxy"
PIDFILE=/run/$NAME.pid
test -f $DAEMON || exit 0
# Defaults
CONFFILE="/etc/ser2net.yaml"
OPTIONS=""
CONTROLPORT=""
# Read config file (will override defaults above)
[ -r /etc/default/ser2net ] && . /etc/default/ser2net
# this is from madduck on IRC, 2006-07-06
# There should be a better possibility to give daemon error messages
# and/or to log things
log()
{
case "$1" in
[[:digit:]]*) success=$1; shift;;
*) :;;
esac
log_action_begin_msg "$1"; shift
log_action_end_msg ${success:-0} "$*"
}
start () {
if ! pidofproc -p "$PIDFILE" "$DAEMON" >/dev/null; then
start_daemon -p $PIDFILE $DAEMON ${CONTROLPORT:+-p} $CONTROLPORT -c $CONFFILE -P $PIDFILE $OPTIONS
ret=$?
else
log_failure_msg "already running!"
log_end_msg 1
exit 1
fi
return $ret
}
stop () {
# this is a workaround for #451529 as ser2net 2.5 does not delete its pidfile
SIG="${1:--TERM}"
killproc -p "$PIDFILE" "$DAEMON" "$SIG"
# this is a workaround for killproc -TERM not zapping the pidfile
if ! pidofproc -p "$PIDFILE" "$DAEMON" >/dev/null; then
rm -f $PIDFILE
fi
}
status()
{
log_action_begin_msg "checking $DESC"
if pidofproc -p "$PIDFILE" "$DAEMON" >/dev/null; then
log_action_end_msg 0 "$NAME running"
else
if [ -e "$PIDFILE" ]; then
log_action_end_msg 1 "$NAME failed"
exit 1
else
log_action_end_msg 0 "$NAME not running"
exit 3
fi
fi
}
if ! [ -e "$CONFFILE" ]; then
log_failure_msg "Not starting ser2net: Conffile $CONFFILE missing"
log_end_msg 1
exit 1
fi
case "$1" in
start)
log_daemon_msg "Starting $DESC" "$NAME"
start
log_end_msg 0
;;
stop)
log_daemon_msg "Stopping $DESC" "$NAME"
stop
log_end_msg 0
;;
reload|force-reload)
log_daemon_msg "Reloading $DESC" "$NAME"
stop "-HUP"
log_end_msg 0
;;
restart)
log_daemon_msg "Restarting $DESC" "$NAME"
stop
sleep 1
start
log_end_msg 0
;;
status)
status
;;
*)
N=/etc/init.d/$NAME
echo "Usage: $N {start|stop|restart|reload|force-reload|status}" >&2
exit 1
;;
esac
exit 0
Ser2net config
%YAML 1.1
---
# This is a ser2net configuration file, tailored to be rather
# simple.
#
# Find detailed documentation in ser2net.yaml(5)
# A fully featured configuration file is in
# /usr/share/doc/ser2net/examples/ser2net.yaml.gz
#
# If you find your configuration more useful than this very simple
# one, please submit it as a bugreport
admin: &admin1
accepter: tcp,2000
#########################
...
# initial serial: XXXXXXXXXXX, full path: /dev/serial/by-path/pci-0000:00:14.0-usb-0:3.1.1.4:1.0-port0
connection: &con225
accepter: telnet(rfc2217),tcp,38225
connector: serialdev,/dev/serial/by-path/pci-0000:00:14.0-usb-0:3.1.1.4:1.0-port0,115200n81,local
enable: on
timeout: 18000
options:
max-connections: 7
Admin console 'shortport'
-> showport con225
showport con225
port: con225
accepter: telnet(rfc2217),tcp,38225
enable state: on
timeout: 18000
connected to: ipv6,::ffff:127.0.0.1,53900
bytes read from TCP: 27
bytes written to TCP: 1860893
connected to: ipv6,::ffff:127.0.0.1,43274
bytes read from TCP: 16
bytes written to TCP: 19
connected to: ipv6,::ffff:127.0.0.1,56532
bytes read from TCP: 16
bytes written to TCP: 19
connected to: ipv6,::ffff:127.0.0.1,57802
bytes read from TCP: 16
bytes written to TCP: 19
connected to: ipv6,::ffff:127.0.0.1,53022
bytes read from TCP: 16
bytes written to TCP: 19
connected to: ipv6,::ffff:127.0.0.1,34462
bytes read from TCP: 16
bytes written to TCP: 19
device: serialdev,/dev/serial/by-path/pci-0000:00:14.0-usb-0:3.1.1.4:1.0-port0,115200n81,local
device config: 115200N81,CLOCAL,HANGUP_WHEN_DONE
device controls: RTSHI,DTRHI
tcp to device state: waiting output
device to tcp state: closing
bytes read from device: 1860912
bytes written to device: 12233
We have 3 similar setups on other ubuntu machines, no problems there for more than a year, however there we connect directly via python telnet lib instead of HTTP API, and have older versions of ser2net (4.5.0) and gensiot (2.7.6).
Workaround
So far managed to fix these ports by performing setportenable off|on via admin console. Another thing that fixes this issue is the connection timeout from ser2net config.
I thought it could perhaps be TCP deadlock (similar to iperf3 issue I encountered), but not sure how that could happen on localhost, as it requires loosing packets. I assume that one TCP connection hangs, buffers fill and no output can be sent, because ser2net wants to send output to all connections, but one of them does not respond.
Tried only with these versions so far:
gensiot 2.8.2 (built from source)
ser2net 4.6.0 (built from source, might be newer than actual version)
Can try to reproduce with other versions if you need, but I can only reproduce this on our specific infrastructure and within a day or two.
This is an incredible and useful amount of detail.
There is a bug in showport and friends, the state string is off by one. The actual values being reported for the states are "waiting input" and "waiting output", not "waiting output" and "closing". The "waiting output" state there means one of the network instances there is unable to be written to, as you have already surmised. It's probably the first one, as it's the one that has all the data written to it; the others appear to not have data, so I assume those connections came in after the first one hung.
You have rfc2217 configured, are the connections actually using that? If not, could you turn it off? The only major changes to ser2net between 4.5.0 and 4.6.0 are dealing with rfc2217 handling, but if you aren't using that, it's not going to make much difference there.
But nothing has changed in the networking code or in the base code between gensio 2.7.6 and 2.8.2 except for one change that shouldn't cause any issue like this. There was one change in the telnet code involving initialization sequences that might be applicable.
You don't appear to be using multiple threads in ser2net, so it's not a race condition (at least not that kind of race condition).
If I understand correctly, you were connecting directly to ser2net from a windows application and were seeing the problem often. You changed to using an HTTP proxy to receive HTTP connections that then get converted to TCP connections to ser2net and the problem lessened, but didn't go away.
It is at least theoretically possible that it's an issue in the Windows application. If it stops reading data for some reason, this is what you would see. It's possible, for instance, that the telnet changes I mentioned above changed the timing of initialization and confused the Windows app. Do you know in this case if the Windows application has received any data on the port? Since there are lots of buffers between ser2net and the Windows app, 1860893 bytes is fairly close to what I would expect if something just opened a port and never read from it.
Outside of that, if you could try gensio 2.7.6 and ser2net 4.5.0 on the system in question (just to eliminate the hardware or something else about that system or the environment), that would probably be the next step. If that eliminates the problem, we can start bisecting.
Thanks,
-corey
Not sure if we can call it an HTTP proxy in this case (uncertain about the term itself, maybe it's correct), we use it as such:
- Windows machine performs API/HTTP request to
/api/read_until - HTTP/API server receives request, makes a separate connection to
ser2netwith pythonpyseriallib. - Data is read until timeout or some value is found
- Close
ser2netconnection, return data for HTTP/API
So the windows machines no longer directly connect to ser2net, just ask the API to perform actions. Windows machines previously connected directly via TCP, the new "proxy" uses pyserial and connects using rfc2217. With pyserial we only read/write, no negotiation or sending special signals, so perhaps we could go back to using raw TCP, but it is more convenient when we want to connect to ser2net manually, because connecting to raw TCP mangles newlines, echo, etc.
I will attempt your suggestion today and revert to the older ser2net and gensio versions, but results could take up to monday.
Another issue I ran into today was trying to debug a port, executed showshortport con225 in admin cli and ser2net got stuck. Had to restart whole daemon, now when connecting to some of the ports I get:
Connected to X.X.X.X.
Escape character is '^]'.
Device open failure: Value or file not found
Connection closed by foreign host.
This issue could perhaps be related to disconnecting FTDI devices due to lack of power in hubs, will investigate further.
On Thu, Mar 14, 2024 at 05:23:43AM -0700, ffolkes1911 wrote:
Not sure if we can call it an HTTP proxy in this case (uncertain about the term itself, maybe it's correct), we use it as such:
- Windows machine performs API/HTTP request to
/api/read_until- HTTP/API server receives request, makes a separate connection to
ser2netwith pythonpyseriallib.- Data is read until timeout or some value is found
- Close
ser2netconnection, return data for HTTP/API
I'd call it a proxy, but it's all a matter of definition, I suppose.
So the windows machines no longer directly connect to ser2net, just ask the API to perform actions. Windows machines previously connected directly via TCP, the new "proxy" uses
pyserialand connects usingrfc2217. Withpyserialwe only read/write, no negotiation or sending special signals, so perhaps we could go back to using raw TCP, but it is more convenient when we want to connect toser2netmanually, because connecting to raw TCP mangles newlines, echo, etc.
I'm not sure I understand. If you use raw TCP on both sides, you shouldn't have any trouble with echo or things like. But if you have raw TCP on one side and not the other, then you will see problems.
If you aren't doing any rfc2217 control or sending serial breaks, there's no point in having telnet in the mix for what you are doing.
I will attempt your suggestion today and revert to the older
ser2netandgensioversions, but results could take up to monday.
Ok, that's good. I'm going to be traveling, but I will try to keep up.
Another issue I ran into today was trying to debug a port, executed
showshortport con225in admin cli andser2netgot stuck. Had to restart whole daemon, now when connecting to some of the ports I get:Connected to X.X.X.X. Escape character is '^]'. Device open failure: Value or file not found Connection closed by foreign host.This issue could perhaps be related to disconnecting FTDI devices due to lack of power in hubs, will investigate further.
This looks like a hardware or driver failure. I've seen driver hangs like this with flaky hardware. The "Value or file not found" means that the /dev/tty... serial port doesn't exist any more, so it somehow went away or can't be opened.
Day one - no issues so far with older ser2net and gensio, will check after weekend. Only changed ser2net and gensio, no additional USB hub power, different FTDI chips or config modifications, to keep variables to a minimum.
If you use raw TCP on both sides, you shouldn't have any trouble with echo or things like.
Retested today on separate machine to make sure. Raw TCP (both sides, in ser2net config only removed telnet(rfc2217)) echos my own input (probably because our devices echo serial input back to FTDI) and we lose all pseudoterminal functionality (visible inputs, no 'tab' for autocomplete). Connected using socat - TCP:localhost:38001.
Same thing happens when connecting raw TCP client -> ser2net telnet, so I suspect that socat does not transmit some keystrokes correctly (tab, arrowkeys). There are ways to map them in socat, but this is just out of curiosity, it would be too difficult to use/setup.
Dropping rfc2217 and using just telnet (both sides) worked fine during short test. I had forgotten that telnet and rfc2217 are separate things, seems that rfc2217 adds something like serial option negotiation, but telnet itself has the pseudoterminal functionality.
Unfortunately did not reproduce identical initial error yet, however we had a new problem. Turns out an error in our project was leaking connections (not closing them) in certain scenarios. This gave me an idea and I tested a scenario which produces an error similar to the initial error:
- connect to ser2net via another device
- disconnect device from common network (unplug ethernet)
- connect to ser2net via different device (can be localhost)
- spam input/output to fill buffers
This results in output freezing on the second connection, admin console shows see device to tcp state: closing, new connections get some initial output, but cannot produce new output. In admin console I can still see the first connection, so I assume that ser2net attempts to send data there, but there's no response, buffers fill and everything freezes.
If this is indeed the same problem then fixing connection leaking in my project should fix these issues, however this can still raise problems in the future. I tried to find some option for tcp keepalive/timeout in gensio github and man pages with no success. Are there no such options in gensio and I would have to resort to changing system defaults (I assume that's what gensio uses) ?
On Tue, Mar 19, 2024 at 03:49:44AM -0700, ffolkes1911 wrote:
Unfortunately did not reproduce identical initial error yet, however we had a new problem. Turns out an error in our project was leaking connections (not closing them) in certain scenarios. This gave me an idea and I tested a scenario which produces an error similar to the initial error:
- connect to ser2net via another device
- disconnect device from common network (unplug ethernet)
- connect to ser2net via different device (can be localhost)
- spam input/output to fill buffers
This results in output freezing on the second connection, admin console shows see
device to tcp state: closing, new connections get some initial output, but cannot produce new output. In admin console I can still see the first connection, so I assume thatser2netattempts to send data there, but there's no response, buffers fill and everything freezes.If this is indeed the same problem then fixing connection leaking in my project should fix these issues, however this can still raise problems in the future. I tried to find some option for tcp keepalive/timeout in
gensiogithub andmanpages with no success. Are there no such options ingensioand I would have to resort to changing system defaults (I assume that's whatgensiouses) ?
That does make sense. I was wondering about all the connections.
It does set TCP keepalive on the connection by default (you can't turn it off). Or at least it should, I don't really have a testcase for that because it's pretty hard to do. I can try to do a hand test with a VM or something. The timeout on that is pretty long, 7200 seconds before starting keeplive. You can do "man 7 tcp" and search for keepalive for info. If there is data, the timeout is around 90 seconds, I believe.
I'm not sure how fast your things run, but those are long timeouts and the 90 second one cannot be changed.
But thinking about that some more, if the connection on your client's side is still open, keepalives won't help. As far as the OS is concerned the connection is open still; it won't time it out. So there's not much I can do about it. The buffer being full is not a condition where it will time it out. You could set the ser2net timeouts, but that's probably not what you want to do.
I've been sick for most of last week, sorry for the delay.
Seems that the original issue was not reproduced with older ser2net and gensio, but as I mentioned that may be due to our connection leak fix, so I will go back again to latest ser2net and gensio versions and check if problem persists.
If there is data, the timeout is around 90 seconds, I believe
It seems that I had misunderstood how it works. Tested today (python TCP socket client -> ser2net server) and it appears that yes, if any data is being sent over TCP socket, even if receiver is unreachable, it resets the keepalive "idle" timer and connection does not close. Unsure about that timeout value, seems somewhat random between 90-960s on client laptop (all default values).
from ser2net.yaml (4.5.0)
MULTIPLE CONNECTIONS As mentioned earlier, you can set max-connections=
on a port to allow more than one connection at a time to the same serial port. These connections will share all the settings. ...
timeout will be per TCP port and will only disconnect that TCP port on a timeout.
I have also tested ser2net timeout implementation, but the hanging connection does not get closed if any traffic is being sent to it (generated traffic from server side), so it seems to work similarly to TCP keepalive, and would maybe only work when buffers are full.
We will probably need to reduce default TCP keepalive values on the server, because it looks like a good practice on servers.
After going back to the new versions (ser2net 4.6.0), one port got stuck with hanging connections overnight. I have some python logs, but I later reproduced them when I maxed out connections, so it's most likely not the cause:
Python connection logs
(03-26 07:30:29) ERROR : Serial connect failed
(03-26 07:30:29) DEBUG : Traceback (most recent call last):
File ".../our_packages/serial_operations.py", line 66, in serial_connect
self.CONN = serial.serial_for_url(
File ".../lib/python3.10/site-packages/serial/__init__.py", line 90, in serial_for_url
instance.open()
File ".../lib/python3.10/site-packages/serial/rfc2217.py", line 481, in open
raise SerialException(
serial.serialutil.SerialException: Remote does not seem to support RFC2217 or BINARY mode [we-BINARY:True(ACTIVE), we-RFC2217:False(REQUESTED)]
....
(03-26 07:32:21) ERROR : Serial connect failed
(03-26 07:32:21) DEBUG : Traceback (most recent call last):
File ".../our_packages/serial_operations.py", line 66, in serial_connect
self.CONN = serial.serial_for_url(
File ".../lib/python3.10/site-packages/serial/__init__.py", line 90, in serial_for_url
instance.open()
File ".../lib/python3.10/site-packages/serial/rfc2217.py", line 487, in open
self._reconfigure_port()
File ".../lib/python3.10/site-packages/serial/rfc2217.py", line 515, in _reconfigure_port
self._rfc2217_port_settings['baudrate'].set(struct.pack(b'!I', self._baudrate))
File ".../lib/python3.10/site-packages/serial/rfc2217.py", line 335, in set
self.connection.rfc2217_send_subnegotiation(self.option, self.value)
File ".../lib/python3.10/site-packages/serial/rfc2217.py", line 867, in rfc2217_send_subnegotiation
self._internal_raw_write(IAC + SB + COM_PORT_OPTION + option + value + IAC + SE)
File ".../lib/python3.10/site-packages/serial/rfc2217.py", line 858, in _internal_raw_write
self._socket.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe
I will try to setup more python logging and see if we get more info.
Hi again, sorry for the delay. We've been increasing and analyzing logging, and I think we found the problem. It seems that there's a memleak in FastAPI, where upon raising any error inside an endpoint, none of the local scope variables are cleaned by GC, so our teardown did not close connections to ser2net occasionally. We fixed this on our end, and tested with initial ser2net version (4.6.0, built from source) - no problems, closing issue.
Thank you for the assistance and keep up the great work.