ser2net icon indicating copy to clipboard operation
ser2net copied to clipboard

'device to tcp state' stuck on 'closing'

Open ffolkes1911 opened this issue 2 years ago • 8 comments

Sometimes our connections get stuck on closing state, as seen in showport on admin console. Seems that outside windows clients encountered this problem more often (they were also connecting via raw tcp socket and they were too lazy to fix), so we moved to localhost and remote machines connect using HTTP API -> python telnet . On localhost ubuntu machine the problem seems to be a lot more rare, but still appears.

When I connect to a port that is stuck in this fashion, I may get some initial output, but it is most likely from the buffer, as I get it every time I try to connect. Attempting to send input or generate more output from serial side does not result in output being sent to connections.

Ser2net is set up as a user daemon:

init.d config
$ cat /etc/init.d/ser2net 
#! /bin/bash
#
# ser2net	init script for ser2net
#
#		Written by Miquel van Smoorenburg <[email protected]>.
#		Modified for Debian GNU/Linux
#		by Ian Murdock <[email protected]>.
#		Modified for ser2net by Marc Haber <[email protected]>

### BEGIN INIT INFO
# Provides:          ser2net
# Required-Start:    $network $remote_fs
# Required-Stop:     $network $remote_fs
# Should-Start:
# Should-Stop:
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Allows network connections to serial ports
# Description:       This daemon allows telnet and tcp sessions to be established with a unit's serial ports.
### END INIT INFO

set -e

if [ -r "/lib/lsb/init-functions" ]; then
  . /lib/lsb/init-functions
else
  echo "E: /lib/lsb/init-functions not found, lsb-base (>= 3.0-6) needed"
  exit 1
fi

if [ -n "$SER2NETDEBUG" ]; then
  echo "now debugging $0 $@"
  set -x
fi

LANG=C
export LANG

PATH=/sbin:/bin:/usr/sbin:/usr/bin
DAEMON=/usr/sbin/ser2net
NAME=ser2net
DESC="Serial port to network proxy"
PIDFILE=/run/$NAME.pid

test -f $DAEMON || exit 0


# Defaults
CONFFILE="/etc/ser2net.yaml"
OPTIONS=""
CONTROLPORT=""

# Read config file (will override defaults above)
[ -r /etc/default/ser2net ] && . /etc/default/ser2net

# this is from madduck on IRC, 2006-07-06
# There should be a better possibility to give daemon error messages
# and/or to log things
log()
{
  case "$1" in
    [[:digit:]]*) success=$1; shift;;
    *) :;;
  esac
  log_action_begin_msg "$1"; shift
  log_action_end_msg ${success:-0} "$*"
}

start () {
  if ! pidofproc -p "$PIDFILE" "$DAEMON" >/dev/null; then
      start_daemon -p $PIDFILE $DAEMON ${CONTROLPORT:+-p} $CONTROLPORT -c $CONFFILE -P $PIDFILE $OPTIONS
      ret=$?
  else
    log_failure_msg "already running!"
    log_end_msg 1
    exit 1
  fi
  return $ret
}

stop () {
  # this is a workaround for #451529 as ser2net 2.5 does not delete its pidfile
  SIG="${1:--TERM}"
  killproc -p "$PIDFILE" "$DAEMON" "$SIG"
  # this is a workaround for killproc -TERM not zapping the pidfile
  if ! pidofproc -p "$PIDFILE" "$DAEMON" >/dev/null; then
    rm -f $PIDFILE
  fi
}

status()
{
  log_action_begin_msg "checking $DESC"
  if pidofproc -p "$PIDFILE" "$DAEMON" >/dev/null; then
    log_action_end_msg 0 "$NAME running"
  else
    if [ -e "$PIDFILE" ]; then
      log_action_end_msg 1 "$NAME failed"
      exit 1
    else
      log_action_end_msg 0 "$NAME not running"
      exit 3
    fi
  fi
}

if ! [ -e "$CONFFILE" ]; then
  log_failure_msg "Not starting ser2net: Conffile $CONFFILE missing"
  log_end_msg 1
  exit 1
fi

case "$1" in
  start)
	log_daemon_msg "Starting $DESC" "$NAME"
	start
	log_end_msg 0
	;;
  stop)
	log_daemon_msg "Stopping $DESC" "$NAME"
	stop
	log_end_msg 0
	;;
  reload|force-reload)
	log_daemon_msg "Reloading $DESC" "$NAME"
	stop "-HUP"
	log_end_msg 0
	;;
  restart)
	log_daemon_msg "Restarting $DESC" "$NAME"
	stop
	sleep 1
	start
	log_end_msg 0
	;;
  status)
  	status
	;;
  *)
	N=/etc/init.d/$NAME
	echo "Usage: $N {start|stop|restart|reload|force-reload|status}" >&2
	exit 1
	;;
esac

exit 0

Ser2net config
%YAML 1.1
---
# This is a ser2net configuration file, tailored to be rather
# simple.
#
# Find detailed documentation in ser2net.yaml(5)
# A fully featured configuration file is in
# /usr/share/doc/ser2net/examples/ser2net.yaml.gz
# 
# If you find your configuration more useful than this very simple
# one, please submit it as a bugreport

admin: &admin1
  accepter: tcp,2000


#########################

...

# initial serial: XXXXXXXXXXX, full path: /dev/serial/by-path/pci-0000:00:14.0-usb-0:3.1.1.4:1.0-port0
connection: &con225
  accepter: telnet(rfc2217),tcp,38225
  connector: serialdev,/dev/serial/by-path/pci-0000:00:14.0-usb-0:3.1.1.4:1.0-port0,115200n81,local
  enable: on
  timeout: 18000
  options:
    max-connections: 7

Admin console 'shortport'
-> showport con225
showport con225
port: con225
  accepter: telnet(rfc2217),tcp,38225
  enable state: on
  timeout: 18000
  connected to: ipv6,::ffff:127.0.0.1,53900
    bytes read from TCP: 27
    bytes written to TCP: 1860893
  connected to: ipv6,::ffff:127.0.0.1,43274
    bytes read from TCP: 16
    bytes written to TCP: 19
  connected to: ipv6,::ffff:127.0.0.1,56532
    bytes read from TCP: 16
    bytes written to TCP: 19
  connected to: ipv6,::ffff:127.0.0.1,57802
    bytes read from TCP: 16
    bytes written to TCP: 19
  connected to: ipv6,::ffff:127.0.0.1,53022
    bytes read from TCP: 16
    bytes written to TCP: 19
  connected to: ipv6,::ffff:127.0.0.1,34462
    bytes read from TCP: 16
    bytes written to TCP: 19
  device: serialdev,/dev/serial/by-path/pci-0000:00:14.0-usb-0:3.1.1.4:1.0-port0,115200n81,local
  device config: 115200N81,CLOCAL,HANGUP_WHEN_DONE
  device controls: RTSHI,DTRHI
  tcp to device state: waiting output
  device to tcp state: closing
  bytes read from device: 1860912
  bytes written to device: 12233

We have 3 similar setups on other ubuntu machines, no problems there for more than a year, however there we connect directly via python telnet lib instead of HTTP API, and have older versions of ser2net (4.5.0) and gensiot (2.7.6).

Workaround

So far managed to fix these ports by performing setportenable off|on via admin console. Another thing that fixes this issue is the connection timeout from ser2net config.

I thought it could perhaps be TCP deadlock (similar to iperf3 issue I encountered), but not sure how that could happen on localhost, as it requires loosing packets. I assume that one TCP connection hangs, buffers fill and no output can be sent, because ser2net wants to send output to all connections, but one of them does not respond.

Tried only with these versions so far: gensiot 2.8.2 (built from source)
ser2net 4.6.0 (built from source, might be newer than actual version)

Can try to reproduce with other versions if you need, but I can only reproduce this on our specific infrastructure and within a day or two.

ffolkes1911 avatar Mar 13 '24 07:03 ffolkes1911

This is an incredible and useful amount of detail.

There is a bug in showport and friends, the state string is off by one. The actual values being reported for the states are "waiting input" and "waiting output", not "waiting output" and "closing". The "waiting output" state there means one of the network instances there is unable to be written to, as you have already surmised. It's probably the first one, as it's the one that has all the data written to it; the others appear to not have data, so I assume those connections came in after the first one hung.

You have rfc2217 configured, are the connections actually using that? If not, could you turn it off? The only major changes to ser2net between 4.5.0 and 4.6.0 are dealing with rfc2217 handling, but if you aren't using that, it's not going to make much difference there.

But nothing has changed in the networking code or in the base code between gensio 2.7.6 and 2.8.2 except for one change that shouldn't cause any issue like this. There was one change in the telnet code involving initialization sequences that might be applicable.

You don't appear to be using multiple threads in ser2net, so it's not a race condition (at least not that kind of race condition).

If I understand correctly, you were connecting directly to ser2net from a windows application and were seeing the problem often. You changed to using an HTTP proxy to receive HTTP connections that then get converted to TCP connections to ser2net and the problem lessened, but didn't go away.

It is at least theoretically possible that it's an issue in the Windows application. If it stops reading data for some reason, this is what you would see. It's possible, for instance, that the telnet changes I mentioned above changed the timing of initialization and confused the Windows app. Do you know in this case if the Windows application has received any data on the port? Since there are lots of buffers between ser2net and the Windows app, 1860893 bytes is fairly close to what I would expect if something just opened a port and never read from it.

Outside of that, if you could try gensio 2.7.6 and ser2net 4.5.0 on the system in question (just to eliminate the hardware or something else about that system or the environment), that would probably be the next step. If that eliminates the problem, we can start bisecting.

Thanks,

-corey

cminyard avatar Mar 13 '24 16:03 cminyard

Not sure if we can call it an HTTP proxy in this case (uncertain about the term itself, maybe it's correct), we use it as such:

  1. Windows machine performs API/HTTP request to /api/read_until
  2. HTTP/API server receives request, makes a separate connection to ser2net with python pyserial lib.
  3. Data is read until timeout or some value is found
  4. Close ser2net connection, return data for HTTP/API

So the windows machines no longer directly connect to ser2net, just ask the API to perform actions. Windows machines previously connected directly via TCP, the new "proxy" uses pyserial and connects using rfc2217. With pyserial we only read/write, no negotiation or sending special signals, so perhaps we could go back to using raw TCP, but it is more convenient when we want to connect to ser2net manually, because connecting to raw TCP mangles newlines, echo, etc.

I will attempt your suggestion today and revert to the older ser2net and gensio versions, but results could take up to monday.

Another issue I ran into today was trying to debug a port, executed showshortport con225 in admin cli and ser2net got stuck. Had to restart whole daemon, now when connecting to some of the ports I get:

Connected to X.X.X.X.
Escape character is '^]'.
Device open failure: Value or file not found
Connection closed by foreign host.

This issue could perhaps be related to disconnecting FTDI devices due to lack of power in hubs, will investigate further.

ffolkes1911 avatar Mar 14 '24 12:03 ffolkes1911

On Thu, Mar 14, 2024 at 05:23:43AM -0700, ffolkes1911 wrote:

Not sure if we can call it an HTTP proxy in this case (uncertain about the term itself, maybe it's correct), we use it as such:

  1. Windows machine performs API/HTTP request to /api/read_until
  2. HTTP/API server receives request, makes a separate connection to ser2net with python pyserial lib.
  3. Data is read until timeout or some value is found
  4. Close ser2net connection, return data for HTTP/API

I'd call it a proxy, but it's all a matter of definition, I suppose.

So the windows machines no longer directly connect to ser2net, just ask the API to perform actions. Windows machines previously connected directly via TCP, the new "proxy" uses pyserial and connects using rfc2217. With pyserial we only read/write, no negotiation or sending special signals, so perhaps we could go back to using raw TCP, but it is more convenient when we want to connect to ser2net manually, because connecting to raw TCP mangles newlines, echo, etc.

I'm not sure I understand. If you use raw TCP on both sides, you shouldn't have any trouble with echo or things like. But if you have raw TCP on one side and not the other, then you will see problems.

If you aren't doing any rfc2217 control or sending serial breaks, there's no point in having telnet in the mix for what you are doing.

I will attempt your suggestion today and revert to the older ser2net and gensio versions, but results could take up to monday.

Ok, that's good. I'm going to be traveling, but I will try to keep up.

Another issue I ran into today was trying to debug a port, executed showshortport con225 in admin cli and ser2net got stuck. Had to restart whole daemon, now when connecting to some of the ports I get:

Connected to X.X.X.X.
Escape character is '^]'.
Device open failure: Value or file not found
Connection closed by foreign host.

This issue could perhaps be related to disconnecting FTDI devices due to lack of power in hubs, will investigate further.

This looks like a hardware or driver failure. I've seen driver hangs like this with flaky hardware. The "Value or file not found" means that the /dev/tty... serial port doesn't exist any more, so it somehow went away or can't be opened.

cminyard avatar Mar 14 '24 16:03 cminyard

Day one - no issues so far with older ser2net and gensio, will check after weekend. Only changed ser2net and gensio, no additional USB hub power, different FTDI chips or config modifications, to keep variables to a minimum.

If you use raw TCP on both sides, you shouldn't have any trouble with echo or things like.

Retested today on separate machine to make sure. Raw TCP (both sides, in ser2net config only removed telnet(rfc2217)) echos my own input (probably because our devices echo serial input back to FTDI) and we lose all pseudoterminal functionality (visible inputs, no 'tab' for autocomplete). Connected using socat - TCP:localhost:38001.
Same thing happens when connecting raw TCP client -> ser2net telnet, so I suspect that socat does not transmit some keystrokes correctly (tab, arrowkeys). There are ways to map them in socat, but this is just out of curiosity, it would be too difficult to use/setup.

Dropping rfc2217 and using just telnet (both sides) worked fine during short test. I had forgotten that telnet and rfc2217 are separate things, seems that rfc2217 adds something like serial option negotiation, but telnet itself has the pseudoterminal functionality.

ffolkes1911 avatar Mar 15 '24 12:03 ffolkes1911

Unfortunately did not reproduce identical initial error yet, however we had a new problem. Turns out an error in our project was leaking connections (not closing them) in certain scenarios. This gave me an idea and I tested a scenario which produces an error similar to the initial error:

  1. connect to ser2net via another device
  2. disconnect device from common network (unplug ethernet)
  3. connect to ser2net via different device (can be localhost)
  4. spam input/output to fill buffers

This results in output freezing on the second connection, admin console shows see device to tcp state: closing, new connections get some initial output, but cannot produce new output. In admin console I can still see the first connection, so I assume that ser2net attempts to send data there, but there's no response, buffers fill and everything freezes.

If this is indeed the same problem then fixing connection leaking in my project should fix these issues, however this can still raise problems in the future. I tried to find some option for tcp keepalive/timeout in gensio github and man pages with no success. Are there no such options in gensio and I would have to resort to changing system defaults (I assume that's what gensio uses) ?

ffolkes1911 avatar Mar 19 '24 10:03 ffolkes1911

On Tue, Mar 19, 2024 at 03:49:44AM -0700, ffolkes1911 wrote:

Unfortunately did not reproduce identical initial error yet, however we had a new problem. Turns out an error in our project was leaking connections (not closing them) in certain scenarios. This gave me an idea and I tested a scenario which produces an error similar to the initial error:

  1. connect to ser2net via another device
  2. disconnect device from common network (unplug ethernet)
  3. connect to ser2net via different device (can be localhost)
  4. spam input/output to fill buffers

This results in output freezing on the second connection, admin console shows see device to tcp state: closing, new connections get some initial output, but cannot produce new output. In admin console I can still see the first connection, so I assume that ser2net attempts to send data there, but there's no response, buffers fill and everything freezes.

If this is indeed the same problem then fixing connection leaking in my project should fix these issues, however this can still raise problems in the future. I tried to find some option for tcp keepalive/timeout in gensio github and man pages with no success. Are there no such options in gensio and I would have to resort to changing system defaults (I assume that's what gensio uses) ?

That does make sense. I was wondering about all the connections.

It does set TCP keepalive on the connection by default (you can't turn it off). Or at least it should, I don't really have a testcase for that because it's pretty hard to do. I can try to do a hand test with a VM or something. The timeout on that is pretty long, 7200 seconds before starting keeplive. You can do "man 7 tcp" and search for keepalive for info. If there is data, the timeout is around 90 seconds, I believe.

I'm not sure how fast your things run, but those are long timeouts and the 90 second one cannot be changed.

But thinking about that some more, if the connection on your client's side is still open, keepalives won't help. As far as the OS is concerned the connection is open still; it won't time it out. So there's not much I can do about it. The buffer being full is not a condition where it will time it out. You could set the ser2net timeouts, but that's probably not what you want to do.

cminyard avatar Mar 19 '24 12:03 cminyard

I've been sick for most of last week, sorry for the delay.
Seems that the original issue was not reproduced with older ser2net and gensio, but as I mentioned that may be due to our connection leak fix, so I will go back again to latest ser2net and gensio versions and check if problem persists.

If there is data, the timeout is around 90 seconds, I believe

It seems that I had misunderstood how it works. Tested today (python TCP socket client -> ser2net server) and it appears that yes, if any data is being sent over TCP socket, even if receiver is unreachable, it resets the keepalive "idle" timer and connection does not close. Unsure about that timeout value, seems somewhat random between 90-960s on client laptop (all default values).

from ser2net.yaml (4.5.0)

MULTIPLE CONNECTIONS As mentioned earlier, you can set max-connections= on a port to allow more than one connection at a time to the same serial port. These connections will share all the settings. ...
timeout will be per TCP port and will only disconnect that TCP port on a timeout.

I have also tested ser2net timeout implementation, but the hanging connection does not get closed if any traffic is being sent to it (generated traffic from server side), so it seems to work similarly to TCP keepalive, and would maybe only work when buffers are full.

We will probably need to reduce default TCP keepalive values on the server, because it looks like a good practice on servers.

ffolkes1911 avatar Mar 25 '24 12:03 ffolkes1911

After going back to the new versions (ser2net 4.6.0), one port got stuck with hanging connections overnight. I have some python logs, but I later reproduced them when I maxed out connections, so it's most likely not the cause:

Python connection logs

(03-26 07:30:29) ERROR  : Serial connect failed 
(03-26 07:30:29) DEBUG           : Traceback (most recent call last):
  File ".../our_packages/serial_operations.py", line 66, in serial_connect
    self.CONN = serial.serial_for_url(
  File ".../lib/python3.10/site-packages/serial/__init__.py", line 90, in serial_for_url
    instance.open()
  File ".../lib/python3.10/site-packages/serial/rfc2217.py", line 481, in open
    raise SerialException(
serial.serialutil.SerialException: Remote does not seem to support RFC2217 or BINARY mode [we-BINARY:True(ACTIVE), we-RFC2217:False(REQUESTED)]

....

(03-26 07:32:21) ERROR  : Serial connect failed 
(03-26 07:32:21) DEBUG           : Traceback (most recent call last):
  File ".../our_packages/serial_operations.py", line 66, in serial_connect
    self.CONN = serial.serial_for_url(
  File ".../lib/python3.10/site-packages/serial/__init__.py", line 90, in serial_for_url
    instance.open()
  File ".../lib/python3.10/site-packages/serial/rfc2217.py", line 487, in open
    self._reconfigure_port()
  File ".../lib/python3.10/site-packages/serial/rfc2217.py", line 515, in _reconfigure_port
    self._rfc2217_port_settings['baudrate'].set(struct.pack(b'!I', self._baudrate))
  File ".../lib/python3.10/site-packages/serial/rfc2217.py", line 335, in set
    self.connection.rfc2217_send_subnegotiation(self.option, self.value)
  File ".../lib/python3.10/site-packages/serial/rfc2217.py", line 867, in rfc2217_send_subnegotiation
    self._internal_raw_write(IAC + SB + COM_PORT_OPTION + option + value + IAC + SE)
  File ".../lib/python3.10/site-packages/serial/rfc2217.py", line 858, in _internal_raw_write
    self._socket.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe

I will try to setup more python logging and see if we get more info.

ffolkes1911 avatar Mar 26 '24 06:03 ffolkes1911

Hi again, sorry for the delay. We've been increasing and analyzing logging, and I think we found the problem. It seems that there's a memleak in FastAPI, where upon raising any error inside an endpoint, none of the local scope variables are cleaned by GC, so our teardown did not close connections to ser2net occasionally. We fixed this on our end, and tested with initial ser2net version (4.6.0, built from source) - no problems, closing issue.

Thank you for the assistance and keep up the great work.

ffolkes1911 avatar Apr 22 '24 05:04 ffolkes1911