
Transport endpoint is not connected

Open Franco-Sparrow opened this issue 1 year ago • 68 comments

Description of problem

This is a random problem related to gluster client disconnections; we cannot always reproduce it, as it occurs randomly (we suspect it happens when the SDS is under heavy load). We have upgraded from gluster 8.4, passing through all of the gluster 10.x versions, and even with the latest 10.5 we keep facing the same problem. The mount point suffers a brief disconnection, and that is fatal for an SDS providing storage to VMs. This time the mount point recovered by itself, but that brief disconnection is enough to throw I/O errors in all VMs currently running on the node.

In this new version of gluster the problem was mitigated: only the affected volume is impacted. Before this, a reboot of the entire node was required, because all gluster mount points on the affected node were hit. So it is the same underlying problem, but with different behavior now. I know that a distributed two-way replicated volume is not the best setup, and with replica 3 I might not face this problem in the same way, because of the quorum and the protections against node disconnections... but is there any way to fix this gluster client disconnection?
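
For context on the quorum point, client and server quorum on replicated volumes are controlled by volume options along these lines (a sketch only; the option names are real, the values are illustrative, and client quorum only really protects you with replica 3 or an arbiter):

# check how quorum is currently configured
gluster volume get vol2 cluster.quorum-type
gluster volume get vol2 cluster.server-quorum-type

# typical settings for VM workloads on replica 3 / arbiter volumes
gluster volume set vol2 cluster.quorum-type auto
gluster volume set vol2 cluster.server-quorum-type server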

(screenshot)

Expected results

No disconnections of the client from the rest of the cluster.

Mandatory info:

The output of the gluster volume info command


Volume Name: vol2
Type: Distributed-Replicate
Volume ID: e1158040-4e60-4254-a281-e1125a27ba23
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: SERVER-N1:/data/glusterfs/vol2/brick0-1/data
Brick2: SERVER-N2:/data/glusterfs/vol2/brick0/data
Brick3: SERVER-N1:/data/glusterfs/vol2/brick1-1/data
Brick4: SERVER-N3:/data/glusterfs/vol2/brick0/data
Brick5: SERVER-N2:/data/glusterfs/vol2/brick1/data
Brick6: SERVER-N3:/data/glusterfs/vol2/brick1/data
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
features.shard: enable
features.shard-block-size: 5GB
cluster.favorite-child-policy: mtime
user.cifs: off
performance.read-ahead: off
performance.quick-read: off
performance.io-cache: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-gid: 9869
storage.owner-uid: 9869

The output of the gluster volume status command

Status of volume: vol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick SERVER-N1:/data/glusterfs/vol2/brick0-1/data   51756   0   Y   5341
Brick SERVER-N2:/data/glusterfs/vol2/brick0/data     56493   0   Y   7444
Brick SERVER-N1:/data/glusterfs/vol2/brick1-1/data   60464   0   Y   5373
Brick SERVER-N3:/data/glusterfs/vol2/brick0/data     54439   0   Y   7897
Brick SERVER-N2:/data/glusterfs/vol2/brick1/data     54583   0   Y   7476
Brick SERVER-N3:/data/glusterfs/vol2/brick1/data     49841   0   Y   7929
Self-heal Daemon on localhost               N/A       N/A        Y       5405
Self-heal Daemon on SERVER-N2             N/A       N/A        Y       7508
Self-heal Daemon on SERVER-N3             N/A       N/A        Y       7961

Task Status of Volume vol2
------------------------------------------------------------------------------
There are no active volume tasks

The output of the gluster volume heal command

gluster volume heal vol2 info
Brick SERVER-N1:/data/glusterfs/vol2/brick0-1/data
Status: Connected
Number of entries: 0

Brick SERVER-N2:/data/glusterfs/vol2/brick0/data
Status: Connected
Number of entries: 0

Brick SERVER-N1:/data/glusterfs/vol2/brick1-1/data
Status: Connected
Number of entries: 0

Brick SERVER-N3:/data/glusterfs/vol2/brick0/data
Status: Connected
Number of entries: 0

Brick SERVER-N2:/data/glusterfs/vol2/brick1/data
Status: Connected
Number of entries: 0

Brick SERVER-N3:/data/glusterfs/vol2/brick1/data
Status: Connected
Number of entries: 0

At the moment of writing this there weren't any entries pending heal, but healing had occurred earlier, as reported by the monitoring system (Zabbix) and our custom checks for it:
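
The custom check is essentially a count of pending heal entries per volume, roughly like the following (a simplified sketch, not the exact Zabbix item):

gluster volume heal vol2 info | awk '/Number of entries:/ {sum += $4} END {print sum}'
# or, on recent releases:
gluster volume heal vol2 info summary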

(screenshot)

Provide logs present on following locations of client and server nodes

No errors in the glusterd log.

Is there any crash? Provide the backtrace and coredump

My node4 is a gluster client and got disconnected from the cluster.

[2024-04-03 19:58:11.743421 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743441 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-4: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743457 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743530 +0000] E [rpc-clnt.c:172:call_bail] 0-vol2-client-4: bailing out frame type(GlusterFS 4.x v1), op(FGETXATTR(35)), xid = 0x8ac758e, unique = 445798451, sent = 2024-04-03 19:28:03.
739203 +0000, timeout = 1800 for 192.168.21.22:54583
[2024-04-03 19:58:11.743589 +0000] E [rpc-clnt.c:172:call_bail] 0-vol2-client-4: bailing out frame type(GlusterFS 4.x v1), op(FGETXATTR(35)), xid = 0x8ac758d, unique = 445798450, sent = 2024-04-03 19:28:03.
739161 +0000, timeout = 1800 for 192.168.21.22:54583
[2024-04-03 19:58:11.743613 +0000] E [rpc-clnt.c:172:call_bail] 0-vol2-client-4: bailing out frame type(GlusterFS 4.x v1), op(WRITE(13)), xid = 0x8ac758c, unique = 445798448, sent = 2024-04-03 19:28:03.7390
78 +0000, timeout = 1800 for 192.168.21.22:54583
[2024-04-03 19:58:11.743664 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-4: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743684 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743858 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743888 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744034 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2024-04-03 19:28:12.301934 +0000 (xid=0x8ac759a)
[2024-04-03 19:58:11.744034 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(FSTAT(25)) called at 2024-04-03 19:28:03.738876 +0000 (xid=0x392c45a)
[2024-04-03 19:58:11.744155 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(STATFS(14)) called at 2024-04-03 19:28:15.383425 +0000 (xid=0x8ac759b)
[2024-04-03 19:58:11.744264 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744288 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(STATFS(14)) called at 2024-04-03 19:28:15.388829 +0000 (xid=0x8ac759c)
[2024-04-03 19:58:11.744288 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(READ(12)) called at 2024-04-03 19:28:03.741119 +0000 (xid=0x392c45b)
[2024-04-03 19:58:11.744360 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744452 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(STATFS(14)) called at 2024-04-03 19:28:15.401962 +0000 (xid=0x8ac759d)
[2024-04-03 19:58:11.744600 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(WRITE(13)) called at 2024-04-03 19:28:03.799048 +0000 (xid=0x392c45c)
[2024-04-03 19:58:11.744601 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(FINODELK(30)) called at 2024-04-03 19:28:19.437836 +0000 (xid=0x8ac759e)
[2024-04-03 19:58:11.744630 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-4: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744665 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-2: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744697 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-3: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744730 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2024-04-03 19:28:21.116111 +0000 (xid=0x8ac759f)
[2024-04-03 19:58:11.744727 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744777 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744871 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(FINODELK(30)) called at 2024-04-03 19:28:26.973849 +0000 (xid=0x8ac75a0)
[2024-04-03 19:58:11.744882 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(WRITE(13)) called at 2024-04-03 19:28:03.799479 +0000 (xid=0x392c45d)
[2024-04-03 19:58:11.744897 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-4: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744958 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-3: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.745003 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x79
3a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/
libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2024-04-03 19:28:39.624805 +0000 (xid=0x8ac75a1)

The operating system / glusterfs version

On each node:

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

On server nodes:

glusterd --version
glusterfs 10.5
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.

On node4 (client):

glusterfs --version
glusterfs 10.5
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.

Franco-Sparrow avatar Apr 03 '24 21:04 Franco-Sparrow

This is a recurrent issue; we have faced very similar problems in the past. As the OP mentioned above, my team and I have upgraded from 6.x to 8.x, and now with 10.5 we face a very similar problem.

kCyborg avatar Apr 03 '24 21:04 kCyborg

Please share the full mount logs from the client machine where you observed this issue.
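
For reference, the FUSE mount log is normally written under /var/log/glusterfs/ on the client and named after the mount path, so it should be something like:

ls /var/log/glusterfs/
# e.g. a mount at /mnt/vol2 typically logs to /var/log/glusterfs/mnt-vol2.log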

aravindavk avatar Apr 10 '24 17:04 aravindavk

@aravindavk Hi Sir, thanks for your attention. Please check the following logs and let us know if there is something that can fix this issue. This problem keeps recurring for our client and is getting annoying.

gluster_mount_v10.5_vol2.zip

These are the logs from the client that had the issue.

Franco-Sparrow avatar Apr 10 '24 19:04 Franco-Sparrow

@aravindavk Hi Sir

May we have an update on this?

Franco-Sparrow avatar Apr 13 '24 05:04 Franco-Sparrow

@aravindavk Hi Sir

May we have a follow up on this?

Franco-Sparrow avatar May 05 '24 02:05 Franco-Sparrow

Any update?

Franco-Sparrow avatar Jun 03 '24 02:06 Franco-Sparrow

@aravindavk Hi Sir

I could just keep asking you forever. Could you please let me know if there is a solution for this issue, or whether one is on the roadmap? I faced the issue again last June 23rd.

Thanks in advance

Franco-Sparrow avatar Jun 26 '24 17:06 Franco-Sparrow

Hi,

unfortunately it happens to us too... just in June it happened twice... I can't see any clues as to why it happens (there is no heavy load, no lack of resources)... The only relevant information in /var/log/syslog is this:

2024-06-28T01:05:20.728589+02:00 s1-master mnt-glusterfs[440615]: pending frames:
2024-06-28T01:05:20.728718+02:00 s1-master mnt-glusterfs[440615]: frame : type(1) op(LK)
2024-06-28T01:05:20.728749+02:00 s1-master mnt-glusterfs[440615]: frame : type(1) op(LOOKUP)
2024-06-28T01:05:20.728770+02:00 s1-master mnt-glusterfs[440615]: frame : type(1) op(FSYNC)
2024-06-28T01:05:20.728790+02:00 s1-master mnt-glusterfs[440615]: patchset: git://git.gluster.org/glusterfs.git
2024-06-28T01:05:20.728811+02:00 s1-master mnt-glusterfs[440615]: signal received: 11
2024-06-28T01:05:20.728833+02:00 s1-master mnt-glusterfs[440615]: time of crash:
2024-06-28T01:05:20.729030+02:00 s1-master mnt-glusterfs[440615]: 2024-06-27 23:05:20 +0000
2024-06-28T01:05:20.729058+02:00 s1-master mnt-glusterfs[440615]: configuration details:
2024-06-28T01:05:20.729078+02:00 s1-master mnt-glusterfs[440615]: argp 1
2024-06-28T01:05:20.729105+02:00 s1-master mnt-glusterfs[440615]: backtrace 1
2024-06-28T01:05:20.729123+02:00 s1-master mnt-glusterfs[440615]: dlfcn 1
2024-06-28T01:05:20.729145+02:00 s1-master mnt-glusterfs[440615]: libpthread 1
2024-06-28T01:05:20.729169+02:00 s1-master mnt-glusterfs[440615]: llistxattr 1
2024-06-28T01:05:20.729191+02:00 s1-master mnt-glusterfs[440615]: setfsid 1
2024-06-28T01:05:20.729209+02:00 s1-master mnt-glusterfs[440615]: spinlock 1
2024-06-28T01:05:20.729228+02:00 s1-master mnt-glusterfs[440615]: epoll.h 1
2024-06-28T01:05:20.729247+02:00 s1-master mnt-glusterfs[440615]: xattr.h 1
2024-06-28T01:05:20.729265+02:00 s1-master mnt-glusterfs[440615]: st_atim.tv_nsec 1
2024-06-28T01:05:20.729285+02:00 s1-master mnt-glusterfs[440615]: package-string: glusterfs 9.2
2024-06-28T01:05:20.729303+02:00 s1-master mnt-glusterfs[440615]: ---------

I don't know why there is package-string "9.2", when glusterd --version returns "11.1"... it's weird...

I didn't find anything interesting in the other logs (I checked glusterd.log, glustershd.log, bricks/data.log)...

Any help will be appreciated...

Thank you.

Volume Name: myvolume
Type: Replicate
Volume ID: 69391668-ff10-4fef-b216-ea9a1b9dc760
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: instance1:/data
Brick2: instance2:/data
Brick3: instance3:/data (arbiter)
Options Reconfigured:
nfs.disable: on
cluster.lookup-optimize: off
cluster.self-heal-daemon: enable
performance.client-io-threads: off
transport.address-family: inet
storage.fips-mode-rchecksum: on
cluster.granular-entry-heal: on
glusterd --version
glusterfs 11.1
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.
gluster volume status
Status of volume: myvolume
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick instance1:/data                       54758     0          Y       57   
Brick instance2:/data                       53509     0          Y       57   
Brick instance3:/data                       55762     0          Y       57   
Self-heal Daemon on localhost               N/A       N/A        Y       89   
Self-heal Daemon on instance2               N/A       N/A        Y       91   
Self-heal Daemon on instance3               N/A       N/A        Y       91   
 
Task Status of Volume myvolume
------------------------------------------------------------------------------
There are no active volume tasks

thorewi avatar Jun 28 '24 07:06 thorewi

Hi Sir

I am so sorry that you are in the same situation as I am, but at the same time I am glad that you are participating in this "forgotten" issue. Let's wait and see if @aravindavk comes up with an answer to this issue. I have been waiting for one for months.

Franco-Sparrow avatar Jun 28 '24 17:06 Franco-Sparrow

@aravindavk Good day

May we have a follow up on this please?

Franco-Sparrow avatar Jul 23 '24 17:07 Franco-Sparrow

@Franco-Sparrow thanks for keeping up with this issue. We are also facing it without understanding its cause.

irigon avatar Jul 31 '24 10:07 irigon

Hi, just for completeness:

we are facing this issue every 2-3 weeks... we run glusterfs in docker

thorewi avatar Jul 31 '24 11:07 thorewi

Guys, I just gave up on gluster; it is a toy compared with Ceph. My new clouds are using Ceph, but this specific one is still using gluster, and every 3 months it hits the same issue and we still don't know anything, @aravindavk.

I will keep this issue open as long as it is not patched and no solution has been shared with the community. It is not just me now.

Franco-Sparrow avatar Jul 31 '24 16:07 Franco-Sparrow

You are getting the "Transport endpoint is not connected" error because the client is not able to get a ping response from the brick process within the threshold time limit.

[2024-04-03 19:27:59.896047 +0000] I [MSGID: 133022] [shard.c:3834:shard_delete_shards] 0-vol2-shard: Deleted shards of gfid=28bae458-fe17-437f-a21c-0bab09eaec57 from backend
[2024-04-03 19:28:46.941547 +0000] C [rpc-clnt-ping.c:152:rpc_clnt_ping_timer_expired] 0-vol2-client-2: server 192.168.21.21:60464 has not responded in the last 42 seconds, disconnecting.
[2024-04-03 19:28:50.485099 +0000] C [rpc-clnt-ping.c:152:rpc_clnt_ping_timer_expired] 0-vol2-client-3: server 192.168.21.23:54439 has not responded in the last 42 seconds, disconnecting.
[2024-04-03 19:29:01.088108 +0000] C [rpc-clnt-ping.c:152:rpc_clnt_ping_timer_expired] 0-vol2-client-1: server 192.168.21.22:56493 has not responded in the last 42 seconds, disconnecting.
[2024-04-03 19:29:03.116132 +0000] C [rpc-clnt-ping.c:152:rpc_clnt_ping_timer_expired] 0-vol2-client-4: server 192.168.21.22:54583 has not responded in the last 42 seconds, disconnecting.
[2024-04-03 19:29:08.054892 +0000] C [rpc-clnt-ping.c:152:rpc_clnt_ping_timer_expired] 0-vol2-client-0: server 192.168.21.21:51756 has not responded in the last 42 seconds, disconnecting.
[2024-04-03 19:29:23.878018 +0000] C [rpc-clnt-ping.c:152:rpc_clnt_ping_timer_expired] 0-vol2-client-5: server 192.168.21.23:49841 has not responded in the last 42 seconds, disconnecting.

From the logs you shared in the description, the client is throwing call_bail messages; here the maximum threshold limit is 1800s. It means the client was not able to get a response within 1800s, which is very high.

[2024-04-03 19:58:11.743530 +0000] E [rpc-clnt.c:172:call_bail] 0-vol2-client-4: bailing out frame type(GlusterFS 4.x v1), op(FGETXATTR(35)), xid = 0x8ac758e, unique = 445798451, sent = 2024-04-03 19:28:03.739203 +0000, timeout = 1800 for 192.168.21.22:54583
[2024-04-03 19:58:11.743589 +0000] E [rpc-clnt.c:172:call_bail] 0-vol2-client-4: bailing out frame type(GlusterFS 4.x v1), op(FGETXATTR(35)), xid = 0x8ac758d, unique = 445798450, sent = 2024-04-03 19:28:03.739161 +0000, timeout = 1800 for 192.168.21.22:54583
[2024-04-03 19:58:11.743613 +0000] E [rpc-clnt.c:172:call_bail] 0-vol2-client-4: bailing out frame type(GlusterFS 4.x v1), op(WRITE(13)), xid = 0x8ac758c, unique = 445798448, sent = 2024-04-03 19:28:03.739078 +0000, timeout = 1800 for 192.168.21.22:54583
[2024-04-03 19:58:11.743664 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-4: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743684 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743858 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743888 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]

It means that on the server side the brick process is busy or facing some kind of deadlock, or there is a network issue between client and server; that is why the client is not getting a response. Are you sure there is no network issue? I think you should also check with your network team. If you are still facing a similar issue, kindly share the client logs along with the brick logs.
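
Both timeouts referenced above are per-volume tunables and can be checked or adjusted like this (volume name as in this issue; values shown are the defaults):

gluster volume get vol2 network.ping-timeout    # default is 42 seconds
gluster volume get vol2 network.frame-timeout   # default is 1800 seconds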

mohit84 avatar Aug 01 '24 01:08 mohit84

Hi, Gluster team. We have this particular environment where we use the same server as a client to mount the shared filesystem. There is no issue on the network side because the server is accessible even during the whole mount-point freeze.

Env: 3 GlusterFS nodes (each both server and client), and only one gets frozen, but glusterfsd and the bricks seem to be fine; there are no entries in healing.

We didn't find the cause, but we always start with a couple of entries to heal, and the freeze happens after all entries get healed.

nachowork90 avatar Aug 01 '24 02:08 nachowork90

Hi @mohit84, thanks for your reply. @nachowork90 works on my team; it is just as he said. Please read his answer. There is no logical explanation for a deadlock on the brick, or for the timeout you describe... the client and server are on the same host; it is an HCI environment.

Franco-Sparrow avatar Aug 01 '24 03:08 Franco-Sparrow

Hi, Gluster team. We have this particular environment where we use the same server as a client to mount the shared filesystem. There is no issue on the network side because the server is accessible even during the whole mount-point freeze.

Env: 3 glusterfs (server and client), and only one gets frozen. but glusterfsd and bricks seem to be fine; there are no entries in healing.

We didn't find the cause, but we always start with a couple of entries to heal, and the freeze happens after all entries get healed.

As we can see, the logs are throwing ping-timer-expired messages, which means the client is not getting a response to a ping request within the threshold time limit (42s). If you are still getting these messages before "transport endpoint is not connected", then the reason is clear: either there is a bottleneck at the brick level or at the network level, otherwise it would not happen. You can generate a stack trace of the client and server processes while you are facing this kind of issue.
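
One way to capture that state while the hang is in progress is roughly the following (a sketch; PIDs and paths will differ on your systems):

# statedump of all brick processes of the volume (written under /var/run/gluster by default)
gluster volume statedump vol2

# statedump of the FUSE client: send SIGUSR1 to the glusterfs mount process
kill -USR1 $(pgrep -f 'glusterfs.*vol2')

# full thread backtrace of a stuck brick process with gdb
gdb -p <glusterfsd-pid> -batch -ex 'thread apply all bt'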

mohit84 avatar Aug 01 '24 03:08 mohit84

Hi @mohit84. By the time we notice the server is in trouble, we are able to ping the host. The network does not seem to be the bottleneck. Host resources are not overloaded, and even so we stop the backup service running on that network. The lock on the brick is not removed even when the host is pingable. If we have to restart the glusterd service every time to release this lock, that is not a feasible solution when you have running VMs using gluster as the storage backend. The lock should be removed automatically, and the "Transport endpoint is not connected" error should disappear without restarting glusterd or rebooting the node.
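
In principle a stale lock can be inspected and released without restarting glusterd, along these lines (a sketch; the path and lock kind are illustrative, and this should be used with care on a live volume):

# locate the lock in a brick statedump, then release it
grep -A3 inodelk /var/run/gluster/*.dump.*
gluster volume clear-locks vol2 /path/inside/volume kind all inode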

We are not the only ones facing this issue; I would ask @thorewi and @irigon to share their logs too.

Franco-Sparrow avatar Aug 02 '24 14:08 Franco-Sparrow

Hi,

it seems our problem is different - in our logs there is SIGSEGV:

2024-06-28T01:05:20.728811+02:00 s1-master mnt-glusterfs[440615]: signal received: 11

maybe this https://github.com/gluster/glusterfs/pull/4384 would solve our problems, but it's not released yet...

I guess I can't provide more logs without running glusterfs in some debug mode, with core dumps saved or something like that...
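
From what I can tell, that would mean roughly the following (a sketch; our volume is myvolume):

# more verbose client-side logging
gluster volume set myvolume diagnostics.client-log-level DEBUG

# make sure core dumps are actually kept when the mount process receives SIGSEGV
ulimit -c unlimited
cat /proc/sys/kernel/core_pattern   # check where cores end up (systemd-coredump, apport, ...)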

thorewi avatar Aug 02 '24 14:08 thorewi

@aravindavk Hi again

Our client had another event this past midnight. Same error. When are we going to have a solution for this issue?

Franco-Sparrow avatar Sep 28 '24 14:09 Franco-Sparrow

We have the same thing:

Transport endpoint is not connected

This is a new system we're building and every few days a client fails to connect and we get a complete outage until we manually remount.

I can understand there being an intermittent connection issue (whether it's an internal or external issue) but shouldn't it recover? Other clients are still connected and working as expected.

Client logs:

[2024-12-15 22:24:35.335514 +0000] W [fuse-bridge.c:308:check_and_dump_fuse_W] 0-glusterfs-fuse: writing to fuse device yielded ENOENT 256 times
[2024-12-15 22:44:35.033134 +0000] W [fuse-bridge.c:308:check_and_dump_fuse_W] 0-glusterfs-fuse: writing to fuse device yielded ENOENT 256 times
[2024-12-15 23:02:58.478565 +0000] W [fuse-bridge.c:308:check_and_dump_fuse_W] 0-glusterfs-fuse: writing to fuse device yielded ENOENT 256 times
[2024-12-15 23:10:00.299921 +0000] C [rpc-clnt-ping.c:152:rpc_clnt_ping_timer_expired] 0-librenms-glusterfs-client-0: server x:54818 has not responded in the last 42 seconds, disconnecting.
[2024-12-15 23:10:00.303761 +0000] C [rpc-clnt-ping.c:152:rpc_clnt_ping_timer_expired] 0-librenms-glusterfs-client-1: server x:50541 has not responded in the last 42 seconds, disconnecting.
[2024-12-15 23:10:00.336200 +0000] C [rpc-clnt-ping.c:152:rpc_clnt_ping_timer_expired] 0-librenms-glusterfs-client-2: server x:49917 has not responded in the last 42 seconds, disconnecting.
[2024-12-15 23:10:01.843203 +0000] I [socket.c:3802:socket_submit_outgoing_msg] 0-librenms-glusterfs-client-0: not connected (priv->connected = -1)
[2024-12-15 23:10:01.843229 +0000] W [rpc-clnt.c:1709:rpc_clnt_submit] 0-librenms-glusterfs-client-0: failed to submit rpc-request (unique: 1042251796, XID: 0x26dd84de Program: GlusterFS 4.x v1, ProgVers: 400, Proc: 27) to rpc-transport (librenms-glusterfs-client-0)
[2024-12-15 23:10:01.843256 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2618:client4_0_lookup_cbk] 0-librenms-glusterfs-client-0: remote operation failed. [{path=/}, {gfid=00000000-0000-0000-0000-000000000001}, {errno=107}, {error=Transport endpoint is not connected}]
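
For the record, the manual recovery is just a lazy unmount and remount of the stuck FUSE mount, roughly like this (the mount point and server name here are placeholders):

umount -l /mnt/librenms-glusterfs
mount -t glusterfs server1:/librenms-glusterfs /mnt/librenms-glusterfs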

domleb avatar Dec 16 '24 14:12 domleb

@domleb - do you see anything on the same timeframe on the server side?

mykaul avatar Dec 16 '24 16:12 mykaul

Here are some server logs for the same time:

[2024-12-14 04:47:10.336641 +0000] I [addr.c:52:compare_addr_and_update] 0-/data/brick1/rrd: allowed = "*", received addr = "x"
[2024-12-14 04:47:10.337470 +0000] I [MSGID: 115029] [server-handshake.c:565:server_setvolume] 0-librenms-glusterfs-server: accepted client from CTX_ID:40839367-dc5e-4be4-a2ae-41d5ff42bbfb-GRAPH_ID:0-PID:2043541-HOST:x-PC_NAME:librenms-glusterfs-client-0-RECON_NO:-2 (version: 10.5) with subvol /data/brick1/rrd
[2024-12-15 23:10:00.300234 +0000] W [socket.c:749:__socket_rwv] 0-tcp.librenms-glusterfs-server: readv on x:49143 failed (No data available)
[2024-12-15 23:10:00.301477 +0000] I [MSGID: 115036] [server.c:500:server_rpc_notify] 0-librenms-glusterfs-server: disconnecting connection [{client-uid=CTX_ID:6d1785c7-d174-4bde-98df-6fa04e5b9d6b-GRAPH_ID:0-PID:2046128-HOST:x-PC_NAME:librenms-glusterfs-client-0-RECON_NO:-0}]
[2024-12-15 23:10:00.304961 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-librenms-glusterfs-server: releasing lock on fe2cb290-396c-41c4-8c4e-059f18e1f401 held by {client=0x7fac1c08f858, pid=0 lk-owner=986c14d9fc7e0000}
[2024-12-15 23:10:00.304976 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-librenms-glusterfs-server: releasing lock on 4800aef7-2329-4e66-8df4-3c81c7fb5168 held by {client=0x7fac1c08f858, pid=0 lk-owner=38ca5bd8fc7e0000}
[2024-12-15 23:10:00.304983 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-librenms-glusterfs-server: releasing lock on 94aeeb94-bac2-4b6b-b601-161a90ff760f held by {client=0x7fac1c08f858, pid=0 lk-owner=d8290fd8fc7e0000}
[2024-12-15 23:10:00.304989 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-librenms-glusterfs-server: releasing lock on a7dafc0b-2f44-44b2-ac18-2bddbd696028 held by {client=0x7fac1c08f858, pid=0 lk-owner=880e38d8fc7e0000}
[2024-12-15 23:10:00.304994 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-librenms-glusterfs-server: releasing lock on 0b59307c-6aec-4ca9-a140-117987e64eee held by {client=0x7fac1c08f858, pid=0 lk-owner=08721cd9fc7e0000}
[2024-12-15 23:10:00.305000 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-librenms-glusterfs-server: releasing lock on f98e5a34-cebf-4a78-b910-d852b4326678 held by {client=0x7fac1c08f858, pid=0 lk-owner=388f23d8fc7e0000}
[2024-12-15 23:10:00.305027 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/sensor-temperature-junos-jnxPMCurTemperature.579.rrd}]
[2024-12-15 23:10:00.305044 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/port-id9737.rrd}]
[2024-12-15 23:10:00.305057 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/port-id8954.rrd}]
[2024-12-15 23:10:00.305081 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/port-id9400.rrd}]
[2024-12-15 23:10:00.305095 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/poller-perf-cisco-qfp.rrd}]
[2024-12-15 23:10:00.305115 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/sensor-voltage-entity-sensor-100006112.rrd}]
[2024-12-15 23:10:00.305129 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/processor-junos-7.1.0.0.rrd}]
[2024-12-15 23:10:00.305146 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/port-id9739.rrd}]
[2024-12-15 23:10:00.305168 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/poller-perf-applications.rrd}]
[2024-12-15 23:10:00.305183 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/poller-perf-applications.rrd}]
[2024-12-15 23:10:00.305198 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/poller-perf-entity-physical.rrd}]
[2024-12-15 23:10:00.306046 +0000] I [MSGID: 101055] [client_t.c:396:gf_client_unref] 0-librenms-glusterfs-server: Shutting down connection CTX_ID:6d1785c7-d174-4bde-98df-6fa04e5b9d6b-GRAPH_ID:0-PID:2046128-HOST:x-PC_NAME:librenms-glusterfs-client-0-RECON_NO:-0
[2024-12-15 23:16:23.788219 +0000] W [socket.c:749:__socket_rwv] 0-tcp.librenms-glusterfs-server: readv on x:49139 failed (No data available)
[2024-12-15 23:16:23.788307 +0000] I [MSGID: 115036] [server.c:500:server_rpc_notify] 0-librenms-glusterfs-server: disconnecting connection [{client-uid=CTX_ID:40839367-dc5e-4be4-a2ae-41d5ff42bbfb-GRAPH_ID:0-PID:2043541-HOST:x-PC_NAME:librenms-glusterfs-client-0-RECON_NO:-2}]
[2024-12-15 23:16:23.788357 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-librenms-glusterfs-server: releasing lock on d798176c-cae4-4a1d-8baa-9b903aa083c2 held by {client=0x7fac240addc8, pid=0 lk-owner=68677cf0267f0000}
[2024-12-15 23:16:23.788364 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-librenms-glusterfs-server: releasing lock on 8d91c14c-65ae-41e8-8e2b-c45136115057 held by {client=0x7fac240addc8, pid=0 lk-owner=3882b8f0267f0000}
[2024-12-15 23:16:23.788370 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-librenms-glusterfs-server: releasing lock on 411f579d-6bfa-4382-bdc1-28c00cb34d09 held by {client=0x7fac240addc8, pid=0 lk-owner=987b74f0267f0000}
[2024-12-15 23:16:23.788390 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/poller-perf-applications.rrd}]
[2024-12-15 23:16:23.788418 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/port-id9081.rrd}]
[2024-12-15 23:16:23.788597 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/sensor-temperature-junos-jnxPMCurTemperature.525.rrd}]
[2024-12-15 23:16:23.788642 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/poller-perf-ucd-diskio.rrd}]
[2024-12-15 23:16:23.788658 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/poller-perf-wireless.rrd}]
[2024-12-15 23:16:23.788680 +0000] I [MSGID: 115013] [server-helpers.c:311:do_fd_cleanup] 0-librenms-glusterfs-server: fd cleanup [{path=x/sensor-temperature-junos-jnxPMCurTemperature.562.rrd}]
[2024-12-15 23:16:23.789687 +0000] I [MSGID: 101055] [client_t.c:396:gf_client_unref] 0-librenms-glusterfs-server: Shutting down connection CTX_ID:40839367-dc5e-4be4-a2ae-41d5ff42bbfb-GRAPH_ID:0-PID:2043541-HOST:x-PC_NAME:librenms-glusterfs-client-0-RECON_NO:-2
[2024-12-15 23:45:45.119655 +0000] I [addr.c:52:compare_addr_and_update] 0-/data/brick1/rrd: allowed = "*", received addr = "x"
[2024-12-15 23:45:45.119940 +0000] I [MSGID: 115029] [server-handshake.c:565:server_setvolume] 0-librenms-glusterfs-server: accepted client from CTX_ID:40839367-dc5e-4be4-a2ae-41d5ff42bbfb-GRAPH_ID:0-PID:2043541-HOST:x-PC_NAME:librenms-glusterfs-client-0-RECON_NO:-3 (version: 10.5) with subvol /data/brick1/rrd

domleb avatar Dec 16 '24 17:12 domleb

It's a bit confusing, but the reason I suspect we see no data is this snippet: https://github.com/gluster/glusterfs/blob/7663679e0b2d69ea954d59a42500f5beed001b3e/rpc/rpc-transport/socket/src/socket.c#L722-L729

It makes little sense to me. EOF should not be closing the connection, IMHO. Here's an example from Golang - zero read doesn't return EOF - https://github.com/golang/go/commit/5bcdd639331cd7f8d844fd38a674c4751423f938

So for some reason the server reads zero bytes from the client, and then it decides to close the connection. That's unfortunate. It doesn't explain why rpc_clnt_ping_timer_expired() on the client, though? Did they really disconnect regardless?

mykaul avatar Dec 17 '24 08:12 mykaul

Hi all.

It is really nice that more users are becoming aware of this issue. My solution was to migrate to Ceph. It is sad, but we gave up on this SDS solution, which is not ready for production environments like critical cloud infrastructure. This is my advice for all of you facing the same issue... try to move to another SDS that does not use a shared FS.

This issue has been here for months and they don't care or don't have time to fix it. Use Gluster just for development purposes.

Franco-Sparrow avatar Dec 17 '24 17:12 Franco-Sparrow

It doesn't explain why rpc_clnt_ping_timer_expired() on the client, though? Did they really disconnect regardless?

I don't have answers other than the mount remains failed until we remount it.

I also don't know the design around the client-to-server connection, but I would expect it to self-heal. If for some reason the connection is broken, then the clients should keep attempting to reconnect.

My solution was migrate to Ceph

We have now started looking at Ceph.

domleb avatar Dec 18 '24 15:12 domleb

It doesn't explain why rpc_clnt_ping_timer_expired() on the client, though? Did they really disconnect regardless?

I don't have answers other than the mount remains failed until we remount it.

@domleb - can you share a few lines of logs (100 or so) from the server side, prior to the disconnect? Perhaps the answer is there.

mykaul avatar Dec 18 '24 15:12 mykaul

can you share a few lines of logs (100 or so) from the server side, prior to the disconnect? Perhaps the answer is there.

Nothing has been removed in the log above, and looking at the timestamps, the log entry immediately before is from ~18 hours prior, so it is unrelated:

[2024-12-14 04:47:10.337470 +0000] I [MSGID: 115029] [server-handshake.c:565:server_setvolume] 0-librenms-glusterfs-server: accepted client from CTX_ID:40839367-dc5e-4be4-a2ae-41d5ff42bbfb-GRAPH_ID:0-PID:2043541-HOST:x-PC_NAME:librenms-glusterfs-client-0-RECON_NO:-2 (version: 10.5) with subvol /data/brick1/rrd
[2024-12-15 23:10:00.300234 +0000] W [socket.c:749:__socket_rwv] 0-tcp.librenms-glusterfs-server: readv on x:49143 failed (No data available)
[2024-12-15 23:10:00.301477 +0000] I [MSGID: 115036] [server.c:500:server_rpc_notify] 0-librenms-glusterfs-server: disconnecting connection [{client-uid=CTX_ID:6d1785c7-d174-4bde-98df-6fa04e5b9d6b-GRAPH_ID:0-PID:2046128-HOST:x-PC_NAME:librenms-glusterfs-client-0-RECON_NO:-0}]

domleb avatar Dec 18 '24 16:12 domleb

I don't see anything on the RPC ping that can explain this, but I'm also unsure if the code I've pasted above (https://github.com/gluster/glusterfs/issues/4330#issuecomment-2547843535 ) is correct either. I'd raise the RPC ping timeout, at the very least, to very high values.
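
Raising it is a one-line volume option, for example (volume name is a placeholder and the value is arbitrary; the default is 42 seconds):

gluster volume set <volname> network.ping-timeout 120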

mykaul avatar Dec 19 '24 08:12 mykaul

Is it correct that the issue didn't exist in 8.x but started happening from 10.x? The logs are saying calls are bailing; if this never happened with 8.x, and assuming there is a bug in 10.x, this can happen only when there is a frame loss or a bug that leads to a deadlock on the brick side. When this issue happens again, will someone be able to send out the brick logs? This may take a while to debug, but it would help to start the debugging, as the issue only happens once in a while.

pranithk avatar Jan 22 '25 00:01 pranithk