Quotad is crashing
Description of problem:
Quotad is crashing:
```
gluster volume status write-cache
Status of volume: write-cache
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick node1:/brick_nvme3n1                  52701     0          Y       34269
Brick node2:/brick_nvme3n1                  58104     0          Y       21606
Brick node3:/brick_nvme3n1                  58586     0          Y       20830
Brick node1:/brick_nvme2n1                  51605     0          Y       34284
Brick node2:/brick_nvme2n1                  52717     0          Y       21621
Brick node3:/brick_nvme2n1                  53454     0          Y       20845
Self-heal Daemon on localhost               N/A       N/A        Y       34301
Quota Daemon on localhost                   N/A       N/A        N       N/A
Self-heal Daemon on node2                   N/A       N/A        Y       21643
Quota Daemon on node2                       N/A       N/A        N       N/A
Self-heal Daemon on node3                   N/A       N/A        Y       20867
Quota Daemon on node3                       N/A       N/A        Y       21233
```
The exact command to reproduce the issue: I am not sure what exactly triggers the crash; I am merely writing files into a new Gluster volume, and quotad crashes very frequently.
Mandatory info:
- The output of the gluster volume info command:
```
gluster volume info write-cache

Volume Name: write-cache
Type: Distributed-Replicate
Volume ID: 6779f8a6-e3e6-4666-89b9-f64ec9e883d2
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: node1:/brick_nvme3n1
Brick2: node2:/brick_nvme3n1
Brick3: node3:/brick_nvme3n1
Brick4: node1:/brick_nvme2n1
Brick5: node2:/brick_nvme2n1
Brick6: node3:/brick_nvme2n1
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.write-behind: off
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
```
- The output of the gluster volume status command:
```
gluster volume status write-cache
Status of volume: write-cache
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick node1:/brick_nvme3n1                  52701     0          Y       34269
Brick node2:/brick_nvme3n1                  58104     0          Y       21606
Brick node3:/brick_nvme3n1                  58586     0          Y       20830
Brick node1:/brick_nvme2n1                  51605     0          Y       34284
Brick node2:/brick_nvme2n1                  52717     0          Y       21621
Brick node3:/brick_nvme2n1                  53454     0          Y       20845
Self-heal Daemon on localhost               N/A       N/A        Y       34301
Quota Daemon on localhost                   N/A       N/A        N       N/A
Self-heal Daemon on node2                   N/A       N/A        Y       21643
Quota Daemon on node2                       N/A       N/A        N       N/A
Self-heal Daemon on node3                   N/A       N/A        Y       20867
Quota Daemon on node3                       N/A       N/A        Y       21233
```
- The output of the gluster volume heal command:
```
gluster volume heal write-cache info
Brick node1:/brick_nvme3n1
Status: Connected
Number of entries: 0

Brick node2:/brick_nvme3n1
Status: Connected
Number of entries: 0

Brick node3:/brick_nvme3n1
Status: Connected
Number of entries: 0

Brick node1:/brick_nvme2n1
Status: Connected
Number of entries: 0

Brick node2:/brick_nvme2n1
Status: Connected
Number of entries: 0

Brick node3:/brick_nvme2n1
Status: Connected
Number of entries: 0
```
**- Provide logs present on following locations of client and server nodes:** Logs are attached.
**- Is there any crash? Provide the backtrace and coredump:** Coredump is attached.
```
signal received: 11
time of crash:
2023-04-03 07:23:02 +0000
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 11.0
/lib64/libglusterfs.so.0(+0x28cf4)[0x7fe7d78cccf4]
/lib64/libglusterfs.so.0(gf_print_trace+0x2f8)[0x7fe7d78d36b8]
/lib64/libc.so.6(+0x4eb80)[0x7fe7d540ab80]
/lib64/libtcmalloc_minimal.so.4(__libc_calloc+0x73)[0x7fe7d57a1cd3]
/lib64/libglusterfs.so.0(__gf_calloc+0x50)[0x7fe7d78f0440]
/lib64/libglusterfs.so.0(+0x36141)[0x7fe7d78da141]
/lib64/libglusterfs.so.0(inode_new+0x1b)[0x7fe7d78db85b]
/usr/lib64/glusterfs/11.0/xlator/features/quotad.so(+0x253e)[0x7fe7c100153e]
/usr/lib64/glusterfs/11.0/xlator/features/quotad.so(+0x3a72)[0x7fe7c1002a72]
/lib64/libgfrpc.so.0(+0x94f2)[0x7fe7d76734f2]
/lib64/libgfrpc.so.0(+0x9b0e)[0x7fe7d7673b0e]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x2b)[0x7fe7d76752ab]
/usr/lib64/glusterfs/11.0/rpc-transport/socket.so(+0x47dc)[0x7fe7c307e7dc]
/usr/lib64/glusterfs/11.0/rpc-transport/socket.so(+0xb9ac)[0x7fe7c30859ac]
/lib64/libglusterfs.so.0(+0x7fbbd)[0x7fe7d7923bbd]
/lib64/libpthread.so.0(+0x81cf)[0x7fe7d60501cf]
/lib64/libc.so.6(clone+0x43)[0x7fe7d53f5e73]
```
Additional info: On these systems I am using an external Gluster cluster in the Kadalu native way, and I use GlusterFS directory quota to set capacity limits for the external Gluster volumes.
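For reference, the capacity limits are set with the standard GlusterFS directory-quota CLI, along these lines (the path and size below are placeholders, not the exact values used on this cluster):

```sh
# Quota is already enabled on the volume (see "features.quota: on" above)
gluster volume quota write-cache enable

# Set a capacity limit on a directory inside the volume
# (placeholder path and size, for illustration only)
gluster volume quota write-cache limit-usage /subvol/example-dir 10GB

# Show configured limits and current usage
gluster volume quota write-cache list
```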
- The operating system / glusterfs version:
```
[root@node1 crash]# gluster --version
glusterfs 11.0
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. https://www.gluster.org/
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3 or later),
or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.

[root@node1 crash]# cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.7 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.7"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.7 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.7"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.7"
```
There are also a lot of errors/warnings like the following in the brick logs:
```
[2023-04-03 07:48:02.742138 +0000] W [rpc-clnt.c:1628:rpc_clnt_submit] 0-write-cache-quota: error returned while attempting to connect to host:(null), port:0
[2023-04-03 07:48:07.898428 +0000] W [rpc-clnt.c:1628:rpc_clnt_submit] 0-write-cache-quota: error returned while attempting to connect to host:(null), port:0
The message "W [MSGID: 120022] [quota-enforcer-client.c:221:quota_enforcer_lookup_cbk] 0-write-cache-quota: Getting cluster-wide size of directory failed (path: /subvol/ea/e4/pvc-956a0eab-6a54-4157-b556-f7a548d6ce2c gfid:efdf67dd-521c-4a66-8999-ea7a92606604) [Transport endpoint is not connected]" repeated 22 times between [2023-04-03 07:46:15.076934 +0000] and [2023-04-03 07:48:07.898457 +0000]
The message "W [MSGID: 120023] [quota-enforcer-client.c:300:_quota_enforcer_lookup] 0-write-cache-quota: Couldn't send the request to fetch cluster wide size of directory (path:/subvol/ea/e4/pvc-956a0eab-6a54-4157-b556-f7a548d6ce2c gfid:<EF><DF>g<DD>R^\Jf<89><99><EA>z<92>f^D)" repeated 22 times between [2023-04-03 07:46:15.076962 +0000] and [2023-04-03 07:48:07.898467 +0000]
The message "E [MSGID: 113001] [posix-helpers.c:1255:posix_handle_pair] 0-write-cache-posix: /brick_nvme2n1/subvol/ea/e4/pvc-956a0eab-6a54-4157-b556-f7a548d6ce2c: key:glusterfs.quota.total-usage flags: 0 length:11 [Operation not supported]" repeated 22 times between [2023-04-03 07:46:15.079201 +0000] and [2023-04-03 07:48:07.900798 +0000]
[2023-04-03 07:48:13.020291 +0000] W [rpc-clnt.c:1628:rpc_clnt_submit] 0-write-cache-quota: error returned while attempting to connect to host:(null), port:0
[2023-04-03 07:48:13.020311 +0000] W [MSGID: 120022] [quota-enforcer-client.c:221:quota_enforcer_lookup_cbk] 0-write-cache-quota: Getting cluster-wide size of directory failed (path: /subvol/ea/e4/pvc-956a0eab-6a54-4157-b556-f7a548d6ce2c gfid:efdf67dd-521c-4a66-8999-ea7a92606604) [Transport endpoint is not connected]
[2023-04-03 07:48:13.020329 +0000] W [MSGID: 120023] [quota-enforcer-client.c:300:_quota_enforcer_lookup] 0-write-cache-quota: Couldn't send the request to fetch cluster wide size of directory (path:/subvol/ea/e4/pvc-956a0eab-6a54-4157-b556-f7a548d6ce2c gfid:<EF><DF>g<DD>R^\Jf<89><99><EA>z<92>f^D)
[2023-04-03 07:48:13.023300 +0000] E [MSGID: 113001] [posix-helpers.c:1255:posix_handle_pair] 0-write-cache-posix: /brick_nvme2n1/subvol/ea/e4/pvc-956a0eab-6a54-4157-b556-f7a548d6ce2c: key:glusterfs.quota.total-usage flags: 0 length:11 [Operation not supported]
```
@amarts Do we support both quota features (the older quota and simple quota)? It seems quotad is crashing while populating the simple-quota xattrs (`glusterfs.quota.total-usage`).
@handrea2009 Can you please share the output of `thread apply all bt full` after attaching the coredump to gdb in your environment?
On the glusterd side, we can disable simple-quota. The two should not be used together.
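To check which quota mechanism has written xattrs on a directory, the extended attributes can be inspected directly on a brick. A minimal sketch (the directory path is a placeholder, and the legacy-quota key names are assumptions based on what the older quota feature usually stores; `glusterfs.quota.total-usage` is the key from the brick logs above):

```sh
# Run on a brick host; dumps all xattrs of the quota-limited directory
# as stored on the brick. Replace the path with a real directory.
getfattr -d -m . -e hex /brick_nvme2n1/subvol/example-dir

# Legacy quota usually stores keys such as trusted.glusterfs.quota.size and
# trusted.glusterfs.quota.limit-set (assumption), while the failing key in
# the brick logs above is glusterfs.quota.total-usage (simple quota).
```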
```
[root@node1 ~]# gdb /usr/sbin/glusterfsd /var/crash/core-34763
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-19.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from .gnu_debugdata for /usr/sbin/glusterfsd...(no debugging symbols found)...done.
(no debugging symbols found)...done.

warning: Can't open file (null) during file-backed mapping note processing

warning: Can't open file (null) during file-backed mapping note processing

warning: Can't open file (null) during file-backed mapping note processing

warning: Can't open file (null) during file-backed mapping note processing

warning: core file may not match specified executable file.
[New LWP 34769]
[New LWP 34770]
[New LWP 34768]
[New LWP 34764]
[New LWP 34763]
[New LWP 34767]
[New LWP 34765]
[New LWP 34766]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/quotad -p /var/run/gluste'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fe7d57a1cd3 in tc_calloc () from /lib64/libtcmalloc_minimal.so.4
[Current thread is 1 (Thread 0x7fe7c29c7700 (LWP 34769))]
Missing separate debuginfos, use: yum debuginfo-install glusterfs-fuse-11.0-1.el8s.x86_64
(gdb)
(gdb) thread apply all bt full

Thread 8 (Thread 0x7fe7c4a9b700 (LWP 34766)):
#0  0x00007fe7d60567aa in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fe7d7904355 in syncenv_task () from /lib64/libglusterfs.so.0
No symbol table info available.
#2  0x00007fe7d7904700 in syncenv_processor () from /lib64/libglusterfs.so.0
No symbol table info available.
#3  0x00007fe7d60501cf in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007fe7d53f5e73 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 7 (Thread 0x7fe7c551c700 (LWP 34765)):
#0  0x00007fe7d540b8dc in sigtimedwait () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fe7d605a87c in sigwait () from /lib64/libpthread.so.0
No symbol table info available.
#2  0x0000559ad592ff83 in glusterfs_sigwaiter ()
No symbol table info available.
#3  0x00007fe7d60501cf in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007fe7d53f5e73 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 6 (Thread 0x7fe7c429a700 (LWP 34767)):
#0  0x00007fe7d60567aa in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fe7d7904355 in syncenv_task () from /lib64/libglusterfs.so.0
No symbol table info available.
#2  0x00007fe7d7904700 in syncenv_processor () from /lib64/libglusterfs.so.0
No symbol table info available.
#3  0x00007fe7d60501cf in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007fe7d53f5e73 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 5 (Thread 0x7fe7d7dc76c0 (LWP 34763)):
#0  0x00007fe7d60516cd in __pthread_timedjoin_ex () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fe7d7923237 in event_dispatch_epoll () from /lib64/libglusterfs.so.0
No symbol table info available.
#2  0x00007fe7d7940625 in gf_io_run () from /lib64/libglusterfs.so.0
No symbol table info available.
#3  0x0000559ad592c3b0 in main ()
No symbol table info available.

Thread 4 (Thread 0x7fe7c5d1d700 (LWP 34764)):
#0  0x00007fe7d6056848 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
--Type <RET> for more, q to quit, c to continue without paging--c
No symbol table info available.
#1  0x00007fe7d78d9a01 in gf_timer_proc () from /lib64/libglusterfs.so.0
No symbol table info available.
#2  0x00007fe7d60501cf in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007fe7d53f5e73 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x7fe7c3a99700 (LWP 34768)):
#0  0x00007fe7d54e381f in select () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fe7d793982a in runner () from /lib64/libglusterfs.so.0
No symbol table info available.
#2  0x00007fe7d60501cf in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007fe7d53f5e73 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x7fe7c21c6700 (LWP 34770)):
#0  0x00007fe7d7247320 in xdr_bytes () from /lib64/libtirpc.so.3
No symbol table info available.
#1  0x00007fe7d7460211 in xdr_gfx_dict_pair () from /lib64/libgfxdr.so.0
No symbol table info available.
#2  0x00007fe7d72485cf in xdr_array () from /lib64/libtirpc.so.3
No symbol table info available.
#3  0x00007fe7d74606da in xdr_gfx_dict () from /lib64/libgfxdr.so.0
No symbol table info available.
#4  0x00007fe7d7248f67 in xdr_sizeof () from /lib64/libtirpc.so.3
No symbol table info available.
#5  0x00007fe7c17a18c1 in dict_to_xdr () from /usr/lib64/glusterfs/11.0/xlator/protocol/client.so
No symbol table info available.
#6  0x00007fe7c17a8d2c in client_pre_lookup_v2 () from /usr/lib64/glusterfs/11.0/xlator/protocol/client.so
No symbol table info available.
#7  0x00007fe7c178eead in client4_0_lookup () from /usr/lib64/glusterfs/11.0/xlator/protocol/client.so
No symbol table info available.
#8  0x00007fe7c17753a1 in client_lookup () from /usr/lib64/glusterfs/11.0/xlator/protocol/client.so
No symbol table info available.
#9  0x00007fe7c15332df in afr_discover_do () from /usr/lib64/glusterfs/11.0/xlator/cluster/replicate.so
No symbol table info available.
#10 0x00007fe7c1533943 in afr_discover () from /usr/lib64/glusterfs/11.0/xlator/cluster/replicate.so
No symbol table info available.
#11 0x00007fe7c153c6f4 in afr_lookup () from /usr/lib64/glusterfs/11.0/xlator/cluster/replicate.so
No symbol table info available.
#12 0x00007fe7c1285e39 in dht_lookup () from /usr/lib64/glusterfs/11.0/xlator/cluster/distribute.so
No symbol table info available.
#13 0x00007fe7c1001721 in qd_nameless_lookup () from /usr/lib64/glusterfs/11.0/xlator/features/quotad.so
No symbol table info available.
#14 0x00007fe7c1002a72 in quotad_aggregator_lookup () from /usr/lib64/glusterfs/11.0/xlator/features/quotad.so
No symbol table info available.
#15 0x00007fe7d76734f2 in rpcsvc_handle_rpc_call () from /lib64/libgfrpc.so.0
No symbol table info available.
#16 0x00007fe7d7673b0e in rpcsvc_notify () from /lib64/libgfrpc.so.0
No symbol table info available.
#17 0x00007fe7d76752ab in rpc_transport_notify () from /lib64/libgfrpc.so.0
No symbol table info available.
#18 0x00007fe7c307e7dc in socket_event_poll_in_async () from /usr/lib64/glusterfs/11.0/rpc-transport/socket.so
No symbol table info available.
#19 0x00007fe7c30859ac in socket_event_handler () from /usr/lib64/glusterfs/11.0/rpc-transport/socket.so
No symbol table info available.
#20 0x00007fe7d7923bbd in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
No symbol table info available.
#21 0x00007fe7d60501cf in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#22 0x00007fe7d53f5e73 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x7fe7c29c7700 (LWP 34769)):
#0  0x00007fe7d57a1cd3 in tc_calloc () from /lib64/libtcmalloc_minimal.so.4
No symbol table info available.
#1  0x00007fe7d78f0440 in __gf_calloc () from /lib64/libglusterfs.so.0
No symbol table info available.
#2  0x00007fe7d78da141 in inode_create () from /lib64/libglusterfs.so.0
No symbol table info available.
#3  0x00007fe7d78db85b in inode_new () from /lib64/libglusterfs.so.0
No symbol table info available.
#4  0x00007fe7c100153e in qd_nameless_lookup () from /usr/lib64/glusterfs/11.0/xlator/features/quotad.so
No symbol table info available.
#5  0x00007fe7c1002a72 in quotad_aggregator_lookup () from /usr/lib64/glusterfs/11.0/xlator/features/quotad.so
No symbol table info available.
#6  0x00007fe7d76734f2 in rpcsvc_handle_rpc_call () from /lib64/libgfrpc.so.0
No symbol table info available.
#7  0x00007fe7d7673b0e in rpcsvc_notify () from /lib64/libgfrpc.so.0
No symbol table info available.
#8  0x00007fe7d76752ab in rpc_transport_notify () from /lib64/libgfrpc.so.0
No symbol table info available.
#9  0x00007fe7c307e7dc in socket_event_poll_in_async () from /usr/lib64/glusterfs/11.0/rpc-transport/socket.so
No symbol table info available.
#10 0x00007fe7c30859ac in socket_event_handler () from /usr/lib64/glusterfs/11.0/rpc-transport/socket.so
No symbol table info available.
#11 0x00007fe7d7923bbd in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
No symbol table info available.
#12 0x00007fe7d60501cf in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#13 0x00007fe7d53f5e73 in clone () from /lib64/libc.so.6
No symbol table info available.
```
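As a side note, gdb reports missing separate debuginfo, which is why every frame shows "No symbol table info available." Installing the package gdb suggests and re-running should produce a backtrace with arguments and locals; a rough sketch (the exact package name must match the installed build):

```sh
# Package name taken from gdb's own hint above; adjust to the installed build
yum debuginfo-install glusterfs-fuse-11.0-1.el8s.x86_64

# Re-run the backtrace collection non-interactively
gdb -batch -ex 'thread apply all bt full' /usr/sbin/glusterfsd /var/crash/core-34763
```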
How can I disable simple-quota?
Hello. I'm a new GlusterFS user and I have the same problem. Please tell us how we can disable simple-quota while keeping quota working for volumes in general.
I have a replicated volume (3 replicas), and if 2 of its quotad processes crash, the volume becomes unusable (I think only for writing). Other volumes, where I did not configure quota, work fine.
glusterfs version: 11.1
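In case it helps, this is how I check which quotad processes have crashed (same command as in the original report; the volume name is a placeholder):

```sh
# Look for "N" in the Online column of the Quota Daemon rows
gluster volume status myvol | grep "Quota Daemon"
```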
Since I also hit this issue and it took me a while to find what was in front of my eyes, I'll share what I found: simple-quota is a Kadalu feature, so we can't disable it on the GlusterFS server, only in Kadalu. There is a troubleshooting guide that mentions "simple-quota" and says we can disable it, but only temporarily, until the pods are restarted.
The linked issue also mentions a deprecation of the quota feature, but in "Red Hat Gluster Storage". I'm not sure what the status of quota is in the open source GlusterFS project.
The documentation also mentions Gluster storage managed by Kubernetes, and I assume this is the recommended way.
So for me, it looks like using an external Gluster Storage server is not recommended, but possible if we don't use quota. Otherwise it seems unstable.
The Kadalu documentation also mentions the Kadalu Storage project without Kubernetes. I assume it could be used instead of glusterd, but then why not configure it directly in Kubernetes? At least when it is an option, when we have access to the nodes and can add disks to the Kubernetes cluster nodes.
So this is what I think I understood, but it would be great if someone from GlusterFS could confirm.
Hi rimelek, I am also experiencing the same issue when using Kadalu with an external 3-node gluster with quotas enabled. However, if I use 2 nodes instead of 3, still with quotas enabled, I have no issues. Have you had a chance to learn anything more since you last wrote? I would be grateful if you could provide me with any additional information!
@GitHubRik Unfortunately, I have no more information that could help you fix this. We used Kadalu with GlusterFS for a short time after the issue, but eventually the whole Gluster cluster crashed; we could only restore the data from backup and switched to Longhorn, which we wanted to use anyway. Since then, I haven't had time to work on the correct GlusterFS configuration, since we didn't really need it.
Hopefully someone will reply here as well, but until then, I guess you could try to contact the Gluster community https://www.gluster.org/community/
Hi rimelek, in the meantime, thank you very much for your reply. I'll try to evaluate some alternatives. Our requirement is to use external storage for the Kubernetes cluster, but we don't want to choose Ceph. Kadalu with external Gluster seemed to be an optimal solution, but unfortunately the community has not been able to implement a stable solution with quotas enabled in order to achieve dynamic provisioning. If you have any suggestions, they would be greatly appreciated. Thank you.