Fluent-bit hanging and stopping operation
Bug Report
Describe the bug
From time to time, the fluent-bit application hangs and stops collecting logs. No activity is performed, there are no logs from fluent-bit itself, and metrics stop being generated. The Kubernetes pod does not crash or restart when this issue occurs, and it remains in the Ready state. This problem can only be detected manually by noticing that the application's logs have not been delivered.
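One practical way to detect the stall automatically is to poll the built-in HTTP server and compare output counters over time. A minimal sketch, assuming the metrics port is 2020, the Elasticsearch output instance is named es.0, and curl and jq are available (all of which vary by deployment):
prev=$(curl -s http://localhost:2020/api/v1/metrics | jq '.output["es.0"].proc_records')
sleep 60
curr=$(curl -s http://localhost:2020/api/v1/metrics | jq '.output["es.0"].proc_records')
# If the counter has not advanced, the pipeline has likely stalled even though the pod is Ready.
[ "$curr" -gt "$prev" ] || echo "fluent-bit output appears stalled (proc_records=$curr)"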
To Reproduce
- Deploy helm chart to Kubernetes cluster
- Start collecting logs from application in cluster
- Wait for a problem
Expected behavior
Stable operation and collection of all logs without hangs.
Your Environment
- Version used: chart version 0.47.9, app version 3.1.7
- Configuration:
[SERVICE]
Daemon Off
Flush {{ .Values.flush }}
Log_Level {{ .Values.logLevel }}
Parsers_File /fluent-bit/etc/parsers.conf
Parsers_File /fluent-bit/etc/conf/custom_parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port {{ .Values.metricsPort }}
Health_Check On
storage.sync normal
storage.checksum off
storage.path /var/log/flb-storage/
[INPUT]
Name systemd
Tag host.*
Read_From_Tail On
storage.type filesystem
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser cri
Skip_Long_Lines On
storage.type filesystem
[INPUT]
Name tail
Tag audit.*
Path /var/log/kube-audit/kube-audit.log
storage.type filesystem
[FILTER]
Name modify
Match host.*
Rename _HOSTNAME @HOSTNAME
Rename _MACHINE_ID @MACHINE_ID
Rename _BOOT_ID @BOOT_ID
Rename _CMDLINE @CMDLINE
Rename _COMM @COMM
Rename _EXE @EXE
Rename _PID @PID
Rename _UID @UID
Rename _GID @GID
Rename _CAP_EFFECTIVE @CAP_EFFECTIVE
Rename _SELINUX_CONTEXT @SELINUX_CONTEXT
Rename _AUDIT_LOGINUID @AUDIT_LOGINUID
Rename _AUDIT_SESSION @AUDIT_SESSION
Rename _SYSTEMD_CGROUP @SYSTEMD_CGROUP
Rename _SYSTEMD_SLICE @SYSTEMD_SLICE
Rename _SYSTEMD_UNIT @SYSTEMD_UNIT
Rename _SYSTEMD_SESSION @SYSTEMD_SESSION
Rename _SYSTEMD_OWNER_UID @SYSTEMD_OWNER_UID
Rename _TRANSPORT @TRANSPORT
Rename _KERNEL_DEVICE @KERNEL_DEVICE
Rename _KERNEL_SUBSYSTEM @KERNEL_SUBSYSTEM
Rename _UDEV_SYSNAME @UDEV_SYSNAME
Rename _UDEV_DEVNODE @UDEV_DEVNODE
Rename _UDEV_DEVLINK @UDEV_DEVLINK
Remove _SOURCE_REALTIME_TIMESTAMP
[FILTER]
Name parser
Match kube.*
Key_Name message
Parser json_custom
Reserve_Data True
[FILTER]
Name kubernetes
Match kube.*
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
[FILTER]
Name lua
Match kube.*
script /fluent-bit/scripts/filters.lua
Call modify_log
[FILTER]
Name rewrite_tag
Match kube.*
Rule $es_index_date ^(daily)$ daily.$TAG false
[FILTER]
Name rewrite_tag
Match kube.*
Rule $es_index_date ^(weekly)$ weekly.$TAG false
[FILTER]
Name rewrite_tag
Match kube.*
Rule $es_index_date ^(monthly)$ monthly.$TAG false
[FILTER]
Name record_modifier
Match audit.*
Record cluster <cluster-name>
[OUTPUT]
Name es
Match host.*
Host ${ELASTICSEARCH_HOST}
Port ${ELASTICSEARCH_PORT}
HTTP_User ${ELASTICSEARCH_USER}
HTTP_Passwd ${ELASTICSEARCH_PASSWORD}
Logstash_Format On
Logstash_Prefix infra=journal
Logstash_DateFormat %Y-%m-%d
Time_Key @SOURCE_REALTIME_TIMESTAMP
Time_Key_Format %Y-%m-%dT%H:%M:%S
storage.total_limit_size 1G
[OUTPUT]
Name es
Match daily.kube.*
Host ${ELASTICSEARCH_HOST}
Port ${ELASTICSEARCH_PORT}
HTTP_User ${ELASTICSEARCH_USER}
HTTP_Passwd ${ELASTICSEARCH_PASSWORD}
Logstash_Format On
Logstash_Prefix_Key pre_index
Logstash_DateFormat %Y-%m-%d
Time_Key_Nanos On
storage.total_limit_size 1G
[OUTPUT]
Name es
Match weekly.kube.*
Host ${ELASTICSEARCH_HOST}
Port ${ELASTICSEARCH_PORT}
HTTP_User ${ELASTICSEARCH_USER}
HTTP_Passwd ${ELASTICSEARCH_PASSWORD}
Logstash_Format On
Logstash_Prefix_Key pre_index
Logstash_DateFormat %Y-w%W
Time_Key_Nanos On
storage.total_limit_size 1G
[OUTPUT]
Name es
Match monthly.kube.*
Host ${ELASTICSEARCH_HOST}
Port ${ELASTICSEARCH_PORT}
HTTP_User ${ELASTICSEARCH_USER}
HTTP_Passwd ${ELASTICSEARCH_PASSWORD}
Logstash_Format On
Logstash_Prefix_Key pre_index
Logstash_DateFormat %Y-m%m
Time_Key_Nanos On
storage.total_limit_size 1G
[OUTPUT]
name syslog
match audit.*
host <private-host>
port 15110
mode udp
syslog_format rfc3164
syslog_maxsize 2048
syslog_hostname_key cluster
syslog_message_key log
storage.total_limit_size 4G
[PARSER]
Name json_custom
Format json
filters.lua: |
function modify_log(tag, timestamp, record)
-- Add cluster name.
record.kubernetes.cluster = "{{ .Environment.Name }}"
-- Check that annotations exist.
if record["kubernetes"]["annotations"] then
-- Check that es-index-project annotation exists.
if record["kubernetes"]["annotations"]["example.com/es-index-project"] then
-- Check that es-index-project annotation contains a value
if string.len(record["kubernetes"]["annotations"]["example.com/es-index-project"]) > 0 then
es_index_project = record["kubernetes"]["annotations"]["example.com/es-index-project"]
else
-- If there is no value, assign "fixme" so that the log is not lost.
es_index_project = "fixme"
end
-- Check that es-index-date annotation exists.
if record["kubernetes"]["annotations"]["example.com/es-index-date"] then
es_index_date = record["kubernetes"]["annotations"]["example.com/es-index-date"]
if es_index_date == "daily" then
record.es_index_date = es_index_date
elseif es_index_date == "weekly" then
record.es_index_date = es_index_date
elseif es_index_date == "monthly" then
record.es_index_date = es_index_date
else
-- Discard the log if es-index-date is not equal to daily|weekly|monthly.
return -1, 0, 0
end
-- Check that es-index-ext annotation exists.
if record["kubernetes"]["annotations"]["example.com/es-index-ext"] then
es_index_ext = record["kubernetes"]["annotations"]["example.com/es-index-ext"]
-- Check that es-index-ext is not empty.
if string.len(es_index_ext) > 0 then
-- Create a prefix with es_index_ext key for the index.
record.pre_index = es_index_project .. "=k8s-" .. es_index_ext
end
else
-- Create a prefix without es_index_ext key for the index.
record.pre_index = es_index_project .. "=k8s"
end
else
-- Discard the log if the es-index-date annotation does not exist.
return -1, 0, 0
end
else
-- Discard the log if the es-index-project annotation does not exist.
return -1, 0, 0
end
else
-- Discard the log if the kubernetes annotations do not exist.
return -1, 0, 0
end
-- Return the modified log.
return 2, timestamp, record
end
- Environment name and version (e.g. Kubernetes? What version?): vanilla kubernetes v1.30.6
- Server type and version: VM on libvirt
- Operating System and version: CentOS 9 Stream 6.10.10-1.el9.elrepo.x86_64
- Filters and plugins: see the configuration above; a sketch of the pod annotations the Lua script expects follows.
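For reference, the Lua script above expects each pod to carry annotations along these lines (a hypothetical example; the project and ext values are placeholders):
kubectl annotate pod my-app-pod \
  example.com/es-index-project=myproject \
  example.com/es-index-date=daily \
  example.com/es-index-ext=web
With these values the script sets pre_index to "myproject=k8s-web" and es_index_date to "daily", which the rewrite_tag filters then use to route the record to the daily Elasticsearch output.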
Additional context
The issue does not occur frequently: it happens a few times per week on one of over 400 instances. It might occur more often than we notice, given how difficult the problem is to detect.
I can confirm that I am experiencing the same issue with the same configuration and versions. The behavior is identical to what has been reported.
@Snak3GMS @nikitamishagin any logs that you can attach?
The behavior isn't an exact match, but these might all be related: https://github.com/fluent/fluent-bit/issues/9917 https://github.com/fluent/fluent-bit/issues/9906
All of them exhibit hanging behavior.
@pawel-lmcb
- strace
[pid 18] 00:12:40.898547 <... restart_syscall resumed>) = 0 <0.891804>
[pid 18] 00:12:40.898699 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, <unfinished ...>
[pid 1] 00:12:41.064008 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000104>
[pid 1] 00:12:41.064069 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, <unfinished ...>
[pid 18] 00:12:41.898923 <... clock_nanosleep resumed>0x7f2a635fb760) = 0 <1.000146>
[pid 18] 00:12:41.899006 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, <unfinished ...>
[pid 1] 00:12:42.064434 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000231>
[pid 1] 00:12:42.064566 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, <unfinished ...>
[pid 16] 00:12:42.152983 <... epoll_wait resumed>[{EPOLLIN, {u32=1684037792, u64=139820049391776}}], 64, -1) = 1 <1.498995>
[pid 15] 00:12:42.153083 <... epoll_wait resumed>[{EPOLLIN, {u32=1688232096, u64=139820053586080}}], 64, -1) = 1 <1.499113>
[pid 14] 00:12:42.153209 <... epoll_wait resumed>[{EPOLLIN, {u32=1692426400, u64=139820057780384}}], 64, -1) = 1 <1.499217>
[pid 13] 00:12:42.153285 <... epoll_wait resumed>[{EPOLLIN, {u32=1696620704, u64=139820061974688}}], 64, -1) = 1 <1.499286>
[pid 12] 00:12:42.153319 <... epoll_wait resumed>[{EPOLLIN, {u32=1742758048, u64=139820108112032}}], 64, -1) = 1 <1.499237>
[pid 11] 00:12:42.153364 <... epoll_wait resumed>[{EPOLLIN, {u32=1746952352, u64=139820112306336}}], 64, -1) = 1 <1.499269>
[pid 10] 00:12:42.153389 <... epoll_wait resumed>[{EPOLLIN, {u32=1751146656, u64=139820116500640}}], 64, -1) = 1 <1.499238>
[pid 9] 00:12:42.153414 <... epoll_wait resumed>[{EPOLLIN, {u32=1755340960, u64=139820120694944}}], 64, -1) = 1 <1.499250>
[pid 16] 00:12:42.153447 read(184, <unfinished ...>
[pid 15] 00:12:42.153461 read(180, <unfinished ...>
[pid 14] 00:12:42.153474 read(176, <unfinished ...>
[pid 13] 00:12:42.153501 read(172, <unfinished ...>
[pid 12] 00:12:42.153515 read(135, <unfinished ...>
[pid 11] 00:12:42.153534 read(137, <unfinished ...>
[pid 10] 00:12:42.153548 read(136, <unfinished ...>
[pid 16] 00:12:42.153574 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000111>
[pid 15] 00:12:42.153629 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000158>
[pid 14] 00:12:42.153650 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000167>
[pid 13] 00:12:42.153669 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000160>
[pid 12] 00:12:42.153690 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000166>
[pid 11] 00:12:42.153712 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000168>
[pid 9] 00:12:42.153734 read(134, <unfinished ...>
[pid 16] 00:12:42.153754 epoll_wait(161, <unfinished ...>
[pid 15] 00:12:42.153803 epoll_wait(158, <unfinished ...>
[pid 14] 00:12:42.153828 epoll_wait(149, <unfinished ...>
[pid 13] 00:12:42.153851 epoll_wait(140, <unfinished ...>
[pid 12] 00:12:42.153871 epoll_wait(121, <unfinished ...>
[pid 11] 00:12:42.153893 epoll_wait(118, <unfinished ...>
[pid 10] 00:12:42.153908 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000349>
[pid 16] 00:12:42.153966 <... epoll_wait resumed>[], 64, 0) = 0 <0.000189>
[pid 15] 00:12:42.153986 <... epoll_wait resumed>[], 64, 0) = 0 <0.000168>
[pid 14] 00:12:42.154003 <... epoll_wait resumed>[], 64, 0) = 0 <0.000159>
[pid 13] 00:12:42.154020 <... epoll_wait resumed>[], 64, 0) = 0 <0.000157>
[pid 12] 00:12:42.154037 <... epoll_wait resumed>[], 64, 0) = 0 <0.000152>
[pid 11] 00:12:42.154055 <... epoll_wait resumed>[], 64, 0) = 0 <0.000153>
[pid 9] 00:12:42.154072 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000329>
[pid 16] 00:12:42.154126 epoll_wait(161, <unfinished ...>
[pid 15] 00:12:42.154161 epoll_wait(158, <unfinished ...>
[pid 14] 00:12:42.154176 epoll_wait(149, <unfinished ...>
[pid 13] 00:12:42.154188 epoll_wait(140, <unfinished ...>
[pid 12] 00:12:42.154203 epoll_wait(121, <unfinished ...>
[pid 11] 00:12:42.154216 epoll_wait(118, <unfinished ...>
[pid 10] 00:12:42.154229 epoll_wait(109, <unfinished ...>
[pid 9] 00:12:42.154242 epoll_wait(106, <unfinished ...>
[pid 10] 00:12:42.154258 <... epoll_wait resumed>[], 64, 0) = 0 <0.000020>
[pid 9] 00:12:42.154278 <... epoll_wait resumed>[], 64, 0) = 0 <0.000027>
[pid 10] 00:12:42.154297 epoll_wait(109, <unfinished ...>
[pid 9] 00:12:42.154313 epoll_wait(106, <unfinished ...>
[pid 19] 00:12:42.652866 <... epoll_wait resumed>[{EPOLLIN, {u32=1644290128, u64=139820009644112}}], 256, -1) = 1 <2.646141>
[pid 19] 00:12:42.652957 read(202, "\1\0\0\0\0\0\0\0", 8) = 8 <0.000025>
[pid 19] 00:12:42.653034 epoll_wait(196, <unfinished ...>
[pid 18] 00:12:42.899288 <... clock_nanosleep resumed>0x7f2a635fb760) = 0 <1.000156>
[pid 18] 00:12:42.899357 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, <unfinished ...>
[pid 1] 00:12:43.064814 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000198>
[pid 1] 00:12:43.064876 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, <unfinished ...>
[pid 15] 00:12:43.652891 <... epoll_wait resumed>[{EPOLLIN, {u32=1688232096, u64=139820053586080}}], 64, -1) = 1 <1.498676>
[pid 13] 00:12:43.653018 <... epoll_wait resumed>[{EPOLLIN, {u32=1696620704, u64=139820061974688}}], 64, -1) = 1 <1.498813>
[pid 11] 00:12:43.653115 <... epoll_wait resumed>[{EPOLLIN, {u32=1746952352, u64=139820112306336}}], 64, -1) = 1 <1.498886>
[pid 9] 00:12:43.653176 <... epoll_wait resumed>[{EPOLLIN, {u32=1755340960, u64=139820120694944}}], 64, -1) = 1 <1.498851>
[pid 16] 00:12:43.653209 <... epoll_wait resumed>[{EPOLLIN, {u32=1684037792, u64=139820049391776}}], 64, -1) = 1 <1.499074>
[pid 15] 00:12:43.653234 read(180, <unfinished ...>
[pid 14] 00:12:43.653252 <... epoll_wait resumed>[{EPOLLIN, {u32=1692426400, u64=139820057780384}}], 64, -1) = 1 <1.499068>
[pid 13] 00:12:43.653276 read(172, <unfinished ...>
[pid 12] 00:12:43.653292 <... epoll_wait resumed>[{EPOLLIN, {u32=1742758048, u64=139820108112032}}], 64, -1) = 1 <1.499079>
[pid 11] 00:12:43.653315 read(137, <unfinished ...>
[pid 10] 00:12:43.653328 <... epoll_wait resumed>[{EPOLLIN, {u32=1751146656, u64=139820116500640}}], 64, -1) = 1 <1.499021>
[pid 16] 00:12:43.653356 read(184, <unfinished ...>
[pid 15] 00:12:43.653372 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000127>
[pid 14] 00:12:43.653395 read(176, <unfinished ...>
[pid 13] 00:12:43.653409 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000124>
[pid 11] 00:12:43.653429 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000105>
[pid 9] 00:12:43.653450 read(134, <unfinished ...>
[pid 16] 00:12:43.653474 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000108>
[pid 15] 00:12:43.653504 epoll_wait(158, <unfinished ...>
[pid 14] 00:12:43.653567 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000161>
[pid 13] 00:12:43.653590 epoll_wait(140, <unfinished ...>
[pid 12] 00:12:43.653603 read(135, <unfinished ...>
[pid 11] 00:12:43.653617 epoll_wait(118, <unfinished ...>
[pid 10] 00:12:43.653631 read(136, <unfinished ...>
[pid 9] 00:12:43.653641 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000182>
[pid 16] 00:12:43.653672 epoll_wait(161, <unfinished ...>
[pid 15] 00:12:43.653685 <... epoll_wait resumed>[], 64, 0) = 0 <0.000171>
[pid 14] 00:12:43.653702 epoll_wait(149, <unfinished ...>
[pid 13] 00:12:43.653717 <... epoll_wait resumed>[], 64, 0) = 0 <0.000118>
[pid 12] 00:12:43.653732 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000121>
[pid 11] 00:12:43.653754 <... epoll_wait resumed>[], 64, 0) = 0 <0.000127>
[pid 10] 00:12:43.653801 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000162>
[pid 9] 00:12:43.653825 epoll_wait(106, <unfinished ...>
[pid 16] 00:12:43.653848 <... epoll_wait resumed>[], 64, 0) = 0 <0.000167>
[pid 15] 00:12:43.653865 epoll_wait(158, <unfinished ...>
[pid 14] 00:12:43.653878 <... epoll_wait resumed>[], 64, 0) = 0 <0.000166>
[pid 13] 00:12:43.653894 epoll_wait(140, <unfinished ...>
[pid 12] 00:12:43.653908 epoll_wait(121, <unfinished ...>
[pid 11] 00:12:43.653922 epoll_wait(118, <unfinished ...>
[pid 10] 00:12:43.653938 epoll_wait(109, <unfinished ...>
[pid 9] 00:12:43.653952 <... epoll_wait resumed>[], 64, 0) = 0 <0.000118>
[pid 16] 00:12:43.653974 epoll_wait(161, <unfinished ...>
[pid 14] 00:12:43.653986 epoll_wait(149, <unfinished ...>
[pid 12] 00:12:43.653998 <... epoll_wait resumed>[], 64, 0) = 0 <0.000081>
[pid 10] 00:12:43.654015 <... epoll_wait resumed>[], 64, 0) = 0 <0.000069>
[pid 12] 00:12:43.654038 epoll_wait(121, <unfinished ...>
[pid 9] 00:12:43.654051 epoll_wait(106, <unfinished ...>
[pid 10] 00:12:43.654067 epoll_wait(109, <unfinished ...>
[pid 18] 00:12:43.899534 <... clock_nanosleep resumed>0x7f2a635fb760) = 0 <1.000131>
[pid 18] 00:12:43.899609 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, <unfinished ...>
[pid 1] 00:12:44.065226 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000236>
[pid 1] 00:12:44.065374 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, <unfinished ...>
[pid 19] 00:12:44.241955 <... epoll_wait resumed>[{EPOLLIN, {u32=1644212352, u64=139820009566336}}], 256, -1) = 1 <1.588873>
[pid 19] 00:12:44.242051 accept4(199, {sa_family=AF_INET, sin_port=htons(33386), sin_addr=inet_addr("10.99.10.35")}, [16], SOCK_CLOEXEC|SOCK_NONBLOCK) = 46 <0.000034>
[pid 19] 00:12:44.242162 epoll_ctl(196, EPOLL_CTL_ADD, 46, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1644781568, u64=139820010135552}}) = 0 <0.000017>
[pid 19] 00:12:44.242232 epoll_wait(196, [{EPOLLIN, {u32=1644781568, u64=139820010135552}}], 256, -1) = 1 <0.000014>
[pid 19] 00:12:44.242305 recvfrom(46, "GET /api/v1/metrics/prometheus H"..., 4096, 0, NULL, NULL) = 1733 <0.000017>
[pid 19] 00:12:44.242478 newfstatat(AT_FDCWD, "/api/v1/metrics/prometheus", 0x7f2a62bfafd0, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory) <0.000034>
[pid 19] 00:12:44.242809 writev(46, [{iov_base="HTTP/1.1 200 OK\r\n", iov_len=17}, {iov_base="Server: Monkey/1.7.2\r\nDate: Sat,"..., iov_len=59}, {iov_base="Transfer-Encoding: chunked\r\n", iov_len=28}], 3) = 104 <0.000050>
[pid 19] 00:12:44.242926 writev(46, [{iov_base="Content-Type: text/plain; versio"..., iov_len=41}, {iov_base="\r\n", iov_len=2}], 2) = 43 <0.000034>
[pid 19] 00:12:44.243018 sendto(46, "2255\r\n", 6, 0, NULL, 0) = 6 <0.000054>
[pid 19] 00:12:44.243136 epoll_ctl(196, EPOLL_CTL_DEL, 46, NULL) = 0 <0.000019>
[pid 19] 00:12:44.243221 sendto(46, "# HELP fluentbit_filter_add_reco"..., 8789, 0, NULL, 0) = 8789 <0.000075>
[pid 19] 00:12:44.243357 sendto(46, "\r\n", 2, 0, NULL, 0) = 2 <0.000019>
[pid 19] 00:12:44.243430 epoll_ctl(196, EPOLL_CTL_ADD, 46, {EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1644781568, u64=139820010135552}}) = 0 <0.000014>
[pid 19] 00:12:44.243504 epoll_wait(196, [{EPOLLOUT, {u32=1644781568, u64=139820010135552}}], 256, -1) = 1 <0.000016>
[pid 19] 00:12:44.243588 epoll_ctl(196, EPOLL_CTL_DEL, 46, NULL) = 0 <0.000045>
[pid 19] 00:12:44.243699 sendto(46, "0\r\n\r\n", 5, 0, NULL, 0) = 5 <0.000015>
[pid 19] 00:12:44.243811 epoll_ctl(196, EPOLL_CTL_ADD, 46, {EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1644781568, u64=139820010135552}}) = 0 <0.000021>
[pid 19] 00:12:44.243883 epoll_wait(196, [{EPOLLOUT, {u32=1644781568, u64=139820010135552}}], 256, -1) = 1 <0.000029>
[pid 19] 00:12:44.243958 epoll_ctl(196, EPOLL_CTL_DEL, 46, NULL) = 0 <0.000015>
[pid 19] 00:12:44.244051 epoll_ctl(196, EPOLL_CTL_ADD, 46, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1644781568, u64=139820010135552}}) = 0 <0.000015>
[pid 19] 00:12:44.244122 epoll_wait(196, <unfinished ...>
[pid 18] 00:12:44.899854 <... clock_nanosleep resumed>0x7f2a635fb760) = 0 <1.000158>
[pid 18] 00:12:44.899931 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, <unfinished ...>
[pid 1] 00:12:45.065666 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000178>
[pid 1] 00:12:45.065735 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, <unfinished ...>
[pid 16] 00:12:45.154860 <... epoll_wait resumed>[{EPOLLIN, {u32=1684037792, u64=139820049391776}}], 64, -1) = 1 <1.500826>
[pid 15] 00:12:45.154910 <... epoll_wait resumed>[{EPOLLIN, {u32=1688232096, u64=139820053586080}}], 64, -1) = 1 <1.501036>
[pid 13] 00:12:45.154936 <... epoll_wait resumed>[{EPOLLIN, {u32=1696620704, u64=139820061974688}}], 64, -1) = 1 <1.501033>
[pid 12] 00:12:45.154972 <... epoll_wait resumed>[{EPOLLIN, {u32=1742758048, u64=139820108112032}}], 64, -1) = 1 <1.500919>
[pid 11] 00:12:45.154999 <... epoll_wait resumed>[{EPOLLIN, {u32=1746952352, u64=139820112306336}}], 64, -1) = 1 <1.501068>
[pid 10] 00:12:45.155025 <... epoll_wait resumed>[{EPOLLIN, {u32=1751146656, u64=139820116500640}}], 64, -1) = 1 <1.500950>
[pid 9] 00:12:45.155050 <... epoll_wait resumed>[{EPOLLIN, {u32=1755340960, u64=139820120694944}}], 64, -1) = 1 <1.500989>
- Curl request to '/api/v1/storage'
{
"storage_layer": {
"chunks": {
"total_chunks": 4,
"mem_chunks": 4,
"fs_chunks": 0,
"fs_chunks_up": 0,
"fs_chunks_down": 0
}
},
"input_chunks": {
"systemd.0": {
"status": {
"overlimit": false,
"mem_size": "0b",
"mem_limit": "0b"
},
"chunks": {
"total": 0,
"up": 0,
"down": 0,
"busy": 0,
"busy_size": "0b"
}
},
"tail.1": {
"status": {
"overlimit": false,
"mem_size": "0b",
"mem_limit": "0b"
},
"chunks": {
"total": 0,
"up": 0,
"down": 0,
"busy": 0,
"busy_size": "0b"
}
},
"tail.2": {
"status": {
"overlimit": false,
"mem_size": "0b",
"mem_limit": "0b"
},
"chunks": {
"total": 0,
"up": 0,
"down": 0,
"busy": 0,
"busy_size": "0b"
}
},
"storage_backlog.3": {
"status": {
"overlimit": false,
"mem_size": "0b",
"mem_limit": "0b"
},
"chunks": {
"total": 0,
"up": 0,
"down": 0,
"busy": 0,
"busy_size": "0b"
}
},
"emitter_for_rewrite_tag.4": {
"status": {
"overlimit": false,
"mem_size": "0b",
"mem_limit": "9.5M"
},
"chunks": {
"total": 0,
"up": 0,
"down": 0,
"busy": 0,
"busy_size": "0b"
}
},
"emitter_for_rewrite_tag.5": {
"status": {
"overlimit": false,
"mem_size": "22.2K",
"mem_limit": "9.5M"
},
"chunks": {
"total": 4,
"up": 4,
"down": 0,
"busy": 0,
"busy_size": "0b"
}
},
"emitter_for_rewrite_tag.6": {
"status": {
"overlimit": false,
"mem_size": "0b",
"mem_limit": "9.5M"
},
"chunks": {
"total": 0,
"up": 0,
"down": 0,
"busy": 0,
"busy_size": "0b"
}
}
}
}
- Pod describe (current state)
State: Running
Started: Mon, 30 Dec 2024 03:39:27 +0300
Ready: True
Restart Count: 2
Limits:
cpu: 200m
memory: 128Mi
Requests:
cpu: 200m
memory: 128Mi
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
- Last logs (stuck about a month ago)
[2025/01/09 15:24:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=337644498 watch_fd=4787 name=/var/log/containers/xxx-5784f8cbcd-2txq5_xxx_migrations-4b50a78e51a4e07aa11245e624fc4e67bfd0d4b34edb965388eb0eaf05c2dca5.log
[2025/01/09 15:24:28] [ warn] [engine] failed to flush chunk '1-1736436267.677796148.flb', retry in 9 seconds: task_id=7, input=emitter_for_rewrite_tag.6 > output=es.3 (out_id=3)
[2025/01/09 15:24:37] [error] [engine] chunk '1-1736436267.677796148.flb' cannot be retried: task_id=7, input=emitter_for_rewrite_tag.6 > output=es.3
[2025/01/09 15:25:26] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=433188932 watch_fd=4506
[2025/01/09 15:25:27] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=370864192 watch_fd=4775
[2025/01/09 15:25:28] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=332235753 watch_fd=4509
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=296558453 watch_fd=4756
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=449752668 watch_fd=4500
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=386694775 watch_fd=4748
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=72887180 watch_fd=4542
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=5697947 watch_fd=4383
[2025/01/09 15:27:27] [ info] [filter:kubernetes:kubernetes.2] token updated
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=391506156 watch_fd=4788 name=/var/log/containers/xxx-b57647789-gfbvk_xxx-static-dc5b651c143f9e36af9c19da6cde65e55d252417df909c551aad241f4aef5a64.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=223083475 watch_fd=4789 name=/var/log/containers/xxx-7b8c7bd55f-bn7ch_xxx-server-6520de96062c8f54116157f179ab0c8460c31ac6da042a33fd5f0cb2c59b57f7.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=400043213 watch_fd=4790 name=/var/log/containers/xxx-bob-7b8c7bd55f-bn7ch_xxx-xstatic-2a429bca73887c93233634ca80462272fc2669bb6cdf95da84c687d6dbab4ff9.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=244744155 watch_fd=4791 name=/var/log/containers/xxx-bob-7b8c7bd55f-rqxh2_xxx-xnews-server-5c8fc7ae42b6924bb0c2bfe9ed9ea4c4f82d46907696bfd9262f99f042b0f0bf.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=386711596 watch_fd=4792 name=/var/log/containers/xxx-bob-7b8c7bd55f-rqxh2_xxx-xstatic-e0ba58f5aca4b17063df13b4efd2018ecad2f896fdda82a612fb5875b6f01b22.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=213421478 watch_fd=4793 name=/var/log/containers/xxx-bob-b57647789-gfbvk_xxx-xserver-1c25bac499bfecc754d112857ec94805ced64f35db604af02c1d6a8c5c9b6423.log
[2025/01/09 15:27:28] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=391506156 watch_fd=4788
[2025/01/09 15:27:28] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=213421478 watch_fd=4793
These seem similar to #9903 as well. Could you also check the CPU usage when fluent-bit hangs?
@AzureLeMoon, OK. I will record the CPU usage and send it when the problem recurs.
@pawel-lmcb, there is nothing unusual in the logs. They look something like this:
[2025/02/10 07:28:51] [ warn] [engine] failed to flush chunk '1-1739172530.835782756.flb', retry in 7 seconds: task_id=3, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 07:28:58] [ info] [engine] flush chunk '1-1739172530.835782756.flb' succeeded at retry 1: task_id=3, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 08:00:46] [ warn] [engine] failed to flush chunk '1-1739174446.210853217.flb', retry in 8 seconds: task_id=0, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 08:00:54] [ info] [engine] flush chunk '1-1739174446.210853217.flb' succeeded at retry 1: task_id=0, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 08:04:53] [ warn] [engine] failed to flush chunk '1-1739174693.685714426.flb', retry in 10 seconds: task_id=3, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 08:05:03] [ info] [engine] flush chunk '1-1739174693.685714426.flb' succeeded at retry 1: task_id=3, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
And when the problem occurs, the logging just stops.
@AzureLeMoon Below is the CPU usage from the same host where the fluent-bit pod logs were taken. The fluent-bit pod on this host has been stuck since January 9th.
[xxx@xxx containers]$ mpstat 1 5
Linux 6.10.10-1.el9.elrepo.x86_64 (xxx) 02/10/2025 _x86_64_ (16 CPU)
12:02:14 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:02:15 PM all 8.98 0.00 1.32 0.00 0.00 0.06 0.00 0.00 0.00 89.64
12:02:16 PM all 7.76 0.00 0.94 0.00 0.00 0.25 0.00 0.00 0.00 91.05
12:02:17 PM all 5.47 0.00 0.82 0.00 0.00 0.06 0.00 0.00 0.00 93.65
12:02:18 PM all 6.03 0.00 1.38 0.00 0.00 0.06 0.00 0.00 0.00 92.52
12:02:19 PM all 7.83 0.00 2.19 0.00 0.00 0.06 0.00 0.00 0.00 89.91
Average: all 7.22 0.00 1.33 0.00 0.00 0.10 0.00 0.00 0.00 91.35
kubectl top pod fluent-bit-5gx94 --namespace kube-system
NAME CPU(cores) MEMORY(bytes)
fluent-bit-5gx94 1m 106Mi
Also bumping this one; it looks similar: https://github.com/fluent/fluent-bit/issues/9557
This appears to be the exact problem I have, as described in #9903.
same here
@AzureLeMoon, I've observed the CPU consumption of hung containers. It tends toward zero (<1 millicore).
Also on one of the hanging containers strace shows this:
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffd98404b70) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffd98404b70) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffd98404b70) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffd98404b70) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffd98404b70) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffd98404b70) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffd98404b70) = 0
then the lines repeat endlessly.
similar strace output as in https://github.com/fluent/fluent-bit/issues/9917
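If the hang recurs, a full thread backtrace would show what each engine thread is blocked on (the strace above only shows timers and idle epoll loops). A sketch, assuming gdb is available on the node; <pid> is a placeholder for the fluent-bit process ID:
# Dump backtraces of all threads without stopping the process for long.
gdb -p <pid> -batch -ex 'thread apply all bt' > fluent-bit-backtrace.txt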
Running fluent-bit v2.2.3, I cannot reproduce this issue.
Exactly the same here, with a similar config and fluent-bit v3.2.5. Some pods of our daemonset got stuck without collecting new logs.
CPU and memory consumption are below their limits
We noticed because there were some pods whose logs we could not find in Elasticsearch. As in @nikitamishagin's report, the pod is Ready with no restarts, despite its liveness probe being enabled (we're deploying fluent-bit using the fluent-operator v3.3.0 helm chart).
Our logs seem pretty normal:
[2025/04/16 09:27:24] [ info] [input] pausing tail.1
[2025/04/16 09:27:25] [ info] [input] pausing systemd.0
[2025/04/16 09:27:25] [ info] [input] pausing tail.1
[2025/04/16 09:27:26] [ info] [input] pausing systemd.0
[2025/04/16 09:27:26] [ info] [input] pausing tail.1
[2025/04/16 09:27:27] [ info] [input] pausing systemd.0
[2025/04/16 09:27:27] [ info] [input] pausing tail.1
[2025/04/16 09:27:28] [ info] [input] pausing systemd.0
[2025/04/16 09:27:28] [ info] [input] pausing tail.1
[2025/04/16 09:27:29] [ info] [input] pausing systemd.0
We have the same issue, any news about this?
Would it be possible for someone in this thread to provide a smaller repro case or help me in the reproduction process?
Hey @leonardo-albertovich, I would be happy to help. Could you please specify what is required?
In order to analyze the issue I need to reproduce it, so I'm looking for a repro case that can be minimized. The original one is a bit convoluted, so if anyone else has a different configuration that reproduces the issue (even if infrequently), I'd like to take a look at it.
That would help me find common ground between the reports and narrow down both the settings I use to try to reproduce the issue and the parts of the system I need to inspect.
Sorry, but that's all the information I have.
The problem reproduced only occasionally and for no apparent reason. I think it may be linked to the Elasticsearch output module. Also, you should generate enough workload that fluent-bit continuously sends data to Elasticsearch.
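For the workload part, a pod that prints JSON lines in a tight loop is enough to keep the tail input and the Elasticsearch output continuously busy. A sketch; the pod name, image, and message content are arbitrary:
# Flood the container log file that the tail input watches.
kubectl run log-flood --image=busybox:uclibc --restart=Never -- \
  sh -c 'while true; do echo "{\"msg\":\"load test\"}"; done'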
We cannot reproduce the issue on demand, but we ran into it when the OpenShift/Kubernetes API was not available. Inspired by this GitHub issue, we created our own livenessProbe as a workaround:
livenessProbe:
exec:
# The following check monitors the Fluent Bit metric "fluentbit_input_records_total{name="tail.0"}" to ensure log intake is progressing as expected.
# - It checks (and initializes, if needed) a reference file for the last recorded input count.
# - It fetches the current count from the Fluent Bit Prometheus endpoint.
# - If the new count is less than or equal to the previous value, it reports an error
# and exits (which may indicate issues with log collection or Kubernetes API).
# - If the count is valid (>=0), it updates the reference file for the next check.
command:
- /bin/bash
- -exc
- >
[ -e ./lastInputRecordsTotal ] || echo 0 > ./lastInputRecordsTotal;
function getInputRecordsTotal() {
set -o pipefail; curl -s --connect-timeout 5 --retry 3 -f http://localhost:2020/api/v1/metrics/prometheus | grep '^fluentbit_input_records_total{name="tail.0"}' | awk '{print $2}' || echo 0;
};
T=$(getInputRecordsTotal);
test -e ./lastInputRecordsTotal && {
test $(cat ./lastInputRecordsTotal) -ge $T && {
echo "The input record count has not increased; there may be an issue with log collection or the Kubernetes API (current input records = $T)";
exit 1;
};
};
test $T -ge 0 && { echo "$T" >| ./lastInputRecordsTotal; } || true;
initialDelaySeconds: 300
periodSeconds: 60
successThreshold: 1
failureThreshold: 5
timeoutSeconds: 5
I made the following repo you could use to reproduce the hanging issue as described in #9903, and I believe it should relate to this issue as well.
@leonardo-albertovich let me know if you need more data.
@R-Studio Thanks for sharing. You inspired me to write a different livenessProbe with a different goal: ensuring that the output doesn't hang. Your example only checks that input records are being processed, whereas the main goal here is ensuring that Fluent Bit is still forwarding logs to its configured output. I'm sharing it below:
exec:
command:
- /busybox/sh
- -ec
- |
state=/tmp/lastOutputBytesTotal
[ -f "$state" ] || echo 0 >"$state"
# Sum column 2 (the value) across all lines of the metric
total=$(
wget -qO- --timeout=5 http://localhost:2020/api/v1/metrics/prometheus \
| grep '^fluentbit_output_proc_bytes_total{name="es\.' \
| awk '{sum+=$2} END{print sum+0}' || echo 0
)
last=$(cat "$state")
if [ "$total" -le "$last" ]; then
echo "Bytes procesados estancados: $total (antes $last)"
exit 1
fi
echo "$total" >"$state"
initialDelaySeconds: 300
periodSeconds: 60
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 5
I also seem to be running into this issue (or something similar) when using fluent-operator. Just adding some information in case it helps.
I am running it as a daemonset and I am using the Istio ambient service mesh. In all instances I have been able to check, the hang happens during startup: if fluent-bit cannot successfully export within the first ~20 seconds, it appears to get into this stuck state. Manually restarting the pod gets it working again, and it backfills the logs.
I also use the opentelemetry plugin and it has the same problem of getting stuck and not sending metrics.
One thing I noticed is that the uptime value is not consistent between APIs.
http://localhost:2020/api/v2/metrics/prometheus
# HELP fluentbit_uptime Number of seconds that Fluent Bit has been running.
# TYPE fluentbit_uptime counter
fluentbit_uptime{hostname="fluent-bit-jbg6x"} 35
http://localhost:2020/api/v1/uptime
{"uptime_sec":30138,"uptime_hr":"Fluent Bit has been running: 0 day, 8 hours, 22 minutes and 18 seconds"}
I am going to try one of the livenessProbe workarounds in the meantime.
@R-Studio @joseeetm I am curious what image you are using that lets you run shell commands, as the normal image is distroless. Are you using the -debug image or maybe something custom? The -debug image is very large and I can't use it in my environment.
Hi @cosmo0920 and @edsiper, as Leonardo isn't part of the project anymore, would either of you be open to taking this issue forward?
I have made a procedure for reproducing this issue at this comment.
Let me know if you need any more context or want to sync up on next steps.
Same issue here using fluent-operator; the liveness probe workaround is not possible due to the distroless images.
Hello @danelson @networkhell. I'm using the distroless image as well, but with a little trick: an initContainer based on a busybox image copies the necessary binaries into place so the workaround works.
initContainers:
- args:
- -c
- cp -a /bin/sh /usr/local/bin && cp -a /bin/wget /usr/local/bin && cp -a /bin/awk /usr/local/bin && cp -a /bin/cat /usr/local/bin
command:
- sh
image: busybox:uclibc
imagePullPolicy: IfNotPresent
name: busybox-share-init
My workaround solves the issue, but it requires adding this initContainer.
Cheers.
@joseeetm thanks for the hint. Unfortunately I am using the fluent-operator, which currently has no support for a custom liveness probe.
@joseeetm thank you. Am I correct in assuming that you then also need a volume and volume mount so that the binaries can be accessed by the main container?
So I would override values.yaml with something like
fluentbit:
# List of volumes that can be mounted by containers belonging to the pod.
additionalVolumes:
- name: shared-bin
emptyDir: {}
# Pod volumes to mount into the container's filesystem.
additionalVolumesMounts:
- name: shared-bin
mountPath: /usr/local/bin
# initContainers configuration for Fluent Bit pods. Ref: https://kubernetes.io/docs/concepts/workloads/pods/init-container
initContainers:
- args:
- -c
- cp -a /bin/sh /usr/local/bin && cp -a /bin/wget /usr/local/bin && cp -a /bin/awk /usr/local/bin && cp -a /bin/cat /usr/local/bin
command:
- sh
image: busybox:uclibc
imagePullPolicy: IfNotPresent
name: busybox-share-init
livenessProbe:
exec:
command:
- /busybox/sh
- -ec
- |
# My custom probe
@networkhell you should be able to override the livenessProbe