
Fluent-bit hanging and stopping operation

Open nikitamishagin opened this issue 11 months ago • 23 comments

Bug Report

Describe the bug

From time to time, the fluent-bit application hangs and stops collecting logs. No activity is performed, there are no logs from fluent-bit itself, and metrics stop being generated. The Kubernetes pod does not crash or restart when this issue occurs, and it remains in the Ready state. This problem can only be detected manually by noticing that the application's logs have not been delivered.

To Reproduce

  • Deploy helm chart to Kubernetes cluster
  • Start collecting logs from application in cluster
  • Wait for the problem to occur

Expected behavior

Stable operation and collection of all logs without hangs.

Your Environment

  • Version used: chart version 0.47.9, app version 3.1.7
  • Configuration:
    [SERVICE]
        Daemon           Off
        Flush            {{ .Values.flush }}
        Log_Level        {{ .Values.logLevel }}
        Parsers_File     /fluent-bit/etc/parsers.conf
        Parsers_File     /fluent-bit/etc/conf/custom_parsers.conf
        HTTP_Server      On
        HTTP_Listen      0.0.0.0
        HTTP_Port        {{ .Values.metricsPort }}
        Health_Check     On
        storage.sync     normal
        storage.checksum off
        storage.path     /var/log/flb-storage/

    [INPUT]
        Name           systemd
        Tag            host.*
        Read_From_Tail On
        storage.type   filesystem

    [INPUT]
        Name            tail
        Tag             kube.*
        Path            /var/log/containers/*.log
        Parser          cri
        Skip_Long_Lines On
        storage.type    filesystem

    [INPUT]
        Name          tail
        Tag           audit.*
        Path          /var/log/kube-audit/kube-audit.log
        storage.type  filesystem

    [FILTER]
        Name modify
        Match host.*
        Rename _HOSTNAME @HOSTNAME
        Rename _MACHINE_ID @MACHINE_ID
        Rename _BOOT_ID @BOOT_ID
        Rename _CMDLINE @CMDLINE
        Rename _COMM @COMM
        Rename _EXE @EXE
        Rename _PID @PID
        Rename _UID @UID
        Rename _GID @GID
        Rename _CAP_EFFECTIVE @CAP_EFFECTIVE
        Rename _SELINUX_CONTEXT @SELINUX_CONTEXT
        Rename _AUDIT_LOGINUID @AUDIT_LOGINUID
        Rename _AUDIT_SESSION @AUDIT_SESSION
        Rename _SYSTEMD_CGROUP @SYSTEMD_CGROUP
        Rename _SYSTEMD_SLICE @SYSTEMD_SLICE
        Rename _SYSTEMD_UNIT @SYSTEMD_UNIT
        Rename _SYSTEMD_SESSION @SYSTEMD_SESSION
        Rename _SYSTEMD_OWNER_UID @SYSTEMD_OWNER_UID
        Rename _TRANSPORT @TRANSPORT
        Rename _KERNEL_DEVICE @KERNEL_DEVICE
        Rename _KERNEL_SUBSYSTEM @KERNEL_SUBSYSTEM
        Rename _UDEV_SYSNAME @UDEV_SYSNAME
        Rename _UDEV_DEVNODE @UDEV_DEVNODE
        Rename _UDEV_DEVLINK @UDEV_DEVLINK
        Remove _SOURCE_REALTIME_TIMESTAMP

    [FILTER]
        Name         parser
        Match        kube.*
        Key_Name     message
        Parser       json_custom
        Reserve_Data True

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [FILTER]
        Name   lua
        Match  kube.*
        script /fluent-bit/scripts/filters.lua
        Call   modify_log

    [FILTER]
        Name  rewrite_tag
        Match kube.*
        Rule  $es_index_date ^(daily)$ daily.$TAG false

    [FILTER]
        Name  rewrite_tag
        Match kube.*
        Rule  $es_index_date ^(weekly)$ weekly.$TAG false

    [FILTER]
        Name  rewrite_tag
        Match kube.*
        Rule  $es_index_date ^(monthly)$ monthly.$TAG false

    [FILTER]
        Name   record_modifier
        Match  audit.*
        Record cluster <cluster-name>

    [OUTPUT]
        Name                     es
        Match                    host.*
        Host                     ${ELASTICSEARCH_HOST}
        Port                     ${ELASTICSEARCH_PORT}
        HTTP_User                ${ELASTICSEARCH_USER}
        HTTP_Passwd              ${ELASTICSEARCH_PASSWORD}
        Logstash_Format          On
        Logstash_Prefix          infra=journal
        Logstash_DateFormat      %Y-%m-%d
        Time_Key                 @SOURCE_REALTIME_TIMESTAMP
        Time_Key_Format          %Y-%m-%dT%H:%M:%S
        storage.total_limit_size 1G

    [OUTPUT]
        Name                     es
        Match                    daily.kube.*
        Host                     ${ELASTICSEARCH_HOST}
        Port                     ${ELASTICSEARCH_PORT}
        HTTP_User                ${ELASTICSEARCH_USER}
        HTTP_Passwd              ${ELASTICSEARCH_PASSWORD}
        Logstash_Format          On
        Logstash_Prefix_Key      pre_index
        Logstash_DateFormat      %Y-%m-%d
        Time_Key_Nanos           On
        storage.total_limit_size 1G

    [OUTPUT]
        Name                     es
        Match                    weekly.kube.*
        Host                     ${ELASTICSEARCH_HOST}
        Port                     ${ELASTICSEARCH_PORT}
        HTTP_User                ${ELASTICSEARCH_USER}
        HTTP_Passwd              ${ELASTICSEARCH_PASSWORD}
        Logstash_Format          On
        Logstash_Prefix_Key      pre_index
        Logstash_DateFormat      %Y-w%W
        Time_Key_Nanos           On
        storage.total_limit_size 1G

    [OUTPUT]
        Name                     es
        Match                    monthly.kube.*
        Host                     ${ELASTICSEARCH_HOST}
        Port                     ${ELASTICSEARCH_PORT}
        HTTP_User                ${ELASTICSEARCH_USER}
        HTTP_Passwd              ${ELASTICSEARCH_PASSWORD}
        Logstash_Format          On
        Logstash_Prefix_Key      pre_index
        Logstash_DateFormat      %Y-m%m
        Time_Key_Nanos           On
        storage.total_limit_size 1G

    [OUTPUT]
        name                     syslog
        match                    audit.*
        host                     <private-host>
        port                     15110
        mode                     udp
        syslog_format            rfc3164
        syslog_maxsize           2048
        syslog_hostname_key      cluster
        syslog_message_key       log
        storage.total_limit_size 4G

    [PARSER]
        Name   json_custom
        Format json

  filters.lua: |
    function modify_log(tag, timestamp, record)
        -- Add cluster name.
        record.kubernetes.cluster = "{{ .Environment.Name }}"
        -- Check that annotations exist.
        if record["kubernetes"]["annotations"] then
            -- Check that es-index-project annotation exists.
            if record["kubernetes"]["annotations"]["example.com/es-index-project"] then
                -- Check that es-index-project annotation contains a value
                if string.len(record["kubernetes"]["annotations"]["example.com/es-index-project"]) > 0 then
                    es_index_project = record["kubernetes"]["annotations"]["example.com/es-index-project"]
                else
                    -- If there is no value, assign "fixme" so as not to lose the logs.
                    es_index_project = "fixme"
                end
                -- Check that es-index-date annotation exists.
                if record["kubernetes"]["annotations"]["example.com/es-index-date"] then
                    es_index_date = record["kubernetes"]["annotations"]["example.com/es-index-date"]
                    if es_index_date == "daily" then
                        record.es_index_date = es_index_date
                    elseif es_index_date == "weekly" then
                        record.es_index_date = es_index_date
                    elseif es_index_date == "monthly" then
                        record.es_index_date = es_index_date
                    else
                        -- Discard the log if es-index-date is not equal to daily|weekly|monthly.
                        return -1, 0, 0
                    end
                    -- Check that es-index-ext annotation exists.
                    if record["kubernetes"]["annotations"]["example.com/es-index-ext"] then
                        es_index_ext = record["kubernetes"]["annotations"]["example.com/es-index-ext"]
                        -- Check that es-index-ext is not empty.
                        if string.len(es_index_ext) > 0 then
                            -- Create a prefix with es_index_ext key for the index.
                            record.pre_index = es_index_project .. "=k8s-" .. es_index_ext
                        end
                    else
                        -- Create a prefix without es_index_ext key for the index.
                        record.pre_index = es_index_project .. "=k8s"
                    end
                else
                    -- Discard the log if the es-index-date annotation does not exist.
                    return -1, 0, 0
                end
            else
                -- Discard the log if the es-index-project annotation does not exist.
                return -1, 0, 0
            end
        else
            -- Discard the log if there are no annotations.
            return -1, 0, 0
        end
        -- Return the modified log.
        return 2, timestamp, record
    end

  • Environment name and version (e.g. Kubernetes? What version?): vanilla kubernetes v1.30.6
  • Server type and version: VM on libvirt
  • Operating System and version: CentOS Stream 9, kernel 6.10.10-1.el9.elrepo.x86_64
  • Filters and plugins: In config.

Additional context

The issue does not occur frequently. It happens a few times per week on one of over 400 instances. It might occur more often, but we do not notice it due to the difficulty of detecting the problem.
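
One way to catch a stalled instance automatically is to sample the built-in Prometheus endpoint twice and check whether the Elasticsearch output counters still advance. This is only a rough sketch: it assumes the HTTP server is reachable on port 2020 (adjust to the chart's metricsPort) and that a shell with curl, grep and awk is available, which the stock distroless image does not provide.

# Sample the es.* output byte counters twice; no growth over the interval on a node
# that normally produces logs suggests the pipeline is stuck.
before=$(curl -s http://localhost:2020/api/v1/metrics/prometheus \
  | grep '^fluentbit_output_proc_bytes_total{name="es\.' \
  | awk '{sum+=$2} END {print sum+0}')
sleep 120
after=$(curl -s http://localhost:2020/api/v1/metrics/prometheus \
  | grep '^fluentbit_output_proc_bytes_total{name="es\.' \
  | awk '{sum+=$2} END {print sum+0}')
if [ "$after" -le "$before" ]; then
  echo "fluent-bit looks stalled: es output bytes did not grow ($before -> $after)"
fi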

nikitamishagin avatar Feb 07 '25 17:02 nikitamishagin

I can confirm that I am experiencing the same issue with the same configuration and versions. The behavior is identical to what has been reported.

Snak3GMS avatar Feb 07 '25 17:02 Snak3GMS

@Snak3GMS @nikitamishagin any logs that you can attach?

The behavior isn't an exact match, but these might all be related: https://github.com/fluent/fluent-bit/issues/9917 https://github.com/fluent/fluent-bit/issues/9906

They all exhibit hanging behavior.

pawel-lmcb avatar Feb 07 '25 22:02 pawel-lmcb

@pawel-lmcb

  1. strace
[pid    18] 00:12:40.898547 <... restart_syscall resumed>) = 0 <0.891804>
[pid    18] 00:12:40.898699 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid     1] 00:12:41.064008 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000104>
[pid     1] 00:12:41.064069 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid    18] 00:12:41.898923 <... clock_nanosleep resumed>0x7f2a635fb760) = 0 <1.000146>
[pid    18] 00:12:41.899006 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid     1] 00:12:42.064434 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000231>
[pid     1] 00:12:42.064566 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid    16] 00:12:42.152983 <... epoll_wait resumed>[{EPOLLIN, {u32=1684037792, u64=139820049391776}}], 64, -1) = 1 <1.498995>
[pid    15] 00:12:42.153083 <... epoll_wait resumed>[{EPOLLIN, {u32=1688232096, u64=139820053586080}}], 64, -1) = 1 <1.499113>
[pid    14] 00:12:42.153209 <... epoll_wait resumed>[{EPOLLIN, {u32=1692426400, u64=139820057780384}}], 64, -1) = 1 <1.499217>
[pid    13] 00:12:42.153285 <... epoll_wait resumed>[{EPOLLIN, {u32=1696620704, u64=139820061974688}}], 64, -1) = 1 <1.499286>
[pid    12] 00:12:42.153319 <... epoll_wait resumed>[{EPOLLIN, {u32=1742758048, u64=139820108112032}}], 64, -1) = 1 <1.499237>
[pid    11] 00:12:42.153364 <... epoll_wait resumed>[{EPOLLIN, {u32=1746952352, u64=139820112306336}}], 64, -1) = 1 <1.499269>
[pid    10] 00:12:42.153389 <... epoll_wait resumed>[{EPOLLIN, {u32=1751146656, u64=139820116500640}}], 64, -1) = 1 <1.499238>
[pid     9] 00:12:42.153414 <... epoll_wait resumed>[{EPOLLIN, {u32=1755340960, u64=139820120694944}}], 64, -1) = 1 <1.499250>
[pid    16] 00:12:42.153447 read(184,  <unfinished ...>
[pid    15] 00:12:42.153461 read(180,  <unfinished ...>
[pid    14] 00:12:42.153474 read(176,  <unfinished ...>
[pid    13] 00:12:42.153501 read(172,  <unfinished ...>
[pid    12] 00:12:42.153515 read(135,  <unfinished ...>
[pid    11] 00:12:42.153534 read(137,  <unfinished ...>
[pid    10] 00:12:42.153548 read(136,  <unfinished ...>
[pid    16] 00:12:42.153574 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000111>
[pid    15] 00:12:42.153629 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000158>
[pid    14] 00:12:42.153650 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000167>
[pid    13] 00:12:42.153669 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000160>
[pid    12] 00:12:42.153690 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000166>
[pid    11] 00:12:42.153712 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000168>
[pid     9] 00:12:42.153734 read(134,  <unfinished ...>
[pid    16] 00:12:42.153754 epoll_wait(161,  <unfinished ...>
[pid    15] 00:12:42.153803 epoll_wait(158,  <unfinished ...>
[pid    14] 00:12:42.153828 epoll_wait(149,  <unfinished ...>
[pid    13] 00:12:42.153851 epoll_wait(140,  <unfinished ...>
[pid    12] 00:12:42.153871 epoll_wait(121,  <unfinished ...>
[pid    11] 00:12:42.153893 epoll_wait(118,  <unfinished ...>
[pid    10] 00:12:42.153908 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000349>
[pid    16] 00:12:42.153966 <... epoll_wait resumed>[], 64, 0) = 0 <0.000189>
[pid    15] 00:12:42.153986 <... epoll_wait resumed>[], 64, 0) = 0 <0.000168>
[pid    14] 00:12:42.154003 <... epoll_wait resumed>[], 64, 0) = 0 <0.000159>
[pid    13] 00:12:42.154020 <... epoll_wait resumed>[], 64, 0) = 0 <0.000157>
[pid    12] 00:12:42.154037 <... epoll_wait resumed>[], 64, 0) = 0 <0.000152>
[pid    11] 00:12:42.154055 <... epoll_wait resumed>[], 64, 0) = 0 <0.000153>
[pid     9] 00:12:42.154072 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000329>
[pid    16] 00:12:42.154126 epoll_wait(161,  <unfinished ...>
[pid    15] 00:12:42.154161 epoll_wait(158,  <unfinished ...>
[pid    14] 00:12:42.154176 epoll_wait(149,  <unfinished ...>
[pid    13] 00:12:42.154188 epoll_wait(140,  <unfinished ...>
[pid    12] 00:12:42.154203 epoll_wait(121,  <unfinished ...>
[pid    11] 00:12:42.154216 epoll_wait(118,  <unfinished ...>
[pid    10] 00:12:42.154229 epoll_wait(109,  <unfinished ...>
[pid     9] 00:12:42.154242 epoll_wait(106,  <unfinished ...>
[pid    10] 00:12:42.154258 <... epoll_wait resumed>[], 64, 0) = 0 <0.000020>
[pid     9] 00:12:42.154278 <... epoll_wait resumed>[], 64, 0) = 0 <0.000027>
[pid    10] 00:12:42.154297 epoll_wait(109,  <unfinished ...>
[pid     9] 00:12:42.154313 epoll_wait(106,  <unfinished ...>
[pid    19] 00:12:42.652866 <... epoll_wait resumed>[{EPOLLIN, {u32=1644290128, u64=139820009644112}}], 256, -1) = 1 <2.646141>
[pid    19] 00:12:42.652957 read(202, "\1\0\0\0\0\0\0\0", 8) = 8 <0.000025>
[pid    19] 00:12:42.653034 epoll_wait(196,  <unfinished ...>
[pid    18] 00:12:42.899288 <... clock_nanosleep resumed>0x7f2a635fb760) = 0 <1.000156>
[pid    18] 00:12:42.899357 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid     1] 00:12:43.064814 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000198>
[pid     1] 00:12:43.064876 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid    15] 00:12:43.652891 <... epoll_wait resumed>[{EPOLLIN, {u32=1688232096, u64=139820053586080}}], 64, -1) = 1 <1.498676>
[pid    13] 00:12:43.653018 <... epoll_wait resumed>[{EPOLLIN, {u32=1696620704, u64=139820061974688}}], 64, -1) = 1 <1.498813>
[pid    11] 00:12:43.653115 <... epoll_wait resumed>[{EPOLLIN, {u32=1746952352, u64=139820112306336}}], 64, -1) = 1 <1.498886>
[pid     9] 00:12:43.653176 <... epoll_wait resumed>[{EPOLLIN, {u32=1755340960, u64=139820120694944}}], 64, -1) = 1 <1.498851>
[pid    16] 00:12:43.653209 <... epoll_wait resumed>[{EPOLLIN, {u32=1684037792, u64=139820049391776}}], 64, -1) = 1 <1.499074>
[pid    15] 00:12:43.653234 read(180,  <unfinished ...>
[pid    14] 00:12:43.653252 <... epoll_wait resumed>[{EPOLLIN, {u32=1692426400, u64=139820057780384}}], 64, -1) = 1 <1.499068>
[pid    13] 00:12:43.653276 read(172,  <unfinished ...>
[pid    12] 00:12:43.653292 <... epoll_wait resumed>[{EPOLLIN, {u32=1742758048, u64=139820108112032}}], 64, -1) = 1 <1.499079>
[pid    11] 00:12:43.653315 read(137,  <unfinished ...>
[pid    10] 00:12:43.653328 <... epoll_wait resumed>[{EPOLLIN, {u32=1751146656, u64=139820116500640}}], 64, -1) = 1 <1.499021>
[pid    16] 00:12:43.653356 read(184,  <unfinished ...>
[pid    15] 00:12:43.653372 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000127>
[pid    14] 00:12:43.653395 read(176,  <unfinished ...>
[pid    13] 00:12:43.653409 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000124>
[pid    11] 00:12:43.653429 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000105>
[pid     9] 00:12:43.653450 read(134,  <unfinished ...>
[pid    16] 00:12:43.653474 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000108>
[pid    15] 00:12:43.653504 epoll_wait(158,  <unfinished ...>
[pid    14] 00:12:43.653567 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000161>
[pid    13] 00:12:43.653590 epoll_wait(140,  <unfinished ...>
[pid    12] 00:12:43.653603 read(135,  <unfinished ...>
[pid    11] 00:12:43.653617 epoll_wait(118,  <unfinished ...>
[pid    10] 00:12:43.653631 read(136,  <unfinished ...>
[pid     9] 00:12:43.653641 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000182>
[pid    16] 00:12:43.653672 epoll_wait(161,  <unfinished ...>
[pid    15] 00:12:43.653685 <... epoll_wait resumed>[], 64, 0) = 0 <0.000171>
[pid    14] 00:12:43.653702 epoll_wait(149,  <unfinished ...>
[pid    13] 00:12:43.653717 <... epoll_wait resumed>[], 64, 0) = 0 <0.000118>
[pid    12] 00:12:43.653732 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000121>
[pid    11] 00:12:43.653754 <... epoll_wait resumed>[], 64, 0) = 0 <0.000127>
[pid    10] 00:12:43.653801 <... read resumed>"\1\0\0\0\0\0\0\0", 8) = 8 <0.000162>
[pid     9] 00:12:43.653825 epoll_wait(106,  <unfinished ...>
[pid    16] 00:12:43.653848 <... epoll_wait resumed>[], 64, 0) = 0 <0.000167>
[pid    15] 00:12:43.653865 epoll_wait(158,  <unfinished ...>
[pid    14] 00:12:43.653878 <... epoll_wait resumed>[], 64, 0) = 0 <0.000166>
[pid    13] 00:12:43.653894 epoll_wait(140,  <unfinished ...>
[pid    12] 00:12:43.653908 epoll_wait(121,  <unfinished ...>
[pid    11] 00:12:43.653922 epoll_wait(118,  <unfinished ...>
[pid    10] 00:12:43.653938 epoll_wait(109,  <unfinished ...>
[pid     9] 00:12:43.653952 <... epoll_wait resumed>[], 64, 0) = 0 <0.000118>
[pid    16] 00:12:43.653974 epoll_wait(161,  <unfinished ...>
[pid    14] 00:12:43.653986 epoll_wait(149,  <unfinished ...>
[pid    12] 00:12:43.653998 <... epoll_wait resumed>[], 64, 0) = 0 <0.000081>
[pid    10] 00:12:43.654015 <... epoll_wait resumed>[], 64, 0) = 0 <0.000069>
[pid    12] 00:12:43.654038 epoll_wait(121,  <unfinished ...>
[pid     9] 00:12:43.654051 epoll_wait(106,  <unfinished ...>
[pid    10] 00:12:43.654067 epoll_wait(109,  <unfinished ...>
[pid    18] 00:12:43.899534 <... clock_nanosleep resumed>0x7f2a635fb760) = 0 <1.000131>
[pid    18] 00:12:43.899609 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid     1] 00:12:44.065226 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000236>
[pid     1] 00:12:44.065374 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid    19] 00:12:44.241955 <... epoll_wait resumed>[{EPOLLIN, {u32=1644212352, u64=139820009566336}}], 256, -1) = 1 <1.588873>
[pid    19] 00:12:44.242051 accept4(199, {sa_family=AF_INET, sin_port=htons(33386), sin_addr=inet_addr("10.99.10.35")}, [16], SOCK_CLOEXEC|SOCK_NONBLOCK) = 46 <0.000034>
[pid    19] 00:12:44.242162 epoll_ctl(196, EPOLL_CTL_ADD, 46, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1644781568, u64=139820010135552}}) = 0 <0.000017>
[pid    19] 00:12:44.242232 epoll_wait(196, [{EPOLLIN, {u32=1644781568, u64=139820010135552}}], 256, -1) = 1 <0.000014>
[pid    19] 00:12:44.242305 recvfrom(46, "GET /api/v1/metrics/prometheus H"..., 4096, 0, NULL, NULL) = 1733 <0.000017>
[pid    19] 00:12:44.242478 newfstatat(AT_FDCWD, "/api/v1/metrics/prometheus", 0x7f2a62bfafd0, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory) <0.000034>
[pid    19] 00:12:44.242809 writev(46, [{iov_base="HTTP/1.1 200 OK\r\n", iov_len=17}, {iov_base="Server: Monkey/1.7.2\r\nDate: Sat,"..., iov_len=59}, {iov_base="Transfer-Encoding: chunked\r\n", iov_len=28}], 3) = 104 <0.000050>
[pid    19] 00:12:44.242926 writev(46, [{iov_base="Content-Type: text/plain; versio"..., iov_len=41}, {iov_base="\r\n", iov_len=2}], 2) = 43 <0.000034>
[pid    19] 00:12:44.243018 sendto(46, "2255\r\n", 6, 0, NULL, 0) = 6 <0.000054>
[pid    19] 00:12:44.243136 epoll_ctl(196, EPOLL_CTL_DEL, 46, NULL) = 0 <0.000019>
[pid    19] 00:12:44.243221 sendto(46, "# HELP fluentbit_filter_add_reco"..., 8789, 0, NULL, 0) = 8789 <0.000075>
[pid    19] 00:12:44.243357 sendto(46, "\r\n", 2, 0, NULL, 0) = 2 <0.000019>
[pid    19] 00:12:44.243430 epoll_ctl(196, EPOLL_CTL_ADD, 46, {EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1644781568, u64=139820010135552}}) = 0 <0.000014>
[pid    19] 00:12:44.243504 epoll_wait(196, [{EPOLLOUT, {u32=1644781568, u64=139820010135552}}], 256, -1) = 1 <0.000016>
[pid    19] 00:12:44.243588 epoll_ctl(196, EPOLL_CTL_DEL, 46, NULL) = 0 <0.000045>
[pid    19] 00:12:44.243699 sendto(46, "0\r\n\r\n", 5, 0, NULL, 0) = 5 <0.000015>
[pid    19] 00:12:44.243811 epoll_ctl(196, EPOLL_CTL_ADD, 46, {EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1644781568, u64=139820010135552}}) = 0 <0.000021>
[pid    19] 00:12:44.243883 epoll_wait(196, [{EPOLLOUT, {u32=1644781568, u64=139820010135552}}], 256, -1) = 1 <0.000029>
[pid    19] 00:12:44.243958 epoll_ctl(196, EPOLL_CTL_DEL, 46, NULL) = 0 <0.000015>
[pid    19] 00:12:44.244051 epoll_ctl(196, EPOLL_CTL_ADD, 46, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1644781568, u64=139820010135552}}) = 0 <0.000015>
[pid    19] 00:12:44.244122 epoll_wait(196,  <unfinished ...>
[pid    18] 00:12:44.899854 <... clock_nanosleep resumed>0x7f2a635fb760) = 0 <1.000158>
[pid    18] 00:12:44.899931 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid     1] 00:12:45.065666 <... clock_nanosleep resumed>0x7fff3e6c4b10) = 0 <1.000178>
[pid     1] 00:12:45.065735 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid    16] 00:12:45.154860 <... epoll_wait resumed>[{EPOLLIN, {u32=1684037792, u64=139820049391776}}], 64, -1) = 1 <1.500826>
[pid    15] 00:12:45.154910 <... epoll_wait resumed>[{EPOLLIN, {u32=1688232096, u64=139820053586080}}], 64, -1) = 1 <1.501036>
[pid    13] 00:12:45.154936 <... epoll_wait resumed>[{EPOLLIN, {u32=1696620704, u64=139820061974688}}], 64, -1) = 1 <1.501033>
[pid    12] 00:12:45.154972 <... epoll_wait resumed>[{EPOLLIN, {u32=1742758048, u64=139820108112032}}], 64, -1) = 1 <1.500919>
[pid    11] 00:12:45.154999 <... epoll_wait resumed>[{EPOLLIN, {u32=1746952352, u64=139820112306336}}], 64, -1) = 1 <1.501068>
[pid    10] 00:12:45.155025 <... epoll_wait resumed>[{EPOLLIN, {u32=1751146656, u64=139820116500640}}], 64, -1) = 1 <1.500950>
[pid     9] 00:12:45.155050 <... epoll_wait resumed>[{EPOLLIN, {u32=1755340960, u64=139820120694944}}], 64, -1) = 1 <1.500989>
  2. Curl request to /api/v1/storage
{
  "storage_layer": {
    "chunks": {
      "total_chunks": 4,
      "mem_chunks": 4,
      "fs_chunks": 0,
      "fs_chunks_up": 0,
      "fs_chunks_down": 0
    }
  },
  "input_chunks": {
    "systemd.0": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "tail.1": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "tail.2": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "storage_backlog.3": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "emitter_for_rewrite_tag.4": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "9.5M"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "emitter_for_rewrite_tag.5": {
      "status": {
        "overlimit": false,
        "mem_size": "22.2K",
        "mem_limit": "9.5M"
      },
      "chunks": {
        "total": 4,
        "up": 4,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "emitter_for_rewrite_tag.6": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "9.5M"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    }
  }
}
  3. Pod describe (current state)
    State:          Running
    Started:        Mon, 30 Dec 2024 03:39:27 +0300
    Ready:          True
    Restart Count:  2
    Limits:
        cpu:        200m
        memory:     128Mi
    Requests:
        cpu:        200m
        memory:     128Mi
    Conditions:
        Type                        Status
        PodReadyToStartContainers   True
        Initialized                 True
        Ready                       True
        ContainersReady             True
        PodScheduled                True
  4. Last logs (stuck about a month ago)
[2025/01/09 15:24:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=337644498 watch_fd=4787 name=/var/log/containers/xxx-5784f8cbcd-2txq5_xxx_migrations-4b50a78e51a4e07aa11245e624fc4e67bfd0d4b34edb965388eb0eaf05c2dca5.log
[2025/01/09 15:24:28] [ warn] [engine] failed to flush chunk '1-1736436267.677796148.flb', retry in 9 seconds: task_id=7, input=emitter_for_rewrite_tag.6 > output=es.3 (out_id=3)
[2025/01/09 15:24:37] [error] [engine] chunk '1-1736436267.677796148.flb' cannot be retried: task_id=7, input=emitter_for_rewrite_tag.6 > output=es.3
[2025/01/09 15:25:26] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=433188932 watch_fd=4506
[2025/01/09 15:25:27] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=370864192 watch_fd=4775
[2025/01/09 15:25:28] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=332235753 watch_fd=4509
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=296558453 watch_fd=4756
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=449752668 watch_fd=4500
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=386694775 watch_fd=4748
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=72887180 watch_fd=4542
[2025/01/09 15:25:29] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=5697947 watch_fd=4383
[2025/01/09 15:27:27] [ info] [filter:kubernetes:kubernetes.2]  token updated
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=391506156 watch_fd=4788 name=/var/log/containers/xxx-b57647789-gfbvk_xxx-static-dc5b651c143f9e36af9c19da6cde65e55d252417df909c551aad241f4aef5a64.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=223083475 watch_fd=4789 name=/var/log/containers/xxx-7b8c7bd55f-bn7ch_xxx-server-6520de96062c8f54116157f179ab0c8460c31ac6da042a33fd5f0cb2c59b57f7.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=400043213 watch_fd=4790 name=/var/log/containers/xxx-bob-7b8c7bd55f-bn7ch_xxx-xstatic-2a429bca73887c93233634ca80462272fc2669bb6cdf95da84c687d6dbab4ff9.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=244744155 watch_fd=4791 name=/var/log/containers/xxx-bob-7b8c7bd55f-rqxh2_xxx-xnews-server-5c8fc7ae42b6924bb0c2bfe9ed9ea4c4f82d46907696bfd9262f99f042b0f0bf.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=386711596 watch_fd=4792 name=/var/log/containers/xxx-bob-7b8c7bd55f-rqxh2_xxx-xstatic-e0ba58f5aca4b17063df13b4efd2018ecad2f896fdda82a612fb5875b6f01b22.log
[2025/01/09 15:27:27] [ info] [input:tail:tail.1] inotify_fs_add(): inode=213421478 watch_fd=4793 name=/var/log/containers/xxx-bob-b57647789-gfbvk_xxx-xserver-1c25bac499bfecc754d112857ec94805ced64f35db604af02c1d6a8c5c9b6423.log
[2025/01/09 15:27:28] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=391506156 watch_fd=4788
[2025/01/09 15:27:28] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=213421478 watch_fd=4793

Snak3GMS avatar Feb 08 '25 00:02 Snak3GMS

These seem similar to #9903 as well. Could you also check the CPU usage when fluent-bit hangs?

AzureLeMoon avatar Feb 09 '25 18:02 AzureLeMoon

@AzureLeMoon, OK. I will record the CPU usage and send it when the problem recurs.

nikitamishagin avatar Feb 10 '25 08:02 nikitamishagin

@pawel-lmcb, there is nothing unusual in the logs. They look something like this:

[2025/02/10 07:28:51] [ warn] [engine] failed to flush chunk '1-1739172530.835782756.flb', retry in 7 seconds: task_id=3, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 07:28:58] [ info] [engine] flush chunk '1-1739172530.835782756.flb' succeeded at retry 1: task_id=3, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 08:00:46] [ warn] [engine] failed to flush chunk '1-1739174446.210853217.flb', retry in 8 seconds: task_id=0, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 08:00:54] [ info] [engine] flush chunk '1-1739174446.210853217.flb' succeeded at retry 1: task_id=0, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 08:04:53] [ warn] [engine] failed to flush chunk '1-1739174693.685714426.flb', retry in 10 seconds: task_id=3, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)
[2025/02/10 08:05:03] [ info] [engine] flush chunk '1-1739174693.685714426.flb' succeeded at retry 1: task_id=3, input=emitter_for_rewrite_tag.5 > output=es.2 (out_id=2)

And when the problem occurs, the logging just stops.

nikitamishagin avatar Feb 10 '25 08:02 nikitamishagin

@AzureLeMoon Below is the CPU usage from the same host where the fluent-bit pod logs were taken. The fluent-bit pod on this host has been stuck since January 9th.

[xxx@xxx containers]$ mpstat 1 5
Linux 6.10.10-1.el9.elrepo.x86_64 (xxx) 	02/10/2025 	_x86_64_	(16 CPU)

12:02:14 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
12:02:15 PM  all    8.98    0.00    1.32    0.00    0.00    0.06    0.00    0.00    0.00   89.64
12:02:16 PM  all    7.76    0.00    0.94    0.00    0.00    0.25    0.00    0.00    0.00   91.05
12:02:17 PM  all    5.47    0.00    0.82    0.00    0.00    0.06    0.00    0.00    0.00   93.65
12:02:18 PM  all    6.03    0.00    1.38    0.00    0.00    0.06    0.00    0.00    0.00   92.52
12:02:19 PM  all    7.83    0.00    2.19    0.00    0.00    0.06    0.00    0.00    0.00   89.91
Average:     all    7.22    0.00    1.33    0.00    0.00    0.10    0.00    0.00    0.00   91.35

kubectl top pod fluent-bit-5gx94 --namespace kube-system
NAME               CPU(cores)   MEMORY(bytes)
fluent-bit-5gx94   1m           106Mi

Snak3GMS avatar Feb 10 '25 09:02 Snak3GMS

Bumping this as well; it looks similar: https://github.com/fluent/fluent-bit/issues/9557

uristernik avatar Feb 10 '25 18:02 uristernik

Bumping this as well; it looks similar: https://github.com/fluent/fluent-bit/issues/9557

This appears to be the exact problem I have, as described in #9903.

AzureLeMoon avatar Feb 10 '25 18:02 AzureLeMoon

same here

dyipon avatar Feb 28 '25 19:02 dyipon

@AzureLeMoon, I've observed the CPU consumption of hung containers. It tends toward zero (<1m CPU).

nikitamishagin avatar Mar 05 '25 11:03 nikitamishagin

Also, on one of the hanging containers, strace shows this:

clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffd98404b70) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffd98404b70) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffd98404b70) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffd98404b70) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffd98404b70) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffd98404b70) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7ffd98404b70) = 0

These lines then repeat endlessly.
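
For completeness, a full thread backtrace of the hung process can be captured from the node with gdb; a minimal sketch, assuming gdb is installed on the host and <pid> is the fluent-bit process id:

# Dump a backtrace of every thread in the hung process; this should show where the
# non-sleeping threads are blocked.
gdb -p <pid> --batch -ex 'thread apply all bt'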

nikitamishagin avatar Mar 05 '25 16:03 nikitamishagin

Similar strace output to the one in https://github.com/fluent/fluent-bit/issues/9917

yhzs8 avatar Mar 14 '25 09:03 yhzs8

Running fluent-bit v2.2.3, I cannot reproduce this issue.

yhzs8 avatar Mar 15 '25 17:03 yhzs8

Exactly the same here, with a similar config and fluent-bit v3.2.5. Some pods of our daemonset got stuck and stopped collecting new logs.

CPU and memory consumption are below their limits

(screenshot attached)

We noticed because there were some pods whose logs we could not find in Elasticsearch. As in @nikitamishagin's issue, the pod is Ready and has no restarts despite having its liveness probe enabled (we deploy fluent-bit using the fluent-operator v3.3.0 helm chart).

Our logs seem pretty normal:

[2025/04/16 09:27:24] [ info] [input] pausing tail.1
[2025/04/16 09:27:25] [ info] [input] pausing systemd.0
[2025/04/16 09:27:25] [ info] [input] pausing tail.1
[2025/04/16 09:27:26] [ info] [input] pausing systemd.0
[2025/04/16 09:27:26] [ info] [input] pausing tail.1
[2025/04/16 09:27:27] [ info] [input] pausing systemd.0
[2025/04/16 09:27:27] [ info] [input] pausing tail.1
[2025/04/16 09:27:28] [ info] [input] pausing systemd.0
[2025/04/16 09:27:28] [ info] [input] pausing tail.1
[2025/04/16 09:27:29] [ info] [input] pausing systemd.0
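
Those repeated pausing messages usually mean the inputs hit a buffer limit because nothing is being flushed downstream. A quick way to confirm is the storage endpoint already used earlier in this thread; a sketch, assuming the HTTP server on port 2020 and a shell with curl (e.g. from a debug container):

# "overlimit": true or chunks stuck in "busy" for tail.1 / systemd.0 would explain
# why the engine keeps pausing those inputs.
curl -s http://localhost:2020/api/v1/storage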

joseeetm avatar Apr 16 '25 09:04 joseeetm

We have the same issue. Any news about this?

R-Studio avatar May 19 '25 11:05 R-Studio

Would it be possible for someone in this thread to provide a smaller repro case or help me in the reproduction process?

leonardo-albertovich avatar May 19 '25 12:05 leonardo-albertovich

Would it be possible for someone in this thread to provide a smaller repro case or help me in the reproduction process?

Hey @leonardo-albertovich, I would be happy to help. Could you please specify what is required?

AzureLeMoon avatar May 19 '25 22:05 AzureLeMoon

In order to analyze the issue I need to reproduce it, so I need to find a repro case that can be minimized. The original one is a bit convoluted, so if anyone else has a different configuration that reproduces the issue (even if not frequently) I'd like to take a look at it.

Especially because I can use that to find common ground between reports and narrow down both the settings I use to try to reproduce the issue and the parts of the system I need to inspect.

leonardo-albertovich avatar May 19 '25 22:05 leonardo-albertovich

Would it be possible for someone in this thread to provide a smaller repro case or help me in the reproduction process?

Sorry, but that's all the information I have.

The problem reproduced only occasionally and for no apparent reason. I think it may be linked to the Elasticsearch output module. Also, you should generate enough workload that fluent-bit continuously sends data to Elasticsearch.
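
For anyone building a repro environment, a minimal synthetic workload is just a pod that writes JSON lines to stdout fast enough to keep the es outputs busy; the pod also needs the example.com/es-index-project and example.com/es-index-date annotations, otherwise filters.lua drops its records. A rough sketch of such a logger loop (names and rate are arbitrary):

# Emit JSON lines to stdout so the tail input picks them up from /var/log/containers/.
i=0
while true; do
  i=$((i+1))
  echo "{\"message\":\"synthetic log line $i\",\"level\":\"info\"}"
  sleep 0.01
done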

nikitamishagin avatar May 19 '25 23:05 nikitamishagin

We cannot reproduce the issue on demand, but we ran into it when the OpenShift/Kubernetes API was not available. Inspired by this GitHub issue, we created our own livenessProbe as a workaround:

      livenessProbe:
        exec:
          # The following check monitors the Fluent Bit metric "fluentbit_input_records_total{name="tail.0"}" to ensure log intake is progressing as expected.
          # - It checks (and initializes, if needed) a reference file for the last recorded input count.
          # - It fetches the current count from the Fluent Bit Prometheus endpoint.
          # - If the new count is less than or equal to the previous value, it reports an error
          #   and exits (which may indicate issues with log collection or Kubernetes API).
          # - If the count is valid (>=0), it updates the reference file for the next check.
          command:
            - /bin/bash
            - -exc
            - >
              [ -e ./lastInputRecordsTotal ] || echo 0 > ./lastInputRecordsTotal;
              function getInputRecordsTotal() { 
                set -o pipefail; curl -s --connect-timeout 5 --retry 3 -f http://localhost:2020/api/v1/metrics/prometheus | grep '^fluentbit_input_records_total{name="tail.0"}' | awk '{print $2}' || echo 0;
              };
              T=$(getInputRecordsTotal); test -e ./lastInputRecordsTotal && { test $(cat ./lastInputRecordsTotal) -ge $T && { echo "The input records is too low, maybe there are some issue with the Kubernetes API (current input records = $T)"; exit 1;} ;}; test $T -ge 0 && { echo "$T" >| ./lastInputRecordsTotal; } || true;
        initialDelaySeconds: 300
        periodSeconds: 60
        successThreshold: 1
        failureThreshold: 5
        timeoutSeconds: 5

R-Studio avatar May 20 '25 05:05 R-Studio

In order to analyze the issue I need to reproduce it, so I need to find a repro case that can be minimized. The original one is a bit convoluted, so if anyone else has a different configuration that reproduces the issue (even if not frequently) I'd like to take a look at it.

Especially because I can use that to find common ground between reports and narrow down both the settings I use to try to reproduce the issue and the parts of the system I need to inspect.

i made the following repo you could use to reproduce the hanging issue as described in #9903 , and i believe it should relate to this issue as well.

@leonardo-albertovich let me know if you need more data.

AzureLeMoon avatar May 20 '25 09:05 AzureLeMoon

We cannot reproduce the issue on demand, but we ran into it when the OpenShift/Kubernetes API was not available. Inspired by this GitHub issue, we created our own livenessProbe as a workaround:

  livenessProbe:
    exec:
      # The following check monitors the Fluent Bit metric "fluentbit_input_records_total{name="tail.0"}" to ensure log intake is progressing as expected.
      # - It checks (and initializes, if needed) a reference file for the last recorded input count.
      # - It fetches the current count from the Fluent Bit Prometheus endpoint.
      # - If the new count is less than or equal to the previous value, it reports an error
      #   and exits (which may indicate issues with log collection or Kubernetes API).
      # - If the count is valid (>=0), it updates the reference file for the next check.
      command:
        - /bin/bash
        - -exc
        - >
          [ -e ./lastInputRecordsTotal ] || echo 0 > ./lastInputRecordsTotal;
          function getInputRecordsTotal() { 
            set -o pipefail; curl -s --connect-timeout 5 --retry 3 -f http://localhost:2020/api/v1/metrics/prometheus | grep '^fluentbit_input_records_total{name="tail.0"}' | awk '{print $2}' || echo 0;
          };
          T=$(getInputRecordsTotal); test -e ./lastInputRecordsTotal && { test $(cat ./lastInputRecordsTotal) -ge $T && { echo "The input records is too low, maybe there are some issue with the Kubernetes API (current input records = $T)"; exit 1;} ;}; test $T -ge 0 && { echo "$T" >| ./lastInputRecordsTotal; } || true;
    initialDelaySeconds: 300
    periodSeconds: 60
    successThreshold: 1
    failureThreshold: 5
    timeoutSeconds: 5

@R-Studio Thanks for sharing. You inspired me to write a different livenessProbe with a different goal. My goal is ensuring that our output doesn't hang; your example only checks that input records are being processed. The main goal is ensuring that Fluent Bit is still forwarding logs to its configured output. I'm sharing it below:

exec:
  command:
    - /busybox/sh
    - -ec
    - |
      state=/tmp/lastOutputBytesTotal
      [ -f "$state" ] || echo 0 >"$state"

      # Sum column 2 (the value) across all lines of the metric
      total=$(
        wget -qO- --timeout=5 http://localhost:2020/api/v1/metrics/prometheus \
        | grep '^fluentbit_output_proc_bytes_total{name="es\.' \
        | awk '{sum+=$2} END{print sum+0}' || echo 0
      )

      last=$(cat "$state")
      if [ "$total" -le "$last" ]; then
        echo "Bytes procesados estancados: $total (antes $last)"
        exit 1
      fi

      echo "$total" >"$state"
initialDelaySeconds: 300
periodSeconds: 60
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 5

joseeetm avatar Jun 04 '25 13:06 joseeetm

I also seem to be running into this issue (or something similar) when using fluent-operator. Just adding some information in case it helps.

I am running as a daemonset and I am using the istio ambient service mesh. In all instances I have been able to check, it happens during startup. It appears that if it cannot successfully export in the first ~20 seconds then it gets into this stuck state. Manually restarting the pod gets it working again and it backfills logs.

I also use the opentelemetry plugin and it has the same problem of getting stuck and not sending metrics.

One thing I noticed is that the uptime value is not consistent between APIs.

http://localhost:2020/api/v2/metrics/prometheus

# HELP fluentbit_uptime Number of seconds that Fluent Bit has been running.
# TYPE fluentbit_uptime counter
fluentbit_uptime{hostname="fluent-bit-jbg6x"} 35

http://localhost:2020/api/v1/uptime

{"uptime_sec":30138,"uptime_hr":"Fluent Bit has been running:  0 day, 8 hours, 22 minutes and 18 seconds"}

I am going to try one of the livenessProbe workarounds in the meantime.

danelson avatar Jul 08 '25 15:07 danelson

@R-Studio @joseeetm I am curious what image you are using that lets you run shell commands, as the normal image is distroless. Are you using the -debug image or maybe something custom? The -debug image is very large and I can't use it in my environment.

danelson avatar Jul 10 '25 14:07 danelson

Hi @cosmo0920 and @edsiper, as Leonardo isn't part of the project anymore, would either of you be open to taking this issue forward?

I have made a procedure for reproducing this issue at this comment.

Let me know if you need any more context or want to sync up on next steps.

AzureLeMoon avatar Jul 23 '25 12:07 AzureLeMoon

Same issue here using fluent-operator; the liveness probe workaround is not possible due to the distroless images.

networkhell avatar Jul 24 '25 09:07 networkhell

Hello @danelson @networkhell. I'm using the distroless image as well, but with a little trick: an initContainer based on a busybox image copies the necessary binaries so the workaround works.

  initContainers:
  - args:
    - -c
    - cp -a /bin/sh /usr/local/bin && cp -a /bin/wget /usr/local/bin && cp -a /bin/awk
      /usr/local/bin && cp -a /bin/cat /usr/local/bin
    command:
    - sh
    image: busybox:uclibc
    imagePullPolicy: IfNotPresent
    name: busybox-share-init

My workaround solves the issue, but it's necessary to add this initContainer.

Cheers.

joseeetm avatar Jul 24 '25 10:07 joseeetm

@joseeetm thanks for the hint. Unfortunately I am using the fluent-operator, which currently has no support for a custom liveness probe.

networkhell avatar Jul 25 '25 14:07 networkhell

@joseeetm thank you. Do I assume correctly that you then also need a volume and volume mount so that this can be accessed by the main container?

So I would override values.yaml with something like

fluentbit:
  # List of volumes that can be mounted by containers belonging to the pod.
  additionalVolumes:
    - name: shared-bin
      emptyDir: {}
  # Pod volumes to mount into the container's filesystem.
  additionalVolumesMounts:
    - name: shared-bin
      mountPath: /usr/local/bin
  # initContainers configuration for Fluent Bit pods. Ref: https://kubernetes.io/docs/concepts/workloads/pods/init-container
  initContainers:
    - args:
        - -c
        - cp -a /bin/sh /usr/local/bin && cp -a /bin/wget /usr/local/bin && cp -a /bin/awk /usr/local/bin && cp -a /bin/cat /usr/local/bin
      command:
        - sh
      image: busybox:uclibc
      imagePullPolicy: IfNotPresent
      name: busybox-share-init
  livenessProbe:
    exec:
      command:
        - /busybox/sh
        - -ec
        - |
          # My custom probe

@networkhell you should be able to override the livenessProbe

danelson avatar Jul 25 '25 15:07 danelson