Fluent Bit simply stops when it reaches 2048 tasks
Bug Report
When the output HTTP server is offline for a while and Fluent Bit reaches 2048 tasks, it simply stops, printing only the message "re-schedule retry=XXXXXXXXX XXXX in the next 1 seconds". Even after the HTTP output server comes back online, it does not start sending the logs again (neither from tail nor from storage.backlog).
Even with hot_reload enabled, sending the SIGHUP signal does not reload/restart the Fluent Bit process, because the service will only shut down once all remaining tasks are flushed, but the tasks are not doing anything; they are dead.
I'm using the latest v2.2.2 version and this is my config:
service:
  flush: 5
  hot_reload: on
  http_server: on
  log_level: warn
  scheduler.cap: 300
  storage.path: /data/buffer/
  storage.max_chunks_up: 256
  storage.backlog.mem_limit: 256M
  storage.delete_irrecoverable_chunks: on

pipeline:
  inputs:
    - name: tail
      db: /data/logs.db
      refresh_interval: 5
      read_from_head: true
      buffer_max_size: 512K
      buffer_chunk_size: 256K
      static_batch_size: 256M
      storage.type: filesystem
      multiline.parser: docker, cri
      tag: <namespace>.<workload>.<container>
      tag_regex: (?<workload>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace>[^_]+)_(?<container>.+)-([a-z0-9]{64})\.log$
      path: /var/log/containers/*_kelvin-admin_*.log,/var/log/containers/*_kelvin_*.log,/var/log/containers/*_app_*.log
  filters:
    - name: record_modifier
      match: "*"
      remove_key: _p
    - name: record_modifier
      match: "*"
      remove_key: stream
    - name: modify
      match: "*"
      add: cluster ${KELVIN_CLUSTER_NAME}
    - name: modify
      match: "*"
      add: node ${KUBERNETES_NODE_NAME}
    - name: lua
      match: "*"
      call: tag_fields
      code: |
        function tag_fields(tag, timestamp, record)
          tags = {}
          for t in string.gmatch(tag, "[^%.]+") do
            table.insert(tags, t)
          end
          record["namespace"] = tags[1]
          record["pod"] = tags[2]
          if record["namespace"] == "app" then
            record["workload"] = string.gsub(tags[2], "-0$", "")
          end
          record["container"] = tags[3]
          return 1, timestamp, record
        end
    - name: lua
      match: "*"
      call: log_fields
      code: |
        function log_fields(tag, timestamp, record)
          cjson = require("cjson")
          log = record["log"]
          status, parsed_log = pcall(cjson.decode, log)
          if status and type(parsed_log) == "table" then
            if parsed_log["level"] then
              record["level"] = string.lower(parsed_log["level"])
            end
            if parsed_log["logger"] then
              record["logger"] = parsed_log["logger"]
            end
          end
          return 1, timestamp, record
        end
  outputs:
    - name: http
      match: "*"
      host: localhost
      port: 443
      workers: 5
      tls: on
      tls.verify: off
      compress: gzip
      format: json_lines
      net.keepalive: off
      retry_limit: no_limits
      net.connect_timeout: 45
      log_response_payload: false
      storage.total_limit_size: 5G
What is your ulimit setting for file descriptors?
ulimit -n
I have seen similar issues when the ulimit -n setting was 1024, which is the default on Ubuntu distributions, for example.
This issue may be related to the one described here: https://github.com/fluent/fluent-bit/issues/8503
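For what it's worth, the same value can also be read programmatically. Below is a minimal standalone C snippet (not part of Fluent Bit) that prints the process file-descriptor limits that ulimit -n reports:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* RLIMIT_NOFILE: maximum number of open file descriptors per process */
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    printf("soft limit (what ulimit -n shows): %llu\n",
           (unsigned long long) rl.rlim_cur);
    printf("hard limit:                        %llu\n",
           (unsigned long long) rl.rlim_max);
    return 0;
}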
ulimit -n: 1048576
I'm experiencing the same issue. When the output goes down for a certain time, tasks keep retrying until exactly 2048 tasks are created.
This is likely due to poor handling of upper boundary conditions of the tasks_map array defined in flb_config.h
Increasing the size from 2048 to 4096 raises the point at which my filed issue #8503 runs into trouble, allowing roughly twice as many files to be written to the filesystem before the system stalls.
diff --git a/include/fluent-bit/flb_config.h b/include/fluent-bit/flb_config.h
index 82003386d..cb5953a27 100644
--- a/include/fluent-bit/flb_config.h
+++ b/include/fluent-bit/flb_config.h
@@ -284,7 +284,7 @@ struct flb_config {
     unsigned int sched_cap;
     unsigned int sched_base;
 
-    struct flb_task_map tasks_map[2048];
+    struct flb_task_map tasks_map[4096];
 
     int dry_run;
 };
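For illustration, a minimal sketch of why any fixed-size slot table behaves this way once it fills up. The struct and function names below are invented and only mimic the idea of tasks_map; they are not the actual definitions from flb_config.h or the task sources:

#include <stddef.h>

#define FLB_TASK_MAP_ENTRIES 2048       /* the array size changed by the diff above */

/* invented slot type: a NULL task pointer marks a free entry */
struct task_slot {
    void *task;
};

static struct task_slot tasks_map[FLB_TASK_MAP_ENTRIES];

/* Scan the map for a free slot and return its index, or -1 when every
 * entry is busy. Once -1 is returned, newly arriving chunks cannot get a
 * task until one of the in-flight tasks completes and frees its slot. */
static int task_map_get_free_id(void)
{
    int i;

    for (i = 0; i < FLB_TASK_MAP_ENTRIES; i++) {
        if (tasks_map[i].task == NULL) {
            return i;
        }
    }
    return -1;    /* boundary condition hit at 2048 concurrent tasks */
}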
@sirwio I don't think #8503 is the same issue. In my case, chunks keep accumulating. The log message is different too; I see the same "re-schedule retry" message as @carlosrmendes.
I ran Fluent Bit in trace mode and still could only see this, with no other logs or information.
Our issue was originally found when using the opensearch or elasticsearch output plugins. With those two plugins, when the system runs into the problem we also see the "re-schedule retry" messages. Issue #8503 is simplified so that the maintainers and other interested parties have a simple way to reproduce it.
But you are right. The issue in #8503 is most likely due to some flaw in the logic when storage.max_chunks_up is reached. Increasing the size of the tasks_map array just makes it take a bit longer to reach the storage.max_chunks_up limit.
I debugged the issue and this is what I found:
- Fluent Bit receives chunks and assigns each arriving chunk a task, until the 2048-entry task map is full.
- Once all 2048 tasks are busy and a new chunk arrives, Fluent Bit fails in this line. In addition, it does not delete the chunk from memory (when filesystem storage is in use), so the chunk permanently occupies one of the storage.max_chunks_up slots. <- BUG (see the sketch after this list)
- As new chunks arrive, Fluent Bit keeps failing to assign them a task and fills up all the space in memory (storage.max_chunks_up is maxed out).
- When the chunks that already have a task assigned try to do their job, they first must be brought into memory. However, since the new chunks have taken all the memory, they fail endlessly with the message "re-schedule retry=XXXXXXXXX XXXX in the next x seconds".
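To make the sequence above easier to follow, here is a deliberately simplified, self-contained C model of it. It is not Fluent Bit code: the constants only mirror the 2048-entry task map and the storage.max_chunks_up 256 setting from the config above, and the counters stand in for the real chunk and task structures.

#include <stdio.h>
#include <stdbool.h>

#define TASK_MAP_SIZE  2048    /* tasks_map[] size in flb_config.h      */
#define MAX_CHUNKS_UP   256    /* storage.max_chunks_up from the config */

static int tasks_busy;         /* task slots currently in use           */
static int chunks_up;          /* chunks currently loaded in memory     */

/* A new chunk arrives: it is brought up, then a task is requested for it. */
static bool dispatch_new_chunk(void)
{
    if (chunks_up >= MAX_CHUNKS_UP) {
        /* no memory slot left: chunks that DO have a task can no longer be
         * brought up either, so delivery stalls with "re-schedule retry=..." */
        return false;
    }
    chunks_up++;                       /* chunk is now up in memory */

    if (tasks_busy >= TASK_MAP_SIZE) {
        /* BUG described above: task creation fails, but the chunk is not put
         * back down, so it permanently occupies a max_chunks_up slot. */
        return false;
    }
    tasks_busy++;
    return true;
}

int main(void)
{
    int i;
    int failed = 0;

    tasks_busy = TASK_MAP_SIZE;        /* assume all 2048 tasks are already busy */
    for (i = 0; i < 1000; i++) {
        if (!dispatch_new_chunk()) {
            failed++;
        }
    }
    printf("chunks stuck up in memory: %d/%d, failed dispatches: %d\n",
           chunks_up, MAX_CHUNKS_UP, failed);
    return 0;
}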
Our initial tests show that pull request #8601 resolves the issues we have observed in #8503. More tests and corner cases still need to be run. Do note a drop in ingestion speed once the task limit is reached; that is more or less expected and does not seem to be a blocker for us.
- Fluent Bit doesn't delete the chunk from memory (when filesystem storage is in use)
Hi @seblaz, what do you think was the original rationale behind Fluent Bit's design of not removing chunks from memory while using filesystem storage, opting instead to keep them in the filesystem until they are confirmed as processed?
I think it's a bug, because according to the docs:
The active Chunks that are up are ready for delivery and the ones that are still receiving records. Any other remaining Chunk is in a down state, which means that it is only in the filesystem and won't be up in memory unless it is ready to be delivered.
According to that sentence, I think that the chunks that couldn't have a task assigned should be down since they are not ready to be delivered.
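As a hedged sketch of what that behaviour could look like (all names here are invented, not the real Fluent Bit API): the failure path would put the chunk back down, so it stops counting against storage.max_chunks_up.

#include <stdbool.h>

/* Illustrative only: if no task slot is free, put the chunk back down
 * (filesystem only) instead of leaving it up in memory, matching the
 * up/down model described in the docs quoted above. */
static bool dispatch_chunk_release_on_failure(int *chunks_up, int max_chunks_up,
                                              int *tasks_busy, int task_map_size)
{
    if (*chunks_up >= max_chunks_up) {
        return false;                  /* no memory slot available at all */
    }
    (*chunks_up)++;                    /* chunk brought up in memory */

    if (*tasks_busy >= task_map_size) {
        (*chunks_up)--;                /* put the chunk back down: it no longer
                                          counts against storage.max_chunks_up */
        return false;                  /* it stays on disk and is retried later */
    }

    (*tasks_busy)++;                   /* task assigned; chunk stays up for delivery */
    return true;
}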
I have tested PR #8601, and after the task limit is reached the CPU usage increases with each new chunk sent to Fluent Bit. I think this is because on each flush, all the chunks that don't have a task assigned are still brought up into memory and then put back down again (with the fix from the PR). This constant moving of chunks up and down in memory increases the CPU usage.
Ideally, Fluent Bit should avoid bringing a chunk into memory at all if it knows there isn't a task available. However, when I tried to implement that check before bringing the chunk into memory, the number of chunks tracked in the output plugins was calculated incorrectly, because flb_input_chunk_update_output_instances was no longer being called.
If you can accept the CPU degradation, the PR fix will be okay, but I think we need a Fluent Bit maintainer to help us fix the whole problem.
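A hedged sketch of that idea, again with invented names rather than the real internals: check for a free task slot before bringing the chunk up at all, so chunks are not cycled up and down on every flush. The trade-off noted above still applies: the per-output chunk accounting that flb_input_chunk_update_output_instances performs in the real code would then have to happen through some other path.

#include <stdbool.h>

/* Illustrative only: skipping the chunk while it is still down avoids the
 * up/down churn on every flush cycle that causes the extra CPU usage. */
static bool dispatch_chunk_without_churn(int *chunks_up, int max_chunks_up,
                                         int *tasks_busy, int task_map_size)
{
    if (*tasks_busy >= task_map_size) {
        return false;                  /* no task slot: leave the chunk down on disk */
    }
    if (*chunks_up >= max_chunks_up) {
        return false;                  /* no memory slot free either */
    }

    (*chunks_up)++;                    /* bring the chunk up only now...        */
    (*tasks_busy)++;                   /* ...and immediately attach a task to it */
    return true;
}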
@sirwio I added an additional check to the PR. So far, from my tests, it's working as expected and I don't see the CPU degradation.
The latest version of pull request #8601 resolves all the issues we have observed earlier (since 1.6). A high load via the tcp input plugin sent to the opensearch output plugin works without issues.
Hi @sirwio / @seblaz, could you share instructions on how to build the Docker image for PR #8601?