Fluent Bit simply stops when it reaches 2048 tasks
Bug Report
When the output HTTP server is offline for a while and Fluent Bit reaches 2048 tasks, it simply stops, printing only the message "re-schedule retry=XXXXXXXXX XXXX in the next 1 seconds". Even after the HTTP output server comes back online, it does not start sending the logs again (neither from tail nor from storage.backlog).
Even with hot_reload enabled, sending the SIGHUP signal does not reload/restart the Fluent Bit process, because the service will only shut down once all remaining tasks are flushed, but the tasks are not doing anything; they are dead.
I'm using the latest v2.2.2 version and this is my config:
service:
  flush: 5
  hot_reload: on
  http_server: on
  log_level: warn
  scheduler.cap: 300
  storage.path: /data/buffer/
  storage.max_chunks_up: 256
  storage.backlog.mem_limit: 256M
  storage.delete_irrecoverable_chunks: on

pipeline:
  inputs:
    - name: tail
      db: /data/logs.db
      refresh_interval: 5
      read_from_head: true
      buffer_max_size: 512K
      buffer_chunk_size: 256K
      static_batch_size: 256M
      storage.type: filesystem
      multiline.parser: docker, cri
      tag: <namespace>.<workload>.<container>
      tag_regex: (?<workload>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace>[^_]+)_(?<container>.+)-([a-z0-9]{64})\.log$
      path: /var/log/containers/*_kelvin-admin_*.log,/var/log/containers/*_kelvin_*.log,/var/log/containers/*_app_*.log
  filters:
    - name: record_modifier
      match: "*"
      remove_key: _p
    - name: record_modifier
      match: "*"
      remove_key: stream
    - name: modify
      match: "*"
      add: cluster ${KELVIN_CLUSTER_NAME}
    - name: modify
      match: "*"
      add: node ${KUBERNETES_NODE_NAME}
    - name: lua
      match: "*"
      call: tag_fields
      code: |
        function tag_fields(tag, timestamp, record)
          tags = {}
          for t in string.gmatch(tag, "[^%.]+") do
            table.insert(tags, t)
          end
          record["namespace"] = tags[1]
          record["pod"] = tags[2]
          if record["namespace"] == "app" then
            record["workload"] = string.gsub(tags[2], "-0$", "")
          end
          record["container"] = tags[3]
          return 1, timestamp, record
        end
    - name: lua
      match: "*"
      call: log_fields
      code: |
        function log_fields(tag, timestamp, record)
          cjson = require("cjson")
          log = record["log"]
          status, parsed_log = pcall(cjson.decode, log)
          if status and type(parsed_log) == "table" then
            if parsed_log["level"] then
              record["level"] = string.lower(parsed_log["level"])
            end
            if parsed_log["logger"] then
              record["logger"] = parsed_log["logger"]
            end
          end
          return 1, timestamp, record
        end
  outputs:
    - name: http
      match: "*"
      host: localhost
      port: 443
      workers: 5
      tls: on
      tls.verify: off
      compress: gzip
      format: json_lines
      net.keepalive: off
      retry_limit: no_limits
      net.connect_timeout: 45
      log_response_payload: false
      storage.total_limit_size: 5G
What is your ulimit setting for file descriptors?
ulimit -n
I have seen similar issues when the ulimit -n setting was 1024, which is the default on Ubuntu distributions, for example.
This issue may be related to the one described here: https://github.com/fluent/fluent-bit/issues/8503
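For what it's worth, the same value can also be read programmatically. Below is a minimal standalone C snippet (not part of Fluent Bit) that prints the process file-descriptor limits that ulimit -n reports:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* RLIMIT_NOFILE: maximum number of open file descriptors per process */
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    printf("soft limit (what ulimit -n shows): %llu\n",
           (unsigned long long) rl.rlim_cur);
    printf("hard limit:                        %llu\n",
           (unsigned long long) rl.rlim_max);
    return 0;
}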
ulimit -n: 1048576
I'm experiencing the same issue. When the output goes down for a certain time, tasks keep retrying until exactly 2048 tasks are created.
This is likely due to poor handling of upper boundary conditions of the tasks_map array defined in flb_config.h
Increasing the size from 2048 to 4096 raises the point at which my filed issue #8503 runs into trouble, allowing roughly twice as many files to be written to the filesystem before the system stalls.
diff --git a/include/fluent-bit/flb_config.h b/include/fluent-bit/flb_config.h
index 82003386d..cb5953a27 100644
--- a/include/fluent-bit/flb_config.h
+++ b/include/fluent-bit/flb_config.h
@@ -284,7 +284,7 @@ struct flb_config {
     unsigned int sched_cap;
     unsigned int sched_base;
 
-    struct flb_task_map tasks_map[2048];
+    struct flb_task_map tasks_map[4096];
 
     int dry_run;
 };
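For illustration, a minimal sketch of why any fixed-size slot table behaves this way once it fills up. The struct and function names below are invented and only mimic the idea of tasks_map; they are not the actual definitions from flb_config.h or the task sources:

#include <stddef.h>

#define FLB_TASK_MAP_ENTRIES 2048       /* the array size changed by the diff above */

/* invented slot type: a NULL task pointer marks a free entry */
struct task_slot {
    void *task;
};

static struct task_slot tasks_map[FLB_TASK_MAP_ENTRIES];

/* Scan the map for a free slot and return its index, or -1 when every
 * entry is busy. Once -1 is returned, newly arriving chunks cannot get a
 * task until one of the in-flight tasks completes and frees its slot. */
static int task_map_get_free_id(void)
{
    int i;

    for (i = 0; i < FLB_TASK_MAP_ENTRIES; i++) {
        if (tasks_map[i].task == NULL) {
            return i;
        }
    }
    return -1;    /* boundary condition hit at 2048 concurrent tasks */
}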
@sirwio I don't think #8503 is the same issue. In my case, chunks keep accumulating. The log message is different too; I see the same "re-schedule retry" message as @carlosrmendes.
I ran Fluent Bit in trace mode and still could only see this, with no other logs or information.
Our issue was originally found when using the opensearch or elasticsearch output plugins. With those two plugins, when the system runs into the problem we also see the "re-schedule retry" messages. Issue #8503 is simplified so that the maintainers and other interested parties have a simple way to reproduce it.
But you are right. The issue in #8503 is most likely due to some flaw in the logic when storage.max_chunks_up is reached. Increasing the size of the tasks_map array just makes it take a bit longer to reach the storage.max_chunks_up limit.
I debugged the issue and this is what I found:
- Fluent Bit receives chunks and assigns each arriving chunk a task, until the 2048-entry task map is full.
- Once all 2048 tasks are busy and a new chunk arrives, Fluent Bit fails in this line. In addition, it does not delete the chunk from memory (when filesystem storage is in use), so the chunk permanently occupies one of the storage.max_chunks_up slots. <- BUG (see the sketch after this list)
- As new chunks arrive, Fluent Bit keeps failing to assign them a task and fills up all the space in memory (storage.max_chunks_up is maxed out).
- When the chunks that already have a task assigned try to do their job, they first must be brought into memory. However, since the new chunks have taken all the memory, they fail endlessly with the message "re-schedule retry=XXXXXXXXX XXXX in the next x seconds".
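To make the sequence above easier to follow, here is a deliberately simplified, self-contained C model of it. It is not Fluent Bit code: the constants only mirror the 2048-entry task map and the storage.max_chunks_up 256 setting from the config above, and the counters stand in for the real chunk and task structures.

#include <stdio.h>
#include <stdbool.h>

#define TASK_MAP_SIZE  2048    /* tasks_map[] size in flb_config.h      */
#define MAX_CHUNKS_UP   256    /* storage.max_chunks_up from the config */

static int tasks_busy;         /* task slots currently in use           */
static int chunks_up;          /* chunks currently loaded in memory     */

/* A new chunk arrives: it is brought up, then a task is requested for it. */
static bool dispatch_new_chunk(void)
{
    if (chunks_up >= MAX_CHUNKS_UP) {
        /* no memory slot left: chunks that DO have a task can no longer be
         * brought up either, so delivery stalls with "re-schedule retry=..." */
        return false;
    }
    chunks_up++;                       /* chunk is now up in memory */

    if (tasks_busy >= TASK_MAP_SIZE) {
        /* BUG described above: task creation fails, but the chunk is not put
         * back down, so it permanently occupies a max_chunks_up slot. */
        return false;
    }
    tasks_busy++;
    return true;
}

int main(void)
{
    int i;
    int failed = 0;

    tasks_busy = TASK_MAP_SIZE;        /* assume all 2048 tasks are already busy */
    for (i = 0; i < 1000; i++) {
        if (!dispatch_new_chunk()) {
            failed++;
        }
    }
    printf("chunks stuck up in memory: %d/%d, failed dispatches: %d\n",
           chunks_up, MAX_CHUNKS_UP, failed);
    return 0;
}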
Our initial tests show that pull request #8601 resolves the issues we have observed in #8503. More tests and corner cases still need to be run. Do note a drop in ingestion speed once the task limit is reached; that is more or less expected and does not seem to be a blocker for us.
- Fluent Bit doesn't delete the chunk from memory (when filesystem storage is in use)
Hi @seblaz, what do you think was the original rationale behind Fluent Bit's design of not removing chunks from memory while using filesystem storage, opting instead to keep them in the filesystem until they are confirmed as processed?
I think it's a bug, because according to the docs:
The active Chunks that are up are ready for delivery and the ones that are still receiving records. Any other remaining Chunk is in a down state, which means that it is only in the filesystem and won't be up in memory unless it is ready to be delivered.
According to that sentence, I think that the chunks that couldn't have a task assigned should be down since they are not ready to be delivered.
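As a hedged sketch of what that behaviour could look like (all names here are invented, not the real Fluent Bit API): the failure path would put the chunk back down, so it stops counting against storage.max_chunks_up.

#include <stdbool.h>

/* Illustrative only: if no task slot is free, put the chunk back down
 * (filesystem only) instead of leaving it up in memory, matching the
 * up/down model described in the docs quoted above. */
static bool dispatch_chunk_release_on_failure(int *chunks_up, int max_chunks_up,
                                              int *tasks_busy, int task_map_size)
{
    if (*chunks_up >= max_chunks_up) {
        return false;                  /* no memory slot available at all */
    }
    (*chunks_up)++;                    /* chunk brought up in memory */

    if (*tasks_busy >= task_map_size) {
        (*chunks_up)--;                /* put the chunk back down: it no longer
                                          counts against storage.max_chunks_up */
        return false;                  /* it stays on disk and is retried later */
    }

    (*tasks_busy)++;                   /* task assigned; chunk stays up for delivery */
    return true;
}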
I have tested PR #8601, and after the task limit is reached the CPU usage increases with each new chunk sent to Fluent Bit. I think this is because on each flush, all the chunks that don't have a task assigned are still brought up into memory and then put back down again (with the fix from the PR). This constant moving of chunks up and down in memory increases the CPU usage.
Ideally, Fluent Bit should avoid bringing a chunk into memory at all if it knows there isn't a task available. However, when I tried to implement that check before bringing the chunk into memory, the number of chunks tracked in the output plugins was calculated incorrectly, because flb_input_chunk_update_output_instances was no longer being called.
If you can accept the CPU degradation, the PR fix will be okay, but I think we need a Fluent Bit maintainer to help us fix the whole problem.
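A hedged sketch of that idea, again with invented names rather than the real internals: check for a free task slot before bringing the chunk up at all, so chunks are not cycled up and down on every flush. The trade-off noted above still applies: the per-output chunk accounting that flb_input_chunk_update_output_instances performs in the real code would then have to happen through some other path.

#include <stdbool.h>

/* Illustrative only: skipping the chunk while it is still down avoids the
 * up/down churn on every flush cycle that causes the extra CPU usage. */
static bool dispatch_chunk_without_churn(int *chunks_up, int max_chunks_up,
                                         int *tasks_busy, int task_map_size)
{
    if (*tasks_busy >= task_map_size) {
        return false;                  /* no task slot: leave the chunk down on disk */
    }
    if (*chunks_up >= max_chunks_up) {
        return false;                  /* no memory slot free either */
    }

    (*chunks_up)++;                    /* bring the chunk up only now...        */
    (*tasks_busy)++;                   /* ...and immediately attach a task to it */
    return true;
}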
@sirwio I added an additional check to the PR. So far, from my tests, it's working as expected and I don't see the CPU degradation.
The latest version of pull request #8601 resolves all the issues we have observed earlier (since 1.6). A high load via the tcp input plugin sent to the opensearch output plugin works without issues.
Hi @sirwio / @seblaz, could you share instructions on how to build the Docker image for PR #8601?