Tons of errors: ack in response and chunk id in sent data are different chunk_id="xxx" ack="yyy\n"
Describe the bug I experienced an issue where a Fluentd forwarder is effectively DoSing its own aggregator. I see the following message in the forwarder log roughly 3 billion times: ack in response and chunk id in sent data are different chunk_id="5c11ea50501ae864ac7e1761066cb4c7" ack="XBHqgg5jcpuNc2rP8orbhA==\n".
To Reproduce None. I hadn't seen this error before, but it has popped up twice in the last 2 weeks.
Expected behavior I would expect a retry backoff, or for the chunk to simply be dropped if it can never succeed. Since the forwarder also forwards its own FLUENT_LOG events to the aggregator, the error loop ends up DoSing the aggregator.
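For illustration, this is the kind of bounded-retry setup I would expect to take effect (a sketch only: retry_max_times and <secondary> are standard Fluentd output/buffer options, and the values here are placeholders, not a confirmed fix):
<match **>
  @type forward
  require_ack_response true
  <server>
    host aggregator
    port 24224
  </server>
  <buffer>
    @type file
    path /var/log/td-agent/buffer/aggregator
    retry_timeout 6h     # give up on a chunk after 6 hours of retries
    retry_max_times 17   # ...or after this many attempts, whichever comes first
  </buffer>
  # when retries are exhausted, hand the chunk to the secondary instead of looping on it
  <secondary>
    @type null
  </secondary>
</match>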
Your Environment
- Fluentd or td-agent version:
fluentd --version or td-agent --version
1.12.1
- Operating system:
cat /etc/os-release
PRETTY_NAME="Distroless"
NAME="Debian GNU/Linux"
ID="debian"
VERSION_ID="10"
VERSION="Debian GNU/Linux 10 (buster)"
- Kernel version:
uname -r
5.4.0-1043-azure
Your Configuration
<source>
  @type http
  port 9880
</source>

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-docker.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type multi_format
    <pattern>
      format json
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
    </pattern>
    <pattern>
      format regexp
      expression /^(?<time>.+) (?<stream>stdout|stderr)( (?<logtag>.))? (?<log>.*)$/
      time_format '%Y-%m-%dT%H:%M:%S.%N%:z'
    </pattern>
  </parse>
</source>

<match fluentd.healthcheck>
  @type null
</match>

<filter kubernetes.var.log.containers.**>
  @type kubernetes_metadata
</filter>

<filter kubernetes.var.log.containers.**>
  @type record_modifier
  remove_keys logtag
</filter>

# Set tag to kubernetes.<ns>
<match kubernetes.var.log.containers.**>
  @type rewrite_tag_filter
  <rule>
    key $['kubernetes']['namespace_name']
    pattern /(.*)/
    tag kubernetes.$1
  </rule>
</match>

<filter kubernetes.**>
  @type prometheus
  <metric>
    name fluentd_input_status_num_records_total
    type counter
    desc The total number of incoming records
    <labels>
      tag ${tag}
      hostname ${hostname}
    </labels>
  </metric>
</filter>

<label @FLUENT_LOG>
  <filter fluent.*>
    @type record_modifier
    <record>
      host "#{Socket.gethostname}"
      namespace "logging"
      original_tag ${tag}
    </record>
  </filter>
  <match **>
    @type relabel
    @label @FORWARD
  </match>
</label>

<match **>
  @type relabel
  @label @FORWARD
</match>

<label @FORWARD>
  <match **>
    @type copy
    <store>
      @type forward
      require_ack_response true
      expire_dns_cache 1800
      connect_timeout 30
      keepalive true
      keepalive_timeout 600
      <buffer>
        @type file
        path /var/log/td-agent/buffer/aggregator
        chunk_limit_size 4m
        total_limit_size 32GB
        flush_interval 5s
        flush_thread_count 8
        retry_timeout 6h
      </buffer>
      <server>
        host aggregator
        port 24224
      </server>
    </store>
    <store>
      @type prometheus
      <metric>
        name fluentd_output_status_num_records_total
        type counter
        desc The total number of outgoing records
        <labels>
          tag ${tag}
          hostname ${hostname}
        </labels>
      </metric>
    </store>
  </match>
</label>

<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

<source>
  @type prometheus_output_monitor
  interval 30
  <labels>
    hostname ${hostname}
  </labels>
</source>
Your Error Log
ack in response and chunk id in sent data are different chunk_id="5c11ea50501ae864ac7e1761066cb4c7" ack="XBHqgg5jcpuNc2rP8orbhA==\n"
Additional context
Hi @repeatedly, what does the error mean, and what could have caused this?
Hi @agup006, do you know what's causing this error?
Do you see any other error or warning logs?
If they are hard to find among the flood of errors, please try to mitigate the noise with ignore_repeated_log_interval or ignore_same_log_interval.
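For example, either option can be set in the <system> section of the forwarder config (a minimal sketch; the 60s interval is just a placeholder):
<system>
  # suppress consecutive identical log lines emitted within this interval
  ignore_repeated_log_interval 60s
  # suppress identical log lines within this interval even if other lines appear in between
  ignore_same_log_interval 60s
</system>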
Thanks @ashie. Unfortunately, there were just too many errors to see other error/warning logs. I will try your suggestions.
We have seen this issue randomly in both production and test environments. There is nothing in the logs that would indicate the actual root cause. In production our aggregators run v1.10.3 and there is a large number of forwarders with different versions; in test our aggregators are v1.14.1 and the forwarders v1.10.3. It is not clear which versions are actually affected by this bug.
In general this is quite a major issue, because the invalid chunk is resent basically forever and creates a large number of duplicate events. Even a single misbehaving worker thread in a forwarder can put major load on the aggregators.
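One way to limit the amplification in the meantime (a workaround sketch based on the forwarder config above, not a fix for the root cause) is to keep the forwarder's own fluent.* events local instead of relabeling them to @FORWARD:
<label @FLUENT_LOG>
  # keep Fluentd's own logs on the forwarder so an ack error loop
  # cannot feed additional traffic to the aggregator
  <match fluent.*>
    @type stdout
  </match>
</label>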
This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days
unstale
This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days
@ashie I tried your suggestion to reduce the noise, but I haven't seen this error again.
This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days
This issue was automatically closed because of stale in 30 days