Tons of errors: ack in response and chunk id in sent data are different chunk_id="xxx" ack="yyy\n"
Describe the bug I experienced an issue where a Fluentd forwarder is effectively DoSing its own aggregator. I see the following message in the forwarder log roughly 3 billion times: ack in response and chunk id in sent data are different chunk_id="5c11ea50501ae864ac7e1761066cb4c7" ack="XBHqgg5jcpuNc2rP8orbhA==\n".
To Reproduce None. I hadn't seen this error before, but it has popped up twice in the last 2 weeks.
Expected behavior I would expect a retry backoff, or for the chunk to simply be dropped if it can never succeed. Since the forwarder also forwards its own FLUENT_LOG events to the aggregator, the error loop ends up DoSing the aggregator.
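For illustration, this is the kind of bounded-retry setup I would expect to take effect (a sketch only: retry_max_times and <secondary> are standard Fluentd output/buffer options, and the values here are placeholders, not a confirmed fix):
<match **>
  @type forward
  require_ack_response true
  <server>
    host aggregator
    port 24224
  </server>
  <buffer>
    @type file
    path /var/log/td-agent/buffer/aggregator
    retry_timeout 6h     # give up on a chunk after 6 hours of retries
    retry_max_times 17   # ...or after this many attempts, whichever comes first
  </buffer>
  # when retries are exhausted, hand the chunk to the secondary instead of looping on it
  <secondary>
    @type null
  </secondary>
</match>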
Your Environment
- Fluentd or td-agent version:
fluentd --version or td-agent --version
1.12.1
- Operating system:
cat /etc/os-release
PRETTY_NAME="Distroless"
NAME="Debian GNU/Linux"
ID="debian"
VERSION_ID="10"
VERSION="Debian GNU/Linux 10 (buster)"
- Kernel version:
uname -r
5.4.0-1043-azure
Your Configuration
<source>
  @type http
  port 9880
</source>

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-docker.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type multi_format
    <pattern>
      format json
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
    </pattern>
    <pattern>
      format regexp
      expression /^(?<time>.+) (?<stream>stdout|stderr)( (?<logtag>.))? (?<log>.*)$/
      time_format '%Y-%m-%dT%H:%M:%S.%N%:z'
    </pattern>
  </parse>
</source>

<match fluentd.healthcheck>
  @type null
</match>

<filter kubernetes.var.log.containers.**>
  @type kubernetes_metadata
</filter>

<filter kubernetes.var.log.containers.**>
  @type record_modifier
  remove_keys logtag
</filter>

# Set tag to kubernetes.<ns>
<match kubernetes.var.log.containers.**>
  @type rewrite_tag_filter
  <rule>
    key $['kubernetes']['namespace_name']
    pattern /(.*)/
    tag kubernetes.$1
  </rule>
</match>

<filter kubernetes.**>
  @type prometheus
  <metric>
    name fluentd_input_status_num_records_total
    type counter
    desc The total number of incoming records
    <labels>
      tag ${tag}
      hostname ${hostname}
    </labels>
  </metric>
</filter>

<label @FLUENT_LOG>
  <filter fluent.*>
    @type record_modifier
    <record>
      host "#{Socket.gethostname}"
      namespace "logging"
      original_tag ${tag}
    </record>
  </filter>
  <match **>
    @type relabel
    @label @FORWARD
  </match>
</label>

<match **>
  @type relabel
  @label @FORWARD
</match>

<label @FORWARD>
  <match **>
    @type copy
    <store>
      @type forward
      require_ack_response true
      expire_dns_cache 1800
      connect_timeout 30
      keepalive true
      keepalive_timeout 600
      <buffer>
        @type file
        path /var/log/td-agent/buffer/aggregator
        chunk_limit_size 4m
        total_limit_size 32GB
        flush_interval 5s
        flush_thread_count 8
        retry_timeout 6h
      </buffer>
      <server>
        host aggregator
        port 24224
      </server>
    </store>
    <store>
      @type prometheus
      <metric>
        name fluentd_output_status_num_records_total
        type counter
        desc The total number of outgoing records
        <labels>
          tag ${tag}
          hostname ${hostname}
        </labels>
      </metric>
    </store>
  </match>
</label>

<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

<source>
  @type prometheus_output_monitor
  interval 30
  <labels>
    hostname ${hostname}
  </labels>
</source>
Your Error Log
ack in response and chunk id in sent data are different chunk_id="5c11ea50501ae864ac7e1761066cb4c7" ack="XBHqgg5jcpuNc2rP8orbhA==\n"
Additional context
Hi @repeatedly, what does the error mean, and what could have caused this?
Hi @agup006, do you know what's causing this error?
Do you see any other error or warning logs?
If they are hard to find among the flood of errors, please try to mitigate the noise with ignore_repeated_log_interval or ignore_same_log_interval.
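For example, either option can be set in the <system> section of the forwarder config (a minimal sketch; the 60s interval is just a placeholder):
<system>
  # suppress consecutive identical log lines emitted within this interval
  ignore_repeated_log_interval 60s
  # suppress identical log lines within this interval even if other lines appear in between
  ignore_same_log_interval 60s
</system>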
Thanks @ashie. Unfortunately, there were just too many errors to see other error/warning logs. I will try your suggestions.
We have seen this issue randomly in both production and test environments. There is nothing in the logs that would indicate the actual root cause. In production our aggregators run v1.10.3 and there is a large number of forwarders with different versions; in test our aggregators are v1.14.1 and the forwarders v1.10.3. It is not clear which versions are actually affected by this bug.
In general this is quite a major issue, because the invalid chunk is resent basically forever and creates a large number of duplicate events. Even a single misbehaving worker thread in a forwarder can put major load on the aggregators.
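One way to limit the amplification in the meantime (a workaround sketch based on the forwarder config above, not a fix for the root cause) is to keep the forwarder's own fluent.* events local instead of relabeling them to @FORWARD:
<label @FLUENT_LOG>
  # keep Fluentd's own logs on the forwarder so an ack error loop
  # cannot feed additional traffic to the aggregator
  <match fluent.*>
    @type stdout
  </match>
</label>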
This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days
unstale
This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days
@ashie I tried your suggestion to reduce the noise, but I haven't seen this error again.
This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days
This issue was automatically closed because of stale in 30 days