bignum too big to convert into `long long'

Open bjg2 opened this issue 2 years ago • 6 comments

Describe the bug

Flushing the buffer fails with RangeError - "bignum too big to convert into `long long'".

To Reproduce

I don't have an exact reproduction, but my configuration is below. I'm not sure which value is the bignum that is too big; I assume some records occasionally contain an integer larger than a 64-bit long long. It apparently passes through the JSON parser fine, but I'm not sure why it then fails when the mongo output flushes the buffer.

Expected behavior

Almost anything would be better behavior for me: setting -1 instead of the real value, setting the largest long long value, substituting some placeholder, or dropping that record entirely; anything except the whole chunk failing constantly. This seems to eventually cause fluentd to get stuck.
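
For illustration, one mitigation I could apply on my side would be clamping the value myself with record_transformer before the mongo output. This is only a sketch and assumes the oversized integer sits under a single known key; some_id below is just a placeholder for whatever field actually carries the value:

# Sketch only: if the hypothetical field "some_id" holds an integer at or above
# 2**63 (too big for BSON's signed 64-bit int64), replace it with -1 before output
<filter kubernetes.**>
  @type record_transformer
  enable_ruby
  auto_typecast true
  <record>
    some_id ${record["some_id"].is_a?(Integer) && record["some_id"] >= 2**63 ? -1 : record["some_id"]}
  </record>
</filter>

This only covers the positive overflow case for a single field, but it would keep the rest of the chunk flowing instead of failing the whole flush.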

Your Environment

- Fluentd version: 1.15-1
- TD Agent version: /
- Operating system: alpine linux, v3.16.0
- Kernel version: 5.10.0-0.bpo.15-amd64

Your Configuration

# Inputs from container logs
<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  exclude_path ["/var/log/containers/cilium*"]
  pos_file /var/log/fluentd.log.pos
  read_from_head
  tag kubernetes.*
  <parse>
    @type cri
  </parse>
</source>

# Merge logs split into multiple lines
<filter kubernetes.**>
  @type concat
  key message
  use_partial_cri_logtag true
  partial_cri_logtag_key logtag
  partial_cri_stream_key stream
  separator ""
</filter>

# Enriches records with Kubernetes metadata
<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

# Prettify kubernetes metadata
<filter kubernetes.**>
  @type record_transformer
  enable_ruby
  <record>
    nodeName ${record.dig("kubernetes", "host")}
    namespaceName ${record.dig("kubernetes", "namespace_name")}
    podName ${record.dig("kubernetes", "pod_name")}
    containerName ${record.dig("kubernetes", "container_name")}
    containerImage ${record.dig("kubernetes", "container_image")}
  </record>
  remove_keys docker,kubernetes
</filter>
 
# Expands inner json
<filter kubernetes.**>
  @type parser
  format json
  key_name message
  reserve_data true
  remove_key_name_field true
  emit_invalid_record_to_error false
  time_format %Y-%m-%dT%H:%M:%S.%NZ
  time_key time
  keep_time_key
</filter>

# Mongodb keys should not have dollar or a dot inside
<filter kubernetes.**>
  @type rename_key
  replace_rule1 \$ [dollar]
</filter>

# Mongodb keys should not have dollar or a dot inside
<filter kubernetes.**>
  @type rename_key
  replace_rule1 \. [dot]
</filter>

# Outputs to log db
<match kubernetes.**>
  @type mongo

  connection_string "#{ENV['MONGO_ANALYTICS_DB_HOST']}"
  collection logs

  <buffer>
    @type file
    path /var/log/file-buffer
    flush_thread_count 8
    flush_interval 3s
    chunk_limit_size 32M
    flush_mode interval
    retry_max_interval 60
    retry_forever true
  </buffer>
</match>

Your Error Log

2023-07-28 07:44:22 +0000 [warn]: #0 retry succeeded. chunk_id="6018738263cb959ce87d310203d5692c"
2023-07-28 07:44:23 +0000 [warn]: #0 failed to flush the buffer. retry_times=0 next_retry_time=2023-07-28 07:44:24 +0000 chunk="5f44a1f448d6b3feab006a95a5405527" error_class=RangeError error="bignum too big to convert into `long long'"
  2023-07-28 07:44:23 +0000 [warn]: #0 suppressed same stacktrace
2023-07-28 07:44:24 +0000 [warn]: #0 failed to flush the buffer. retry_times=1 next_retry_time=2023-07-28 07:44:27 +0000 chunk="5f44a1f448d6b3feab006a95a5405527" error_class=RangeError error="bignum too big to convert into `long long'"
  2023-07-28 07:44:24 +0000 [warn]: #0 suppressed same stacktrace
2023-07-28 07:44:26 +0000 [warn]: #0 failed to flush the buffer. retry_times=2 next_retry_time=2023-07-28 07:44:31 +0000 chunk="5f44a1f448d6b3feab006a95a5405527" error_class=RangeError error="bignum too big to convert into `long long'"
  2023-07-28 07:44:26 +0000 [warn]: #0 suppressed same stacktrace
2023-07-28 07:44:26 +0000 [warn]: #0 failed to flush the buffer. retry_times=2 next_retry_time=2023-07-28 07:44:31 +0000 chunk="6017a8f8ca2765e2e8eeb789257a6e08" error_class=RangeError error="bignum too big to convert into `long long'"
  2023-07-28 07:44:26 +0000 [warn]: #0 suppressed same stacktrace
2023-07-28 07:44:30 +0000 [warn]: #0 failed to flush the buffer. retry_times=3 next_retry_time=2023-07-28 07:44:39 +0000 chunk="6017a8f8ca2765e2e8eeb789257a6e08" error_class=RangeError error="bignum too big to convert into `long long'"
  2023-07-28 07:44:30 +0000 [warn]: #0 suppressed same stacktrace
2023-07-28 07:44:38 +0000 [warn]: #0 failed to flush the buffer. retry_times=4 next_retry_time=2023-07-28 07:44:54 +0000 chunk="6017a8f8ca2765e2e8eeb789257a6e08" error_class=RangeError error="bignum too big to convert into `long long'"
  2023-07-28 07:44:38 +0000 [warn]: #0 suppressed same stacktrace
2023-07-28 07:44:38 +0000 [warn]: #0 failed to flush the buffer. retry_times=4 next_retry_time=2023-07-28 07:44:56 +0000 chunk="5f44a1f448d6b3feab006a95a5405527" error_class=RangeError error="bignum too big to convert into `long long'"
  2023-07-28 07:44:38 +0000 [warn]: #0 suppressed same stacktrace
2023-07-28 07:44:41 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 18, namespace_cache_api_updates: 5, pod_cache_api_updates: 5, id_cache_miss: 5, namespace_cache_host_updates: 4, pod_cache_host_updates: 18
2023-07-28 07:44:56 +0000 [warn]: #0 failed to flush the buffer. retry_times=5 next_retry_time=2023-07-28 07:45:29 +0000 chunk="5f44a1f448d6b3feab006a95a5405527" error_class=RangeError error="bignum too big to convert into `long long'"
  2023-07-28 07:44:56 +0000 [warn]: #0 suppressed same stacktrace
2023-07-28 07:44:56 +0000 [warn]: #0 failed to flush the buffer. retry_times=5 next_retry_time=2023-07-28 07:45:30 +0000 chunk="6017a8f8ca2765e2e8eeb789257a6e08" error_class=RangeError error="bignum too big to convert into `long long'"
  2023-07-28 07:44:56 +0000 [warn]: #0 suppressed same stacktrace
2023-07-28 07:45:11 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 18, namespace_cache_api_updates: 5, pod_cache_api_updates: 5, id_cache_miss: 5, namespace_cache_host_updates: 4, pod_cache_host_updates: 18
2023-07-28 07:45:30 +0000 [warn]: #0 failed to flush the buffer. retry_times=6 next_retry_time=2023-07-28 07:46:32 +0000 chunk="5f0bc3088db314e9e88e1a6920da4a11" error_class=RangeError error="bignum too big to convert into `long long'"
  2023-07-28 07:45:30 +0000 [warn]: #0 suppressed same stacktrace
2023-07-28 07:45:30 +0000 [warn]: #0 failed to flush the buffer. retry_times=6 next_retry_time=2023-07-28 07:46:33 +0000 chunk="6017a8f8ca2765e2e8eeb789257a6e08" error_class=RangeError error="bignum too big to convert into `long long'"
  2023-07-28 07:45:30 +0000 [warn]: #0 suppressed same stacktrace
2023-07-28 07:45:30 +0000 [warn]: #0 failed to flush the buffer. retry_times=6 next_retry_time=2023-07-28 07:46:25 +0000 chunk="5f44a1f448d6b3feab006a95a5405527" error_class=RangeError error="bignum too big to convert into `long long'"
  2023-07-28 07:45:30 +0000 [warn]: #0 suppressed same stacktrace
2023-07-28 07:45:41 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 18, namespace_cache_api_updates: 5, pod_cache_api_updates: 5, id_cache_miss: 5, namespace_cache_host_updates: 4, pod_cache_host_updates: 18
2023-07-28 07:46:11 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 18, namespace_cache_api_updates: 5, pod_cache_api_updates: 5, id_cache_miss: 5, namespace_cache_host_updates: 4, pod_cache_host_updates: 18
2023-07-28 07:46:25 +0000 [warn]: #0 failed to flush the buffer. retry_times=7 next_retry_time=2023-07-28 07:47:27 +0000 chunk="6017a8f8ca2765e2e8eeb789257a6e08" error_class=RangeError error="bignum too big to convert into `long long'"
  2023-07-28 07:46:25 +0000 [warn]: #0 suppressed same stacktrace
2023-07-28 07:46:25 +0000 [warn]: #0 failed to flush the buffer. retry_times=7 next_retry_time=2023-07-28 07:47:25 +0000 chunk="5f44a1f448d6b3feab006a95a5405527" error_class=RangeError error="bignum too big to convert into `long long'"
  2023-07-28 07:46:25 +0000 [warn]: #0 suppressed same stacktrace
2023-07-28 07:46:41 +0000 [info]: #0 stats - namespace_cache_size: 4, pod_cache_size: 18, namespace_cache_api_updates: 5, pod_cache_api_updates: 5, id_cache_miss: 5, namespace_cache_host_updates: 4, pod_cache_host_updates: 18

Additional context

It is running on DigitalOcean Kubernetes, as a DaemonSet.

bjg2 avatar Jul 28 '23 08:07 bjg2

On top of that, I can't make sense of all the buffer settings. Even if I remove retry_forever and set a retry_timeout of 24h (86400), the problematic chunks still never get deleted, as these settings don't seem to apply per chunk. I have chunks that have been queued and failing for months, some even from last year, and they never get dropped: whenever any chunk is flushed successfully, retry_times, next_retry_time and everything else are reset to their initial state for all chunks, including the problematic ones.

bjg2 avatar Jul 28 '23 12:07 bjg2

This just happened again: fluentd got completely stuck, this time in minikube. As before, it was reporting 'bignum too big' for two chunks, and once I removed those chunks and restarted fluentd it became unstuck (a simple restart didn't help; I had to remove the chunks before restarting). The chunks in question are attached. Please advise, as I don't know what to do to mitigate this issue. logs.zip

bjg2 avatar Oct 12 '23 08:10 bjg2

This keeps happening. Any update?

bjg2 avatar Jun 13 '24 10:06 bjg2

Sorry for the late response.

Do you get these errors only for specific chunks (i.e., do the chunks that cause the error keep causing the same error repeatedly)?

On top of that, I can't make sense of all the buffer settings. Even if I remove retry_forever and set a retry_timeout of 24h (86400), the problematic chunks still never get deleted, as these settings don't seem to apply per chunk. I have chunks that have been queued and failing for months, some even from last year, and they never get dropped: whenever any chunk is flushed successfully, retry_times, next_retry_time and everything else are reset to their initial state for all chunks, including the problematic ones.

You can set retry_max_times to limit the retry count.

  • https://docs.fluentd.org/configuration/buffer-section#retries-parameters
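
As an illustrative sketch (the parameter values and paths below are examples, not recommendations), retry_max_times can be combined with a <secondary> section so that chunks which keep failing are written to local files instead of being retried forever:

<match kubernetes.**>
  @type mongo
  # ... connection_string, collection, etc. as in your current config ...
  <buffer>
    @type file
    path /var/log/file-buffer
    flush_mode interval
    flush_interval 3s
    flush_thread_count 8
    chunk_limit_size 32M
    retry_max_interval 60
    # Stop retrying a failing chunk after 10 flush attempts
    retry_max_times 10
  </buffer>
  # After repeated failures, chunks are flushed to this secondary output
  # instead of being retried against MongoDB forever
  <secondary>
    @type secondary_file
    directory /var/log/fluentd-failed-chunks
    basename failed
  </secondary>
</match>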

daipom avatar Jun 14 '24 07:06 daipom

Looks like the stacktrace was omitted.

2023-07-28 07:44:23 +0000 [warn]: #0 suppressed same stacktrace

Could you please share the stacktrace? I'd like to know which part of the code causes this error.

daipom avatar Jun 14 '24 08:06 daipom

When we had the big issue about a year ago I had all the data, but it got lost in the meantime. When the issue last happened, 3 weeks ago, I just removed the bad chunks.

Chunks that cause the error keep causing the same error forever. The only solution was to remove the chunk file and restart fluentd, since the settings mentioned above are not per chunk. Is retry_max_times a per-chunk setting? All the other settings were resetting as soon as one chunk had been uploaded successfully.

bjg2 avatar Jul 04 '24 09:07 bjg2